Social Media Mining
An Introduction
Reza Zafarani
Mohammad Ali Abbasi
Huan Liu
By permission of Cambridge University Press, this preprint is free. Users can make one hardcopy for personal use, but not for further copying or distribution (either print or electronic). Users may link freely to the book's website: http://dmml.asu.edu/smm, but may not post this preprint on other websites.
Contents

1 Introduction
   1.1 What Is Social Media Mining?
   1.2 New Challenges for Mining
   1.3 Book Overview and Reader's Guide
   1.4 Summary
   1.5 Bibliographic Notes
   1.6 Exercises

I Essentials

2 Graph Essentials
   2.1 Graph Basics
      2.1.1 Nodes
      2.1.2 Edges
      2.1.3 Degree and Degree Distribution
   2.2 Graph Representation
   2.3 Types of Graphs
   2.4 Connectivity in Graphs
   2.5 Special Graphs
      2.5.1 Trees and Forests
      2.5.2 Special Subgraphs
      2.5.3 Complete Graphs
      2.5.4 Planar Graphs
      2.5.5 Bipartite Graphs
   2.6 Graph Algorithms
   2.7 Summary
   2.8 Bibliographic Notes
   2.9 Exercises

3 Network Measures
   3.1 Centrality
      3.1.1 Degree Centrality
      3.1.2 Eigenvector Centrality
      3.1.3 Katz Centrality
      3.1.4 PageRank
      3.1.5 Betweenness Centrality
      3.1.6 Closeness Centrality
      3.1.7 Group Centrality
   3.2 Transitivity and Reciprocity
      3.2.1 Transitivity
      3.2.2 Reciprocity
   3.3 Balance and Status
   3.4 Similarity
      3.4.1 Structural Equivalence
      3.4.2 Regular Equivalence
   3.5 Summary
   3.6 Bibliographic Notes
   3.7 Exercises

4 Network Models
   4.1 Properties of Real-World Networks
      4.1.1 Degree Distribution
      4.1.2 Clustering Coefficient
      4.1.3 Average Path Length
   4.2
   4.3
   4.4
   4.5 Summary
   4.6 Bibliographic Notes
   4.7 Exercises

5

II Communities and Interactions

6 Community Analysis
   6.1 Community Detection
      6.1.1 Community Detection Algorithms
      6.1.2 Member-Based Community Detection
      6.1.3 Group-Based Community Detection
   6.2 Community Evolution
      6.2.1 How Networks Evolve
      6.2.2 Community Detection in Evolving Networks
   6.3 Community Evaluation
      6.3.1 Evaluation with Ground Truth
      6.3.2 Evaluation without Ground Truth
   6.4 Summary
   6.5 Bibliographic Notes
   6.6 Exercises

7 Information Diffusion in Social Media
   7.1 Herd Behavior
      7.1.1 Bayesian Modeling of Herd Behavior
      7.1.2 Intervention
   7.2 Information Cascades
      7.2.1 Independent Cascade Model (ICM)
      7.2.2 Maximizing the Spread of Cascades
      7.2.3 Intervention
   7.3 Diffusion of Innovations
      7.3.1 Innovation Characteristics
      7.3.2 Diffusion of Innovations Models
      7.3.3 Modeling Diffusion of Innovations
      7.3.4 Intervention
   7.4 Epidemics
      7.4.1 Definitions
      7.4.2 SI Model
      7.4.3 SIR Model
      7.4.4 SIS Model
      7.4.5 SIRS Model
      7.4.6 Intervention
   7.5 Summary
   7.6 Bibliographic Notes
   7.7 Exercises

III Applications

8 Influence and Homophily
   8.1 Measuring Assortativity
      8.1.1 Measuring Assortativity for Nominal Attributes
      8.1.2 Measuring Assortativity for Ordinal Attributes
   8.2 Influence
      8.2.1 Measuring Influence
      8.2.2 Modeling Influence
   8.3 Homophily
      8.3.1 Measuring Homophily
      8.3.2 Modeling Homophily
   8.4 Distinguishing Influence and Homophily
      8.4.1 Shuffle Test
      8.4.2 Edge-Reversal Test
      8.4.3 Randomization Test
   8.5 Summary
   8.6 Bibliographic Notes
   8.7 Exercises

9
   9.5 Summary
   9.6 Bibliographic Notes
   9.7 Exercises

10 Behavior Analytics
   10.1 Individual Behavior
      10.1.1 Individual Behavior Analysis
      10.1.2 Individual Behavior Modeling
      10.1.3 Individual Behavior Prediction
   10.2 Collective Behavior
      10.2.1 Collective Behavior Analysis
      10.2.2 Collective Behavior Modeling
      10.2.3 Collective Behavior Prediction
   10.3 Summary
   10.4 Bibliographic Notes
   10.5 Exercises

Bibliography

Index
Preface
To the Instructors
The book is designed for a one-semester course for senior undergraduate
or graduate students. Though it is mainly written for students with a
Reza Zafarani
Mohammad Ali Abbasi
Huan Liu
Tempe, AZ
August, 2013
Acknowledgments
In the past several years, enormous pioneering research has been performed by numerous researchers in the interdisciplinary fields of data
mining, social computing, social network analysis, network science, computer science, and the social sciences. We are truly dwarfed by the depth,
breadth, and extent of the literature, which not only made it possible for
us to complete a text on this emerging topic, social media mining, but also
made it a seemingly endless task. In the process, we have been fortunate
in drawing inspiration and obtaining great support and help from many
people to whom we are indebted.
We would like to express our tremendous gratitude to the current and
former members of the Data Mining and Machine Learning laboratory
at Arizona State University (ASU); in particular, Nitin Agrawal, Salem
Alelyani, Geoffrey Barbier, William Cole, Zhuo Feng, Magdiel Galan-Oliveras, Huiji Gao, Pritam Gundecha, Xia (Ben) Hu, Isaac Jones, Shamanth
Kumar, Fred Morstatter, Sai Thejasvee Moturu, Ashwin Rajadesingan,
Suhas Ranganath, Jiliang Tang, Lei Tang, Xufei Wang, and Zheng Zhao.
Without their impressive accomplishments and continuing strides in advancing research in data mining, machine learning, and social computing, this book would not have been possible. Their stimulating thoughts,
creative ideas, friendly aggressiveness, willingness to extend the research
frontier, and cool company during our struggling moments (Arizona could
be scorchingly hot in some months), directly and indirectly, offered us encouragement, drive, passion, and ideas, as well as critiques in the process
toward the completion of the book.
This book project stemmed from a course on social computing offered
in 2008 at ASU. It was a seminar course that enjoyed active participation by graduate students and bright undergraduates with intelligent and
provocative minds. Lively discussion and heated arguments were fixtures
of the seminar course. Since then, it has become a regular course, evolving
into a focused theme on social media mining. Teaching assistants, students,
and guest speakers in these annual courses were of significant help to us in
choosing topics to include, determining the depth and extent of each topic,
and offering feedback on lecture materials such as homework problems,
slides, course projects, and reading materials.
We would like to especially thank Denny Abraham Cheriyan, Nitin
Ahuja, Amy Baldwin, Sai Prasanna Baskaran, Gaurav Pandey, Prerna
Chapter 1
Introduction
With the rise of social media, the web has become a vibrant and lively realm in which billions of individuals all around the globe interact, share, post, and conduct numerous daily activities. Information is collected, curated, and published by citizen journalists and simultaneously shared or consumed by thousands of individuals, who give spontaneous feedback. Social media enables us to be connected and interact with each other anywhere and anytime, allowing us to observe human behavior at an unprecedented scale with a new lens. This social media lens provides us with golden opportunities to understand individuals at scale and to mine human behavioral patterns otherwise impossible. As a byproduct, by understanding individuals better, we can design better computing systems tailored to individuals' needs that will serve them and society better. This new social media world has no geographical boundaries and incessantly churns out oceans of data. As a result, we are facing an exacerbated problem of big data: drowning in data, but thirsty for knowledge. Can data mining come to the rescue?

Unfortunately, social media data is significantly different from the traditional data that we are familiar with in data mining. Apart from its enormous size, the mainly user-generated data is noisy and unstructured, with abundant social relations such as friendships and follower-followee relations. This new type of data mandates new computational data analysis approaches that can combine social theories with statistical and data mining methods. The pressing demand for new techniques ushers in and entails a new interdisciplinary field: social media mining.
1.1 What Is Social Media Mining?
Social media shatters the boundaries between the real world and the virtual
world. We can now integrate social theories with computational methods
to study how individuals (also known as social atoms) interact and how
communities (i.e., social molecules) form. The uniqueness of social media
data calls for novel data mining techniques that can effectively handle user-generated content with rich social relations. The study and development
of these new techniques are under the purview of social media mining,
an emerging discipline under the umbrella of data mining. Social Media Mining is the process of representing, analyzing, and extracting actionable
patterns from social media data.
Social Media Mining introduces basic concepts and principal algorithms
suitable for investigating massive social media data; it discusses theories and methodologies from different disciplines such as computer science, data mining, machine learning, social network analysis, network
science, sociology, ethnography, statistics, optimization, and mathematics.
It encompasses the tools to formally represent, measure, model, and mine
meaningful patterns from large-scale social media data.
Social media mining cultivates a new kind of data scientist who is well
versed in social and computational theories, specialized to analyze recalcitrant social media data, and skilled to help bridge the gap from what we
know (social and computational theories) to what we want to know about
the vast social media world with computational tools.
1.2 New Challenges for Mining
Social media mining is an emerging field where there are more problems
than ready solutions. Equipped with interdisciplinary concepts and theories, fundamental principles, and state-of-the-art algorithms, we can stand
on the shoulders of the giants and embark on solving challenging problems
and developing novel data mining techniques and scalable computational
algorithms. In general, social media can be considered a world of social
atoms (i.e., individuals), entities (e.g., content, sites, networks, etc.), and
interactions between individuals and entities. Social theories and social
norms govern the interactions between individuals and entities. For effective social media mining, we collect information about individuals and
The data has a power-law distribution, and more often than not, data is not independent and identically distributed (i.i.d.), as is generally assumed in data mining.
1.3 Book Overview and Reader's Guide
This book consists of three parts. Part I, Essentials, outlines ways to represent social media data and provides an understanding of fundamental
elements of social media mining. Part II, Communities and Interactions,
discusses how communities can be found in social media and how interactions occur and information propagates in social media. Part III, Applications, offers some novel illustrative applications of social media mining.
Throughout the book, we use examples to explain how things work and to
deepen the understanding of abstract concepts and profound algorithms.
These examples show in a tangible way how theories are applied or ideas
are materialized in discovering meaningful patterns in social media data.
Consider an online social networking site with millions of members
in which members have the opportunity to befriend one another, send
messages to each other, and post content on the site. Facebook, LinkedIn,
and Twitter are exemplars of such sites. To make sense of data from these
sites, we resort to social media mining to answer corresponding questions.
In Part I: Essentials (Chapters 2-5), we learn to answer questions such as
the following:
1. Who are the most important people in a social network?
2. How do people befriend others?
3. How can we find interesting patterns in user-generated content?
These essentials come into play in Part II: Communities and Interactions
(Chapters 6 and 7) where we attempt to analyze how communities are
formed, how they evolve, and how the qualities of detected communities
are evaluated. We show ways in which information diffusion in social
media can be studied. We aim to answer general questions such as the
following:
1. How can we identify communities in a social network?
Figure 1.1: Dependency between Book Chapters. Arrows show dependencies and colors represent book parts.
basics and tangible examples of this emerging field and understanding the
potentials and opportunities that social media mining can offer.
1.4 Summary

1.5 Bibliographic Notes
For historical notes on social media sites and challenges in social media,
refer to [81, 173, 141, 150, 115]. Kaplan and Haenlein [141] provide a categorization of social media sites into collaborative projects, blogs, content
communities, social networking sites, virtual game worlds, and virtual
social worlds. Our definition of social media is a rather abstract one whose
elements are social atoms (individuals), entities, and interactions. A more
detailed abstraction can be found in the work of [149]. They consider the
seven building blocks of social media to be identity, conversation, sharing, presence, relationships, reputation, and groups. They argue that the
amount of attention that sites give to these building blocks makes them
different in nature. For instance, YouTube provides more functionality in
terms of groups than LinkedIn.
Social media mining brings together techniques from many disciplines.
General references that can accompany this book and help readers better
understand the material in this book can be found in data mining and web
mining [120, 280, 92, 174, 51], machine learning [40], and pattern recognition [75] texts, as well as network science and social network analysis
[78, 253, 212, 140, 28] textbooks. For relevant references on optimization
refer to [44, 219, 228, 207] and for algorithms to [61, 151]. For general
references on social research methods consult [36, 47]. Note that these are
generic references and more specific references are provided at the end of
each chapter. This book discusses nonmultimedia data in social media.
For multimedia data analysis refer to [49].
Recent developments in social media mining can be found in journal articles in IEEE Transactions on Knowledge and Data Engineering
(TKDE), ACM Transactions on Knowledge Discovery from Data (TKDD),
ACM Transactions on Intelligent Systems and Technology (TIST), Social
Network Analysis and Mining (SNAM), Knowledge and Information Systems (KAIS), ACM Transactions on the Web (TWEB), Data Mining and
Knowledge Discovery (DMKD), World Wide Web Journal, Social Networks, Internet Mathematics, IEEE Intelligent Systems, and SIGKDD Exploration. Conference papers can be found in proceedings of Knowledge
Discovery and Data Mining (KDD), World Wide Web (WWW), Association for Computational Linguistics (ACL), Conference on Information
and Knowledge Management (CIKM), International Conference on Data
Mining (ICDM), Internet Measurement Conference (IMC), International Con-
1.6 Exercises

Twitter
Pandora
Del.icio.us
Meetup
10. Rumors spread rapidly on social media. Can you think of some
method to block the spread of rumors on social media?
Part I
Essentials
Chapter 2
Graph Essentials
We live in a connected world in which networks are intertwined with
our daily life. Networks of air and land transportation help us reach our
destinations; critical infrastructure networks that distribute water and electricity are essential for our society and economy to function; and networks
of communication help disseminate information at an unprecedented rate.
Finally, our social interactions form social networks of friends, family, and
colleagues. Social media attests to the growing body of these social networks in which individuals interact with one another through friendships,
email, blogposts, buying similar products, and many other mechanisms.
Social media mining aims to make sense of these individuals embedded
in networks. These connected networks can be conveniently represented
using graphs. As an example, consider a set of individuals on a social
networking site where we want to find the most influential individual.
Each individual can be represented using a node (circle) and two individuals
who know each other can be connected with an edge (line). In Figure 2.1,
we show a set of seven individuals and their friendships. Consider a
hypothetical social theory that states that the more individuals you know,
the more influential you are. This theory in our graph translates to the
individual with the maximum degree (the number of edges connected to
its corresponding node) being the most influential person. Therefore, in
this network Juan is the most influential individual because he knows
four others, which is more than anyone else. This simple scenario is
an instance of many problems that arise in social media, which can be
solved by modeling the problem as a graph. This chapter formally details
2.1 Graph Basics

2.1.1 Nodes
All graphs have fundamental building blocks. One major part of any
graph is the set of nodes. In a graph representing friendship, these nodes
represent people, and any pair of connected people denotes the friendship
between them. Depending on the context, these nodes are called vertices
or actors. For example, in a web graph, nodes represent websites, and the
connections between nodes indicate web-links between them. In a social
setting, these nodes are called actors. The mathematical representation for a set of nodes is

V = {v_1, v_2, ..., v_n},    (2.1)
2.1.2 Edges

Another important element of any graph is the set of edges. Edges connect nodes. In a social setting, where nodes represent social entities such as people, edges indicate inter-node relationships and are therefore known as relationships or (social) ties. The edge set is usually represented using E,

E = {e_1, e_2, ..., e_m},    (2.2)
2.1.3 Degree and Degree Distribution
The number of edges connected to one node is the degree of that node. The degree of a node v_i is often denoted using d_i. In the case of directed edges, nodes have in-degrees (edges pointing toward the node) and out-degrees (edges pointing away from the node). These values are denoted using d_i^in and d_i^out, respectively. In social media, degree represents the number of friends a given user has. For example, on Facebook, degree represents the user's number of friends, and on Twitter in-degree and out-degree represent the number of followers and followees, respectively. In any undirected graph, the summation of all node degrees is equal to twice the number of edges.
Theorem 2.1. The summation of degrees in an undirected graph is twice the number of edges,

Σ_i d_i = 2|E|.    (2.3)

Proof. Any edge has two endpoints; therefore, when calculating the degrees d_i and d_j for any connected nodes v_i and v_j, the edge between them contributes 1 to both d_i and d_j; hence, if the edge is removed, d_i and d_j become d_i - 1 and d_j - 1, and the summation Σ_k d_k becomes Σ_k d_k - 2. Hence, by removal of all m edges, the degree summation becomes smaller by 2m. However, we know that when all edges are removed the degree summation becomes zero; therefore, the degree summation is 2m = 2|E|.
Lemma 2.1. In an undirected graph, there are an even number of nodes having odd degree.

Proof. The result can be derived from the previous theorem directly because the summation of degrees is even: 2|E|. Therefore, when nodes with even degree are removed from this summation, the summation of degrees of nodes with odd degree should also be even; hence, there must exist an even number of nodes with odd degree.
Lemma 2.2. In any directed graph, the summation of in-degrees is equal to the summation of out-degrees,

Σ_i d_i^out = Σ_j d_j^in.    (2.4)
Degree Distribution

In very large graphs, the distribution of node degrees (degree distribution) is an important attribute to consider. The degree distribution plays an important role in describing the network being studied. Any distribution can be described by its members. In our case, these are the degrees of all nodes in the graph. The degree distribution p_d (or P(d), or P(d_v = d)) gives the probability that a randomly selected node v has degree d. Because p_d is a probability distribution, Σ_{d=0}^{∞} p_d = 1. In a graph with n nodes, p_d is defined as

p_d = n_d / n,    (2.5)

where n_d is the number of nodes with degree d. An important, commonly performed procedure is to plot a histogram of the degree distribution, in which the x-axis represents the degree (d) and the y-axis represents either (1) the number of nodes having that degree (n_d) or (2) the fraction of nodes having that degree (p_d).

Example 2.1. For the graph provided in Figure 2.1, the degree distribution p_d for d = {1, 2, 3, 4} is

p_1 = 1/7,  p_2 = 4/7,  p_3 = 1/7,  p_4 = 1/7,    (2.6)

because four nodes have degree 2, and degrees 1, 3, and 4 are each observed once.
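As a small illustration of Equation 2.5, the following Python sketch computes the degree distribution from an edge list. The graph below is a hypothetical stand-in with the same degree distribution as Example 2.1 (Figure 2.1 itself is not reproduced in this text).

    from collections import Counter

    # Hypothetical 7-node friendship graph with the same degree distribution
    # as Example 2.1 (one node of degree 4, one of degree 3, four of degree 2,
    # one of degree 1).
    edges = [(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 6), (4, 5), (6, 7)]

    # Every undirected edge contributes 1 to the degree of both endpoints.
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    n = len(degree)                            # number of nodes
    n_d = Counter(degree.values())             # n_d = number of nodes with degree d
    p = {d: n_d[d] / n for d in sorted(n_d)}   # Equation 2.5: p_d = n_d / n
    print(p)                                   # {1: 0.14..., 2: 0.57..., 3: 0.14..., 4: 0.14...}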
Figure 2.3: Facebook Degree Distribution for the US and Global Users. There exist many users with few friends and a few users with many friends. This is due to a power-law degree distribution.

Example 2.2. On social networking sites, friendship relationships can be represented by a large graph. In this graph, nodes represent individuals and edges represent friendship relationships. We can compute the degrees and plot the degree distribution using a graph where the x-axis is the degree and the y-axis is the fraction of nodes with that degree. The degree distribution plot for Facebook in May 2012 is shown in Figure 2.3. A general trend observable on social networking sites is that there exist many users with few connections and there exist a handful of users with very large numbers of friends. This is commonly called the power-law degree distribution.

As previously discussed, any graph G can be represented as a pair G(V, E), where V is the node set and E is the edge set. Since edges are between nodes, we have

E ⊆ V × V.    (2.7)

Graphs can also have subgraphs. For any graph G(V, E), a graph G'(V', E')
2.2 Graph Representation

We have demonstrated the visual representation of graphs. This representation, although clear to humans, cannot be used effectively by computers or manipulated using mathematical tools. We therefore seek representations that can store the node and edge sets in a way that (1) does not lose information, (2) can be manipulated easily by computers, and (3) can have mathematical methods applied easily.

Adjacency Matrix

A simple way of representing graphs is to use an adjacency matrix (also known as a sociomatrix). Figure 2.4 depicts an example of a graph and its corresponding adjacency matrix. A value of 1 in the adjacency matrix indicates a connection between nodes v_i and v_j, and a 0 denotes no connection between the two nodes. When generalized, any real number can be used to show the strength of connections between two nodes. The adjacency matrix gives a natural mathematical representation for graphs. Note that
2.3 Types of Graphs
In general, there are many basic types of graphs. In this section we discuss
several basic types of graphs.
Null Graph. A null graph is a graph where the node set is empty (there
are no nodes). Obviously, since there are no nodes, there are also no edges.
Formally,
G(V, E),  V = E = ∅.    (2.11)
Empty Graph. An empty or edgeless graph is one where the edge set is
empty:
G(V, E),  E = ∅.    (2.12)
Note that the node set can be non-empty. A null graph is an empty
graph but not vice versa.
Directed/Undirected/Mixed Graphs. Graphs that we have discussed thus
far rarely had directed edges. As mentioned, graphs that only have directed edges are called directed graphs and ones that only have undirected
ones are called undirected graphs. Mixed graphs have both directed and
undirected edges. In directed graphs, we can have two edges between i
and j (one from i to j and one from j to i), whereas in undirected graphs
only one edge can exist. As a result, the adjacency matrix for directed
graphs is not in general symmetric (i connected to j does not mean j is
connected to i, i.e., A_{i,j} ≠ A_{j,i}), whereas the adjacency matrix for undirected graphs is symmetric (A = A^T).
In social media, there are many directed and undirected networks. For
instance, Facebook is an undirected network in which if Jon is a friend
of Mary, then Mary is also a friend of Jon. Twitter is a directed network,
where follower relationships are not bidirectional. One direction is called
followers, and the other is denoted as following.
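As a quick illustration of this symmetry property, the following Python sketch builds adjacency matrices from small made-up edge lists and checks whether each equals its transpose.

    import numpy as np

    def adjacency_matrix(n, edges, directed=False):
        """Build an n x n adjacency matrix from a list of (i, j) edges."""
        A = np.zeros((n, n), dtype=int)
        for i, j in edges:
            A[i, j] = 1
            if not directed:
                A[j, i] = 1          # undirected edges connect both ways
        return A

    # Hypothetical undirected friendship graph: A is symmetric.
    A_und = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)], directed=False)
    print(np.array_equal(A_und, A_und.T))    # True

    # Hypothetical directed follower graph: A is generally not symmetric.
    A_dir = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)], directed=True)
    print(np.array_equal(A_dir, A_dir.T))    # False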
2.4 Connectivity in Graphs
Two nodes v_1 and v_2 in a graph G(V, E) are adjacent (connected) when they are connected via an edge:

e(v_1, v_2) ∈ E.    (2.13)

Two edges e_1(a, b) and e_2(c, d) are incident when they share one endpoint (i.e., are connected via a node):

e_1(a, b) is incident to e_2(c, d)  ⟺  (a = c) ∨ (a = d) ∨ (b = c) ∨ (b = d).    (2.14)

Figure 2.6 depicts adjacent nodes and incident edges in a sample graph. In a directed graph, two edges are incident if the ending of one is the beginning of the other; that is, the edge directions must match for edges to be incident.
Figure 2.6: Adjacent Nodes and Incident Edges. In this graph u and v, as
well as v and w, are adjacent nodes, and edges (u, v) and (v, w) are incident
edges.
Traversing an Edge. An edge in a graph can be traversed when one starts
at one of its end-nodes, moves along the edge, and stops at its other end-node. So, if an edge e(a, b) connects nodes a and b, then visiting e can start
at a and end at b. Alternatively, in an undirected graph we can start at b
and end the visit at a.
Walk, Path, Trail, Tour, and Cycle. A walk is a sequence of incident edges
traversed one after another. In other words, if in a walk one traverses
edges e_1(v_1, v_2), e_2(v_2, v_3), e_3(v_3, v_4), ..., e_n(v_n, v_{n+1}), we have v_1 as the walk's starting node and v_{n+1} as the walk's ending node. When a walk does not end where it started (v_1 ≠ v_{n+1}), then it is called an open walk. When a walk returns to where it was started (v_1 = v_{n+1}), it is called a closed walk. Similarly, a walk can be denoted as a sequence of nodes, v_1, v_2, v_3, ..., v_n. In this representation, the edges that are traversed are e_1(v_1, v_2), e_2(v_2, v_3), ..., e_{n-1}(v_{n-1}, v_n). The length of a walk is the number of edges traversed during the walk and in our case is n - 1.
A trail is a walk where no edge is traversed more than once; therefore,
all walk edges are distinct. A closed trail (one that ends where it started)
is called a tour or circuit.
A walk where nodes and edges are distinct is called a path, and a closed
path is called a cycle. The length of a path or cycle is the number of edges
traversed in the path or cycle. In a directed graph, we have directed paths
because traversal of edges is only allowed in the direction of the edges. In
Figure 2.7, v4 , v3 , v6 , v4 , v2 is a walk; v4 , v3 is a path; v4 , v3 , v6 , v4 , v2 is a trail;
and v4 , v3 , v6 , v4 is both a tour and a cycle.
A graph has a Hamiltonian cycle if it has a cycle such that all the nodes
in the graph are visited. It has an Eulerian tour if all the edges are traversed
only once. Examples of a Hamiltonian cycle and an Eulerian tour are
provided in Figure 2.8.
Figure 2.7: Walk, Path, Trail, Tour, and Cycle. In this figure, v_4, v_3, v_6, v_4, v_2 is a walk; v_4, v_3 is a path; v_4, v_3, v_6, v_4, v_2 is a trail; and v_4, v_3, v_6, v_4 is both a tour and a cycle.

One can perform a random walk on a weighted graph, where nodes are visited randomly. The weight of an edge, in this case, defines the probability of traversing it. For this to work correctly, we must make sure that for all edges that start at v_i we have

Σ_x w_{i,x} = 1,  and for all i, j:  w_{i,j} ≥ 0.    (2.15)
The random walk procedure is outlined in Algorithm 2.1. The algorithm starts at a node v0 and visits its adjacent nodes based on the transition
probability (weight) assigned to edges connecting them. This procedure
is performed for t steps (provided to the algorithm); therefore, a walk of
length t is generated by the random walk.
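Algorithm 2.1 itself is not reproduced in this text; the following Python sketch follows the same idea under the stated assumption that outgoing weights sum to 1: start at v_0 and, for t steps, move to a neighbor chosen with probability equal to the edge weight. The graph is hypothetical.

    import random

    def random_walk(weights, v0, t):
        """Random walk of length t on a weighted graph.

        weights[u] is a dict {neighbor: w} of outgoing edge weights of u,
        assumed to sum to 1 for every node (Equation 2.15).
        """
        walk = [v0]
        current = v0
        for _ in range(t):
            neighbors = list(weights[current])
            probs = [weights[current][v] for v in neighbors]
            current = random.choices(neighbors, weights=probs, k=1)[0]
            walk.append(current)
        return walk

    # Hypothetical 3-node weighted graph.
    weights = {
        'a': {'b': 0.5, 'c': 0.5},
        'b': {'a': 1.0},
        'c': {'a': 0.3, 'b': 0.7},
    }
    print(random_walk(weights, 'a', t=5))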
Connectivity. A node vi is connected to node v j (or v j is reachable from
vi ) if it is adjacent to it or there exists a path from vi to v j . A graph is
connected if there exists a path between any pair of nodes in it. In a directed
graph, the graph is weakly connected if there exists a path between any
pair of nodes, without following the edge directions (i.e., directed edges
are replaced with undirected edges). The graph is strongly connected if
there exists a directed path (following edge directions) between any pair
of nodes. Figure 2.9 shows examples of connected, disconnected, weakly
connected, and strongly connected graphs.
Components. A component in an undirected graph is a subgraph such that there exists a path between every pair of nodes inside the component.
We denote the length of the shortest path between nodes v_i and v_j as l_{i,j}. The concept of the neighborhood of a node v_i can be generalized using shortest paths. An n-hop neighborhood of node v_i is the set of nodes that are within n hops distance from node v_i. That is, their shortest path to v_i has length less than or equal to n.

Diameter. The diameter of a graph is defined as the length of the longest shortest path between any pair of nodes in the graph. It is defined only for connected graphs because the shortest path might not exist in disconnected graphs. Formally, for a graph G, the diameter is defined as

diameter_G = max_{(v_i, v_j) ∈ V × V} l_{i,j}.    (2.16)

2.5 Special Graphs
Using general concepts defined thus far, many special graphs can be defined. These special graphs can be used to model different problems. We
review some well-known special graphs and their properties in this section.
2.5.1 Trees and Forests

Trees are special cases of undirected graphs. A tree is a graph structure that has no cycle in it. In a tree, there is exactly one path between any pair of nodes. A graph consisting of a set of disconnected trees is called a forest. A
forest is shown in Figure 2.11.
Figure 2.12: Minimum Spanning Tree. Nodes represent cities and values
assigned to edges represent geographical distance between cities. Highlighted edges are roads that are built in a way that minimizes their total
length.
In a tree with |V| nodes, we have |E| = |V| - 1 edges. This can be proved
by contradiction (see Exercises).
2.5.2
Special Subgraphs
Some subgraphs are frequently used because of their properties. Two such
subgraphs are discussed here.
Spanning Tree. For any connected graph, the spanning tree is a subgraph and a tree that includes all the nodes of the graph. Obviously,
when the original graph is not a tree, then its spanning tree includes
all the nodes, but not all the edges. There may exist multiple spanning trees for a graph. For a weighted graph and one of its spanning
trees, the weight of that spanning tree is the summation of the edge
weights in the tree. Among the many spanning trees found for a
weighted graph, the one with the minimum weight is called the minimum spanning tree (MST).
For example, consider a set of cities, where roads need to be built to
connect them. We know the distance between each pair of cities. We
can represent each city with a node and the distance between these
nodes using an edge between them labeled with the distance. This
graph-based view is shown in Figure 2.12. In this graph, nodes v1 ,
v2 , . . . , v9 represent cities, and the values attached to edges represent
the distance between them. Note that edges only represent distances
(potential roads!), and roads may not exist between these cities. Due
to construction costs, the government needs to minimize the total
mileage of roads built and, at the same time, needs to guarantee that
there is a path (i.e., a set of roads) that connects every two cities.
The minimum spanning tree is a solution to this problem. The edges
in the MST represent roads that need to be built to connect all of
the cities at the minimum length possible. Figure 2.12 highlights the minimum spanning tree.
2.5.3 Complete Graphs
A complete graph is a graph where for a set of nodes V, all possible edges
exist in the graph. In other words, all pairs of nodes are connected with an
edge. Hence,
|E| = |V|(|V| - 1)/2.    (2.17)
Complete graphs with n nodes are often denoted as Kn . K1 , K2 , K3 , and
K4 are shown in Figure 2.14.
2.5.4 Planar Graphs
A graph that can be drawn in such a way that no two edges cross each
other (other than the endpoints) is called planar. A graph that is not planar
is denoted as nonplanar. Figure 2.15 shows an example of a planar graph
and a nonplanar graph.
2.5.5 Bipartite Graphs
A bipartite graph G(V, E) is a graph where the node set can be partitioned
into two sets such that, for all edges, one endpoint is in one set and the
other endpoint is in the other set. In other words, edges connect nodes in
these two sets, but there exist no edges between nodes that belong to the
same set. Formally,
V = V_L ∪ V_R,    (2.18)
V_L ∩ V_R = ∅,    (2.19)
E ⊆ V_L × V_R.    (2.20)
2.5.6 Regular Graphs
A regular graph is one in which all nodes have the same degree. A regular
graph where all nodes have degree 2 is called a 2-regular graph. More
generally, a graph where all nodes have degree k is called a k-regular graph.
2.5.7 Bridges
2.6 Graph Algorithms
In this section, we review some well-known algorithms for graphs, although they are only a small fraction of the plethora of algorithms related
to graphs.
2.6.1 Graph/Tree Traversal
Among the most useful algorithms for graphs are the traversal algorithms
for graphs, and special subgraphs, such as trees. Consider a social media
2.6.2 Shortest Path Algorithms
1. All nodes are initially unvisited. From the unvisited set of nodes,
the one that has the minimum shortest path length is selected. We
denote this node as smallest in the algorithm.
2. For this node, we check all its neighbors that are still unvisited.
For each unvisited neighbor, we check if its current distance can be
improved by considering the shortest path that goes through smallest. This can be performed by comparing its current shortest path
length (distance(neighbor)) to the path length that goes through smallest (distance(smallest)+w(smallest, neighbor)). This condition is checked
in Line 17.
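The algorithm listing referenced above (and its Line 17) is not included in this text; as a stand-in, here is a minimal Python sketch of Dijkstra's shortest path procedure following the two steps just described, using a heap to select the unvisited node with the smallest tentative distance. The weighted graph is made up.

    import heapq

    def dijkstra(graph, source):
        """graph[u] is a dict {v: w(u, v)} of nonnegative edge weights."""
        distance = {v: float('inf') for v in graph}
        distance[source] = 0
        visited = set()
        heap = [(0, source)]
        while heap:
            d, smallest = heapq.heappop(heap)   # step 1: unvisited node with minimum distance
            if smallest in visited:
                continue
            visited.add(smallest)
            for neighbor, w in graph[smallest].items():   # step 2: try to improve each neighbor
                if neighbor not in visited and d + w < distance[neighbor]:
                    distance[neighbor] = d + w
                    heapq.heappush(heap, (d + w, neighbor))
        return distance

    # Hypothetical weighted graph.
    graph = {
        'a': {'b': 2, 'c': 5},
        'b': {'a': 2, 'c': 1, 'd': 4},
        'c': {'a': 5, 'b': 1, 'd': 1},
        'd': {'b': 4, 'c': 1},
    }
    print(dijkstra(graph, 'a'))   # {'a': 0, 'b': 2, 'c': 3, 'd': 4}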
2.6.3 Minimum Spanning Trees
with its endpoint). This process is iterated until the graph is fully spanned. An example of Prim's algorithm is provided in Figure 2.21.
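Prim's algorithm itself (Figure 2.21 and the corresponding listing are not reproduced here) can be sketched in a few lines of Python: grow the tree from an arbitrary node and repeatedly add the cheapest edge that reaches a new node. The weighted graph is hypothetical.

    import heapq

    def prim_mst(graph, start):
        """graph[u] is a dict {v: w} for a connected, undirected weighted graph."""
        in_tree = {start}
        mst_edges = []
        # candidate edges (weight, u, v) leaving the current tree
        heap = [(w, start, v) for v, w in graph[start].items()]
        heapq.heapify(heap)
        while heap and len(in_tree) < len(graph):
            w, u, v = heapq.heappop(heap)
            if v in in_tree:
                continue                    # edge no longer leaves the tree
            in_tree.add(v)
            mst_edges.append((u, v, w))     # cheapest edge reaching a new node
            for x, wx in graph[v].items():
                if x not in in_tree:
                    heapq.heappush(heap, (wx, v, x))
        return mst_edges

    # Hypothetical cities and distances.
    graph = {
        'v1': {'v2': 4, 'v3': 3},
        'v2': {'v1': 4, 'v3': 2, 'v4': 5},
        'v3': {'v1': 3, 'v2': 2, 'v4': 7},
        'v4': {'v2': 5, 'v3': 7},
    }
    print(prim_mst(graph, 'v1'))   # [('v1', 'v3', 3), ('v3', 'v2', 2), ('v2', 'v4', 5)]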
2.6.4 Network Flow Algorithms

Flow Network. A flow network G(V, E, C) is a directed weighted graph, where we have the following:

• For every edge e(u, v) ∈ E, c(u, v) ≥ 0 defines the edge capacity.
• When (u, v) ∈ E, (v, u) ∉ E (opposite flow is impossible).
• s defines the source node and t defines the sink node. An infinite supply of flow is connected to the source.

A sample flow network, along with its capacities, is shown in Figure 2.22.
Flow. Given edges with certain capacities, we can fill these edges with the flow up to their capacities. This is known as the capacity constraint. Furthermore, we should guarantee that the flow that enters any node other than source s and sink t is equal to the flow that exits it so that no flow is lost (flow conservation constraint). Formally,

• ∀(u, v) ∈ E, f(u, v) ≥ 0 defines the flow passing through that edge.
• ∀(u, v) ∈ E, 0 ≤ f(u, v) ≤ c(u, v) (capacity constraint).
• ∀v ∈ V, v ∉ {s, t}: Σ_{k:(k,v)∈E} f(k, v) = Σ_{l:(v,l)∈E} f(v, l) (flow conservation constraint).
Flow Quantity. The flow quantity (or value of the flow) in any network is the amount of outgoing flow from the source minus the incoming flow to the source. Alternatively, one can compute this value by subtracting the outgoing flow from the sink from its incoming value:

flow = Σ_v f(s, v) - Σ_v f(v, s) = Σ_v f(v, t) - Σ_v f(t, v).    (2.21)
Example 2.4. The flow quantity for the example in Figure 2.23 is 19:

flow = Σ_v f(s, v) - Σ_v f(v, s) = (11 + 8) - 0 = 19.    (2.22)
Our goal is to find the flow assignments to each edge with the maximum
flow quantity. This can be achieved by a maximum flow algorithm. A well-established one is the Ford-Fulkerson algorithm [90].
Ford-Fulkerson Algorithm
The intuition behind this algorithm is as follows: Find a path from source
to sink such that there is unused capacity for all edges in the path. Use that
capacity (the minimum capacity unused among all edges on the path) to
increase the flow. Iterate until no other path is available.
In the residual graph, when edges are in the same direction as the original
graph, their capacity shows how much more flow can be pushed along
that edge in the original graph. When edges are in the opposite direction,
their capacities show how much flow can be pushed back on the original
graph edge. So, by finding a flow in the residual, we can augment the flow
in the original graph. Any simple path from s to t in the residual graph
is an augmenting path. Since all capacities in the residual are positive,
these paths can augment flows in the original, thus increasing the flow.
The amount of flow that can be pushed along this path is equal to the
minimum capacity along the path, since the edge with that capacity limits
the amount of flow being pushed. Given flow f(u, v) in the original graph and flow f_R(u, v) and f_R(v, u) in the residual graph, we can augment the flow as follows:

f_augmented(u, v) = f(u, v) + f_R(u, v) - f_R(v, u).    (2.24)
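To make the augmenting-path idea concrete, the following Python sketch implements the Ford-Fulkerson method with breadth-first search for finding augmenting paths in the residual graph (the Edmonds-Karp variant); the capacities below are hypothetical, not those of Figure 2.22.

    from collections import deque

    def max_flow(capacity, s, t):
        """capacity[u][v] is the capacity c(u, v); returns the maximum flow value."""
        # residual capacities, initialized to the original capacities
        residual = {u: dict(edges) for u, edges in capacity.items()}
        for u in capacity:
            for v in capacity[u]:
                residual[v].setdefault(u, 0)      # reverse edges start with capacity 0

        def bfs_augmenting_path():
            parent = {s: None}
            queue = deque([s])
            while queue:
                u = queue.popleft()
                for v, cap in residual[u].items():
                    if v not in parent and cap > 0:
                        parent[v] = u
                        if v == t:
                            return parent
                        queue.append(v)
            return None                           # no augmenting path left

        flow = 0
        parent = bfs_augmenting_path()
        while parent is not None:
            # bottleneck: minimum residual capacity along the augmenting path
            bottleneck = float('inf')
            v = t
            while parent[v] is not None:
                u = parent[v]
                bottleneck = min(bottleneck, residual[u][v])
                v = u
            # push the bottleneck flow along the path and update the residual graph
            v = t
            while parent[v] is not None:
                u = parent[v]
                residual[u][v] -= bottleneck
                residual[v][u] += bottleneck
                v = u
            flow += bottleneck
            parent = bfs_augmenting_path()
        return flow

    # Hypothetical flow network.
    capacity = {
        's': {'a': 11, 'b': 8},
        'a': {'b': 3, 't': 10},
        'b': {'t': 9},
        't': {},
    }
    print(max_flow(capacity, 's', 't'))   # 19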
2.6.5 Matching
2.6.6 Bridge Detection
As discussed in Section 2.5.7, bridges or cut edges are edges whose removal
makes formerly connected components disconnected. Here we list a simple algorithm for detecting bridges. This algorithm is computationally
expensive, but quite intuitive. More efficient algorithms have been described for the same task.
Since we know that, by removing bridges, formerly connected components become disconnected, one simple algorithm is to remove edges one
by one and test if the connected components become disconnected. This
algorithm is outlined in Algorithm 2.7.
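Algorithm 2.7 is not reproduced here; the sketch below follows the same brute-force idea in Python: remove each edge in turn and use BFS to test whether its endpoints remain connected. The example graph is hypothetical.

    from collections import deque

    def connected(adj, u, v, skip_edge):
        """BFS from u to v, ignoring skip_edge in both directions."""
        seen = {u}
        queue = deque([u])
        while queue:
            x = queue.popleft()
            if x == v:
                return True
            for y in adj[x]:
                if (x, y) == skip_edge or (y, x) == skip_edge:
                    continue
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        return False

    def bridges(adj):
        """Return all edges whose removal disconnects their endpoints."""
        result = []
        seen_edges = set()
        for u in adj:
            for v in adj[u]:
                if (v, u) in seen_edges:
                    continue                      # undirected edge already tested
                seen_edges.add((u, v))
                if not connected(adj, u, v, (u, v)):
                    result.append((u, v))
        return result

    # Hypothetical graph: two triangles joined by the single edge (3, 4).
    adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3, 5, 6], 5: [4, 6], 6: [4, 5]}
    print(bridges(adj))   # [(3, 4)]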
The disconnectedness of a component whose edge e(u, v) is removed
can be analyzed by means of any graph traversal algorithm (e.g., BFS or
(Footnote 4: The proof is omitted here and is a direct result from the minimum-cut/maximum-flow theorem, not discussed in this chapter.)
2.7 Summary
This chapter covered the fundamentals of graphs, starting with a presentation of the fundamental building blocks required for graphs: first nodes
and edges, and then properties of graphs such as degree and degree distribution. Any graph must be represented using some data structure for
computability. This chapter covered three well-established techniques: adjacency matrix, adjacency list, and edge list. Due to the sparsity of social
networks, both adjacency list and edge list are more efficient and save significant space when compared to adjacency matrix. We then described various types of graphs: null and empty graphs, directed/undirected/mixed
graphs, simple/multigraphs, and weighted graphs. Signed graphs are examples of weighted graphs that can be used to represent contradictory
behavior.
We discussed connectivity in graphs and concepts such as paths, walks,
trails, tours, and cycles. Components are connected subgraphs. We discussed strongly and weakly connected components. Given the connectivity of a graph, one is able to compute the shortest paths between different
nodes. The longest shortest path in the graph is known as the diameter.
Special graphs can be formed based on the way nodes are connected and
the degree distributions. In complete graphs, all nodes are connected to
all other nodes, and in regular graphs, all nodes have an equal degree. A
tree is a graph with no cycle. We discussed two special trees: the spanning tree and the Steiner tree. Bipartite graphs can be partitioned into two
sets of nodes, with edges between these sets and no edges inside these
sets. Affiliation networks are examples of bipartite graphs. Bridges are
single-point-of-failure edges that can make previously connected graphs
disconnected.
In the section on graph algorithms, we covered a variety of useful techniques. Traversal algorithms provide an ordering of the nodes of a graph.
These algorithms are particularly useful in checking whether a graph is
connected or in generating paths. Shortest path algorithms find paths
with the shortest length between a pair of nodes; Dijkstras algorithm is
an example. Spanning tree algorithms provide subgraphs that span all the
nodes and select edges that sum up to a minimum value; Prims algorithm
is an example. The Ford-Fulkerson algorithm, is one of the maximum flow
algorithms. It finds the maximum flow in a weighted capacity graph. Maximum bipartite matching is an application of maximum flow that solves
66
67
2.8 Bibliographic Notes
The algorithms detailed in this chapter are from three well-known fields:
graph theory, network science, and social network analysis. Interested
readers can get better insight regarding the topics in this chapter by referring to general references in graph theory [43, 301, 71], algorithms and
algorithm design [151, 61], network science [212], and social network analysis [294].
Other algorithms not discussed in this chapter include graph coloring
[139], (quasi) clique detection [2], graph isomorphism [191], topological
sort algorithms [61], and the traveling salesman problem (TSP) [61], among
others. In graph coloring, one aims to color elements of the graph such as
nodes and edges such that certain constraints are satisfied. For instance,
in node coloring the goal is to color nodes such that adjacent nodes have
different colors. Cliques are complete subgraphs. Unfortunately, solving
many problems related to cliques, such as finding a clique that has more
than a given number of nodes, is NP-complete. In clique detection, the
goal is to solve similar clique problems efficiently or provide approximate
solutions. In graph isomorphism, given two graphs G and G0 , our goal is
to find a mapping f from nodes of G to G0 such that for any two nodes
of G that are connected, their mapped nodes in G0 are connected as well.
In topological sort algorithms, a linear ordering of nodes is found in a
directed graph such that for any directed edge (u, v) in the graph, node u
comes before node v in the ordering. In the traveling salesman problem
(TSP), we are provided cities and pairwise distances between them. In
graph theory terms, we are given a weighted graph where nodes represent
cities and edge weights represent distances between cities. The problem is
to find the shortest walk that visits all cities and returns to the origin city.
Other noteworthy shortest path algorithms, such as A* [122] and Bellman-Ford [32], and all-pair shortest path algorithms, such as Floyd-Warshall's [89], are employed extensively in other literature.
In spanning tree computation, Kruskal's algorithm [156] and Boruvka's algorithm [204] are also well-established algorithms.
General references for flow algorithms, other algorithms not discussed
in this chapter such as the Push-Relabel algorithm, and their optimality
can be found in [61, 7].
2.9 Exercises
Graph Basics
1. Given a directed graph G(V, E) and its adjacency matrix A, we propose two methods to make G undirected,

A'_{ij} = min(1, A_{ij} + A_{ji}),    (2.26)
A'_{ij} = A_{ij} A_{ji},    (2.27)

where A'_{ij} is the (i, j) entry of the undirected adjacency matrix. Discuss the advantages and disadvantages of each method.
Graph Representation
2. Is it possible to have the following degrees in a graph with 7 nodes?
{4, 4, 4, 3, 5, 7, 2}.
(2.28)
3.

A =
0 1 1 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0
1 1 0 1 1 0 0 0 0
0 0 1 0 1 1 1 0 0
0 0 1 1 0 1 1 0 0
0 0 0 1 1 0 1 1 0
0 0 0 1 1 1 0 1 0
0 0 0 0 0 1 1 0 1
0 0 0 0 0 0 0 1 0
    (2.29)

Special Graphs

4. Prove that |E| = |V| - 1 in trees.
Graph/Network Algorithms
5. Consider the tree shown in Figure 2.28. Traverse the graph using
both BFS and DFS and list the order in which nodes are visited in
each algorithm.
6. For a tree and a node v, under what condition is v visited sooner by
BFS than DFS? Provide details.
7. For a real-world social network, is BFS or DFS more desirable? Provide details.
8. Compute the shortest path between any pair of nodes using Dijkstra's algorithm for the graph in Figure 2.29.
Chapter 3
Network Measures
In February 2012, Kobe Bryant, the American basketball star, joined Chinese microblogging site Sina Weibo. Within a few hours, more than 100,000
followers joined his page, anxiously waiting for his first microblogging post
on the site. The media considered the tremendous number of followers
Kobe Bryant received as an indication of his popularity in China. In this
case, the number of followers measured Bryant's popularity among Chinese social media users. In social media, we often face similar tasks in
which measuring different structural properties of a social media network
can help us better understand individuals embedded in it. Corresponding measures need to be designed for these tasks. This chapter discusses
measures for social media networks.
When mining social media, a graph representation is often used. This
graph shows friendships or user interactions in a social media network.
Given this graph, some of the questions we aim to answer are as follows:
• Who are the central figures (influential individuals) in the network?
• What interaction patterns are common among friends?
• Who are the like-minded users and how can we find these similar individuals?
To answer these and similar questions, one first needs to define measures for quantifying centrality, level of interactions, and similarity, among
other qualities. These measures take as input a graph representation of a
3.1 Centrality

3.1.1 Degree Centrality
In real-world interactions, we often consider people with many connections to be important. Degree centrality transfers the same idea into a
measure. The degree centrality measure ranks nodes with more connections higher in terms of centrality. The degree centrality Cd for node vi in
an undirected graph is
C_d(v_i) = d_i,    (3.1)

where d_i is the degree (number of adjacent edges) of node v_i. In directed graphs, we can either use the in-degree, the out-degree, or the combination as the degree centrality value:

C_d(v_i) = d_i^in    (prestige),    (3.2)
C_d(v_i) = d_i^out   (gregariousness),    (3.3)
C_d(v_i) = d_i^in + d_i^out.    (3.4)
Example 3.1. Figure 3.1 shows a sample graph. In this graph, degree centrality for node v_1 is C_d(v_1) = d_1 = 8, and for all others, it is C_d(v_j) = d_j = 1, j ≠ 1.
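A tiny Python sketch of Equations 3.2-3.4 on a hypothetical directed graph (the star graph of Figure 3.1 is not reproduced here):

    edges = [(2, 1), (3, 1), (4, 1), (1, 5), (1, 6)]   # hypothetical directed edges u -> v

    nodes = {u for e in edges for u in e}
    d_in = {v: 0 for v in nodes}
    d_out = {v: 0 for v in nodes}
    for u, v in edges:
        d_out[u] += 1
        d_in[v] += 1

    # Degree centralities: prestige (3.2), gregariousness (3.3), and their sum (3.4).
    prestige = dict(d_in)
    gregariousness = dict(d_out)
    combined = {v: d_in[v] + d_out[v] for v in nodes}
    print(prestige, gregariousness, combined)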
3.1.2 Eigenvector Centrality

Eigenvector centrality tries to generalize degree centrality by incorporating the importance of the neighbors (or incoming neighbors in directed graphs). It is defined for both directed and undirected graphs. To keep track of neighbors, we can use the adjacency matrix A of a graph. Let c_e(v_i) denote the eigenvector centrality of node v_i. We want the centrality of v_i to be a function of its neighbors' centralities. We posit that it is proportional to the summation of their centralities,

c_e(v_i) = (1/λ) Σ_{j=1}^{n} A_{j,i} c_e(v_j),    (3.8)

where λ is some fixed constant. Assuming C_e = (c_e(v_1), c_e(v_2), ..., c_e(v_n))^T is the centrality vector for all nodes, we can rewrite Equation 3.8 as

λ C_e = A^T C_e.    (3.9)
Example 3.2. For the graph shown in Figure 3.2(a), the adjacency matrix is

A =
0 1 0
1 0 1
0 1 0 .    (3.10)

Based on Equation 3.9, we need to solve λC_e = AC_e, or

(A - λI)C_e = 0.    (3.11)

Assuming C_e = [u_1 u_2 u_3]^T,

[-λ 1 0; 1 -λ 1; 0 1 -λ] [u_1; u_2; u_3] = [0; 0; 0],    (3.12)

which has a nontrivial solution only when

det(A - λI) = 0,    (3.13)

or equivalently,

(-λ)(λ² - 1) - 1(-λ) = 2λ - λ³ = λ(2 - λ²) = 0.    (3.14)

The eigenvalues are therefore λ = 0 and λ = ±√2. Selecting the largest eigenvalue, λ = √2, we solve

[-√2 1 0; 1 -√2 1; 0 1 -√2] [u_1; u_2; u_3] = [0; 0; 0],    (3.15)

whose normalized solution is

C_e = [u_1; u_2; u_3] = [1/2; √2/2; 1/2],    (3.16)

which denotes that node v_2 is the most central node and nodes v_1 and v_3 have equal centrality values.
Example 3.3. For the graph shown in Figure 3.2(b), the adjacency matrix is as follows:

A =
0 1 0 1 0
1 0 1 1 1
0 1 0 1 0
1 1 1 0 0
0 1 0 0 0
    (3.17)

The eigenvalues of A are (-1.74, -1.27, 0.00, +0.33, +2.68). For eigenvector centrality, the largest eigenvalue is selected: 2.68. The corresponding eigenvector is the eigenvector centrality vector and is

C_e = [0.4119; 0.5825; 0.4119; 0.5237; 0.2169]^T.    (3.18)
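The centrality vector of Example 3.3 can be reproduced numerically. A minimal numpy sketch using power iteration on the adjacency matrix of Equation 3.17 (any eigensolver would do):

    import numpy as np

    # Adjacency matrix from Equation 3.17 (undirected, so A = A^T).
    A = np.array([
        [0, 1, 0, 1, 0],
        [1, 0, 1, 1, 1],
        [0, 1, 0, 1, 0],
        [1, 1, 1, 0, 0],
        [0, 1, 0, 0, 0],
    ], dtype=float)

    # Power iteration: repeatedly apply A^T and normalize; the vector converges
    # to the eigenvector of the largest eigenvalue, i.e., the eigenvector centrality.
    c = np.ones(A.shape[0])
    for _ in range(200):
        c = A.T @ c
        c = c / np.linalg.norm(c)

    lam = c @ A.T @ c            # Rayleigh quotient: approx. the largest eigenvalue
    print(round(float(lam), 2))  # approx. 2.68
    print(np.round(c, 4))        # approx. [0.4119 0.5825 0.4119 0.5237 0.2169]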
3.1.3 Katz Centrality

A major problem with eigenvector centrality arises when it considers directed graphs (see Problem 1 in the Exercises). Centrality is only passed on when we have (outgoing) edges, and in special cases such as when a node is in a directed acyclic graph, centrality becomes zero, even though the node can have many edges connected to it. In this case, the problem can be rectified by adding a bias term to the centrality value. The bias term β is added to the centrality values for all nodes no matter how they are situated in the network (i.e., irrespective of the network topology). The Katz centrality is then

C_Katz(v_i) = α Σ_{j=1}^{n} A_{j,i} C_Katz(v_j) + β.    (3.19)

The first term is similar to eigenvector centrality, and its effect is controlled by the constant α. The second term, β, is the bias term that avoids zero centrality values. We can rewrite Equation 3.19 in a vector form,

C_Katz = αA^T C_Katz + β1,    (3.20)

where 1 is a vector of all 1s. Taking the first term to the left-hand side and factoring C_Katz,

C_Katz = β(I - αA^T)^{-1} · 1.    (3.21)

Since we are inverting a matrix here, not all α values are acceptable. When α = 0, the eigenvector centrality part is removed, and all nodes get the same centrality value β. However, once α gets larger, the effect of β is reduced, and when det(I - αA^T) = 0, the matrix I - αA^T becomes non-invertible and the centrality values diverge. The det(I - αA^T) first becomes 0 when α = 1/λ, where λ is the largest eigenvalue² of A^T. In practice, α < 1/λ is selected so that centralities are computed correctly.
(Footnote 2: When det(I - αA^T) = 0, it can be rearranged as det(A^T - (1/α)I) = 0, which is basically the characteristic equation. This equation first becomes zero when the largest eigenvalue equals 1/α, or equivalently α = 1/λ.)

Example 3.4. For the graph shown in Figure 3.3, the adjacency matrix is as follows:

A =
0 1 1 1 0
1 0 1 1 1
1 1 0 1 1
1 1 1 0 0
0 1 1 0 0
    = A^T.    (3.22)
The eigenvalues of A are (-1.68, -1.0, -1.0, +0.35, +3.32). The largest eigenvalue of A is λ = 3.32. We assume α = 0.25 < 1/λ and β = 0.2. Then, Katz centralities are

C_Katz = β(I - αA^T)^{-1} · 1 = [1.14; 1.31; 1.31; 1.14; 0.85]^T.    (3.23)

Thus, nodes v_2 and v_3 have the highest Katz centralities.
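Equation 3.21 translates directly into a few lines of numpy. The sketch below reproduces Example 3.4 (α = 0.25, β = 0.2, adjacency matrix from Equation 3.22):

    import numpy as np

    # Adjacency matrix from Equation 3.22.
    A = np.array([
        [0, 1, 1, 1, 0],
        [1, 0, 1, 1, 1],
        [1, 1, 0, 1, 1],
        [1, 1, 1, 0, 0],
        [0, 1, 1, 0, 0],
    ], dtype=float)

    alpha, beta = 0.25, 0.2          # alpha must stay below 1/lambda_max (here 1/3.32)
    n = A.shape[0]

    # Katz centrality, Equation 3.21: C_Katz = beta * (I - alpha * A^T)^{-1} * 1
    c_katz = beta * np.linalg.inv(np.eye(n) - alpha * A.T) @ np.ones(n)
    print(np.round(c_katz, 2))       # approx. [1.14 1.31 1.31 1.14 0.85]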
3.1.4 PageRank
Similar to eigenvector centrality, Katz centrality encounters some challenges. A challenge that happens in directed graphs is that, once a node becomes an authority (high centrality), it passes all its centrality along all of its out-links. This is less desirable, because not everyone known by a well-known person is well known. To mitigate this problem, one can divide the value of passed centrality by the number of outgoing links (out-degree) from that node such that each connected neighbor gets a fraction of the source node's centrality:

C_p(v_i) = α Σ_{j=1}^{n} A_{j,i} C_p(v_j) / d_j^out + β.    (3.24)

This equation is only defined when d_j^out is nonzero. Thus, assuming that all nodes have positive out-degrees (d_j^out > 0),³ Equation 3.24 can be reformulated in matrix format,

C_p = αA^T D^{-1} C_p + β1,    (3.25)

(Footnote 3: When d_j^out = 0, we know that since the out-degree is zero, A_{j,i} = 0 for all i. This makes the term inside the summation 0/0. We can fix this problem by setting d_j^out = 1, since the node will not contribute any centrality to any other nodes.)
which, when rearranged, yields

C_p = β(I - αA^T D^{-1})^{-1} · 1,    (3.26)

where D = diag(d_1^out, d_2^out, ..., d_n^out) is a diagonal matrix of degrees. The centrality measure is known as the PageRank centrality measure and is used by the Google search engine as a measure for ordering webpages. Webpages and their links represent an enormous web-graph. PageRank defines a centrality measure for the nodes (webpages) in this web-graph. When a user queries Google, webpages that match the query and have higher PageRank values are shown first. Similar to Katz centrality, in practice, α < 1/λ is selected, where λ is the largest eigenvalue of A^T D^{-1}. In undirected graphs, the largest eigenvalue of A^T D^{-1} is λ = 1; therefore, α < 1.
Example 3.5. For the graph shown in Figure 3.4, the adjacency matrix is as follows:

A = [ 0 1 0 1 1
      1 0 1 0 1
      0 1 0 1 1
      1 0 1 0 0
      1 1 1 0 0 ].    (3.27)

The PageRank values are

C_p = β(I − αA^T D^{−1})^{−1} · 1 = (2.14, 2.13, 2.14, 1.45, 2.13)^T.    (3.28)
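Equation 3.26 can be evaluated directly. A minimal NumPy sketch follows (α = 0.95 and β = 0.1 are our illustrative choices; they are consistent with the magnitudes in Equation 3.28 but are not stated in the example):

import numpy as np

# Adjacency matrix of the graph in Example 3.5
A = np.array([[0, 1, 0, 1, 1],
              [1, 0, 1, 0, 1],
              [0, 1, 0, 1, 1],
              [1, 0, 1, 0, 0],
              [1, 1, 1, 0, 0]], dtype=float)

alpha, beta = 0.95, 0.1                  # illustrative; alpha < 1 for undirected graphs
n = A.shape[0]
D_inv = np.diag(1.0 / A.sum(axis=1))     # inverse of the out-degree matrix

# Equation 3.26: C_p = beta * (I - alpha * A^T D^{-1})^{-1} * 1
c_p = beta * np.linalg.inv(np.eye(n) - alpha * A.T @ D_inv) @ np.ones(n)
print(np.round(c_p, 2))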
3.1.5
Betweenness Centrality
Another way of looking at centrality is to consider how critical a node is in connecting other nodes. Betweenness centrality counts the shortest paths between other nodes that pass through node v_i,

C_b(v_i) = Σ_{s≠t≠v_i} σ_st(v_i) / σ_st,    (3.29)

where σ_st is the number of shortest paths between nodes s and t, and σ_st(v_i) is the number of those shortest paths that pass through v_i. In the best case, v_i lies on all shortest paths between all pairs of other nodes, so the maximum value of C_b(v_i) is 2·(n−1 choose 2). Betweenness centrality can therefore be normalized as

C_b^norm(v_i) = C_b(v_i) / ( 2·(n−1 choose 2) ).    (3.31)
Computing Betweenness
In betweenness centrality (Equation 3.29), we compute shortest paths between all pairs of nodes to compute the betweenness value. If an algorithm such as Dijkstra's is employed, it needs to be run for all nodes, because each run only yields the shortest paths from a single source node.

For instance, in a five-node graph, betweenness can be computed pair by pair. For node v_2, summing the contributions σ_st(v_2)/σ_st over all pairs of other nodes and doubling the sum (each unordered pair {s, t} is counted in both orders) gives

C_b(v_2) = 2 × 3.5 = 7.    (3.34)

For node v_3, the contributions come from the pairs (v_1, v_2), (v_1, v_4), (v_1, v_5), (v_2, v_4), (v_2, v_5), and (v_4, v_5):

C_b(v_3) = 2 × (0 + 0 + 1/2 + 0 + 1/2 + 0) = 2 × 1.0 = 2.    (3.35)

For node v_5, the contributions come from the pairs (v_1, v_2), (v_1, v_3), (v_1, v_4), (v_2, v_3), (v_2, v_4), and (v_3, v_4):

C_b(v_5) = 2 × (0 + 0 + 0 + 0 + 0 + 1/2) = 2 × 0.5 = 1.    (3.37)
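A brute-force sketch of this computation in plain Python follows (for an undirected, unweighted graph stored as an adjacency dictionary; more efficient all-pairs algorithms exist for large graphs):

from collections import deque
from itertools import permutations

def bfs_counts(adj, s):
    """Breadth-first search from s: shortest-path distances and numbers of shortest paths."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma

def betweenness(adj):
    """Betweenness centrality (Equation 3.29), summed over ordered pairs (s, t)."""
    dist, sigma = {}, {}
    for s in adj:
        dist[s], sigma[s] = bfs_counts(adj, s)
    cb = {v: 0.0 for v in adj}
    for s, t in permutations(adj, 2):
        if t not in dist[s]:
            continue                      # s and t are disconnected
        for v in adj:
            if v in (s, t) or v not in dist[s] or t not in dist[v]:
                continue
            # v lies on a shortest s-t path iff the distances add up
            if dist[s][v] + dist[v][t] == dist[s][t]:
                cb[v] += sigma[s][v] * sigma[v][t] / sigma[s][t]
    return cb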
3.1.6
Closeness Centrality
In closeness centrality, the intuition is that the more central nodes are, the
more quickly they can reach other nodes. Formally, these nodes should
have a smaller average shortest path length to other nodes. Closeness
centrality is defined as
C_c(v_i) = 1 / l_{v_i},    (3.38)

where l_{v_i} = (1/(n−1)) Σ_{v_j ≠ v_i} l_{i,j} is node v_i's average shortest path length to other nodes. The smaller the average shortest path length, the higher the centrality for the node.
Example 3.8. For the nodes in Figure 3.5, the closeness centralities are as follows:

C_c(v_1) = 1 / ((1 + 2 + 2 + 3)/4) = 0.5,    (3.39)
C_c(v_2) = 1 / ((1 + 1 + 1 + 2)/4) = 0.8,    (3.40)
C_c(v_3) = C_c(v_4) = 1 / ((1 + 1 + 2 + 2)/4) = 0.66,    (3.41)
C_c(v_5) = 1 / ((1 + 1 + 2 + 3)/4) = 0.57.    (3.42)
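A small sketch of Equation 3.38 in plain Python (it reuses the bfs_counts helper from the betweenness sketch above and assumes a connected, undirected graph):

def closeness(adj):
    """Closeness centrality: inverse of the average shortest path length (Equation 3.38)."""
    nodes = list(adj)
    cc = {}
    for v in nodes:
        dist, _ = bfs_counts(adj, v)                      # BFS distances from v
        others = [dist[u] for u in nodes if u != v and u in dist]
        cc[v] = len(others) / sum(others) if others else 0.0
    return cc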
3.1.7
Group Centrality
All centrality measures defined so far measure centrality for a single node.
These measures can be generalized for a group of nodes. In this section,
we discuss how degree centrality, closeness centrality, and betweenness
centrality can be generalized for a group of nodes. Let S denote the set of
nodes to be measured for centrality. Let V − S denote the set of nodes not
in the group.
Group Degree Centrality
Group degree centrality is defined as the number of nodes from outside the group that are connected to group members. Formally,

C_d^group(S) = |{v_i ∈ V − S | v_i is connected to some v_j ∈ S}|.    (3.43)
Similar to degree centrality, we can define connections in terms of out-degrees or in-degrees in directed graphs. We can also normalize this value. In the best case, group members are connected to all other nonmembers. Thus, the maximum value of C_d^group(S) is |V − S|, so dividing the group degree centrality value by |V − S| normalizes it.
Group Betweenness Centrality
Similar to betweenness centrality, we can define group betweenness centrality as

C_b^group(S) = Σ_{s≠t, s∉S, t∉S} σ_st(S) / σ_st,    (3.44)

where σ_st(S) denotes the number of shortest paths between s and t that pass through members of S. In the best case, all shortest paths between s and t pass through members of S, and therefore the maximum value for C_b^group(S) is 2·(|V−S| choose 2). Similar to betweenness centrality, we can normalize group betweenness centrality by dividing it by this maximum value.
Group Closeness Centrality
Closeness centrality for groups can be defined as

C_c^group(S) = 1 / l_S^group,    (3.45)

where l_S^group = (1/|V−S|) Σ_{v_j ∉ S} l_{S,v_j} and l_{S,v_j} is the length of the shortest path between the group S and a nonmember v_j ∈ V − S. This length can be defined in multiple ways. One approach is to find the closest member in S to v_j:

l_{S,v_j} = min_{v_i ∈ S} l_{v_i,v_j}.    (3.46)

One can also use the maximum distance or the average distance to compute this value.
Example 3.10. Consider the graph in Figure 3.7. Let S = {v_2, v_3}. The group degree centrality of S is

C_d^group(S) = 3.    (3.47)
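A one-function sketch of Equation 3.43 in plain Python (the graph is given as an adjacency dictionary and the group S is passed in by the caller):

def group_degree_centrality(adj, S):
    """Number of nodes outside S that have at least one neighbor inside S (Equation 3.43)."""
    S = set(S)
    outside = set(adj) - S
    return sum(1 for v in outside if any(u in S for u in adj[v]))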
3.2
Transitivity and Reciprocity

Linking behavior can be analyzed by considering closed loops of edges: loops of length 3 are captured by transitivity, and loops of length 2 by reciprocity.

3.2.1
Transitivity

Transitivity captures the observation that a friend of my friend is often my friend. For a network, it is measured by the global clustering coefficient, which counts closed paths of length 2,

C = |Closed Paths of Length 2| / |Paths of Length 2|    (3.48)
  = (Number of Triangles) × 6 / |Paths of Length 2|.    (3.49)

Since every triangle has six closed paths of length 2, we can rewrite Equation 3.49 as

C = (Number of Triangles) × 3 / (Number of Connected Triples of Nodes).    (3.50)
The local clustering coefficient measures transitivity for a single node v_i: it is the fraction of pairs of v_i's neighbors that are themselves connected,

C(v_i) = (Number of Connected Pairs of v_i's Neighbors) / (d_i choose 2).    (3.51)

Example 3.12. Figure 3.10 shows how the local clustering coefficient changes for a node as the connections among its neighbors change.
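A short sketch of both clustering coefficients in plain Python (an undirected graph stored as an adjacency dictionary of neighbor sets, with no self-loops):

from itertools import combinations

def local_clustering(adj, v):
    """Fraction of pairs of v's neighbors that are connected (Equation 3.51)."""
    nbrs = list(adj[v])
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return links / (len(nbrs) * (len(nbrs) - 1) / 2)

def global_clustering(adj):
    """Triangles x 3 divided by connected triples (Equation 3.50)."""
    triangles = sum(1 for v in adj for a, b in combinations(adj[v], 2) if b in adj[a]) / 3
    triples = sum(len(adj[v]) * (len(adj[v]) - 1) / 2 for v in adj)
    return 3 * triangles / triples if triples else 0.0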
3.2.2
Reciprocity

Reciprocity is a simplified version of transitivity that considers closed loops of length 2: if node v_i links to node v_j, reciprocity measures how often v_j links back to v_i. For a directed graph with adjacency matrix A, reciprocity can be computed as

R = (2/|E|) Σ_{i<j} A_{i,j} A_{j,i}
  = (2/|E|) · (1/2) Tr(A²)
  = (1/|E|) Tr(A²)
  = (1/m) Tr(A²),    (3.53)

where Tr(A) = A_{1,1} + A_{2,2} + · · · + A_{n,n} = Σ_{i=1}^{n} A_{i,i} and m is the number of edges in the network. Note that the maximum value of Σ_{i,j} A_{i,j} A_{j,i} is m, attained when all directed edges are reciprocated.
Example 3.13. Consider the directed graph shown in Figure 3.11, which has m = 4 directed edges, of which one pair is reciprocated. Its reciprocity is

R = (1/m) Tr(A²) = 2/4 = 1/2.    (3.55)
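A quick NumPy sketch of Equation 3.53 (the three-node directed graph below is our own stand-in with the same edge counts, not the graph of Figure 3.11):

import numpy as np

# A directed 3-node graph with 4 edges, one pair of them reciprocated (illustrative)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [1, 0, 0]])

m = A.sum()                      # number of directed edges
R = np.trace(A @ A) / m          # Equation 3.53: (1/m) Tr(A^2)
print(R)                         # 0.5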
3.3
Balance and Status

In signed graphs, where edges carry positive (friend) or negative (foe) signs, we observe triangles with three positive edges (i.e., all friends) more frequently than ones with two positive edges and one negative edge (i.e., a friend's friend is an enemy). Assume we observe a signed graph that represents friends/foes or social status. Can we measure the consistency of the attitudes that individuals have toward one another?
To measure consistency in an individual's attitude, one needs to utilize theories from social sciences to define what a consistent attitude is. In this section, we discuss two theories, social balance and social status, that can help determine consistency in observed signed networks. Social balance theory is used when edges represent friends/foes, and social status theory is employed when they represent status.
Social Balance Theory
This theory, also known as structural balance theory, discusses consistency in friend/foe relationships among individuals. Informally, social balance theory says friend/foe relationships are consistent when
The friend of my friend is my friend,
The friend of my enemy is my enemy,
The enemy of my enemy is my friend,
The enemy of my friend is my enemy.
We demonstrate a graph representation of social balance theory in Figure 3.12. In this figure, positive edges represent friendships and negative ones represent enmity. Triangles that are consistent based on this theory are denoted as balanced, and triangles that are inconsistent are denoted as unbalanced. Let w_{ij} denote the value of the edge between nodes v_i and v_j. Then, a triangle of nodes v_i, v_j, and v_k is consistent based on social balance theory, that is, it is balanced, if and only if

w_{ij} · w_{jk} · w_{ki} ≥ 0.    (3.56)
Figure 3.12: Sample Graphs for Social Balance Theory. In balanced triangles, there are an even number of negative edges.
Social Status Theory
Social status theory measures how consistent individuals are in assigning
status to their neighbors. It can be summarized as follows:
If X has a higher status than Y and Y has a higher status than Z, then
X should have a higher status than Z.
We show this theory using two graphs in Figure 3.13. In this figure,
nodes represent individuals. Positive and negative signs show higher or
lower status depending on the arrow direction. A directed positive edge
from node X to node Y shows that Y has a higher status than X, and a
negative one shows the reverse. In the figure on the left, v2 has a higher
status than v1 and v3 has a higher status than v2 , so based on status theory,
v3 should have a higher status than v1 ; however, we see that v1 has a
higher status in our configuration.4 Based on social status theory, this is
implausible, and thus this configuration is unbalanced. The graph on the
right shows a balanced configuration with respect to social status theory.
⁴Here, we start from v_1 and follow the edges. One can start from a different node, and the result should remain the same.
Figure 3.13: Sample Graphs for Social Status Theory. The left-hand graph
is an unbalanced configuration, and the right-hand graph is a balanced
configuration.
In the example provided in Figure 3.13, social status is defined for the
most general example: a set of three connected nodes (a triad). However,
social status can be generalized to other graphs. For instance, in a cycle of
n nodes, where n − 1 consecutive edges are positive and the last edge is
negative, social status theory considers the cycle balanced.
Note that the identical configuration can be considered balanced by
social balance theory and unbalanced based on social status theory (see
Exercises).
3.4
Similarity

In this section, we review measures that compute the similarity between two nodes in a network.

3.4.1
Structural Equivalence

In structural equivalence, two nodes are considered similar when they share many of the same neighbors. The simplest such measure counts the common neighbors of v_i and v_j,

σ(v_i, v_j) = |N(v_i) ∩ N(v_j)|,    (3.57)

where N(v_i) denotes the set of v_i's neighbors.
For large networks, this value can increase rapidly, because nodes may share many neighbors. Generally, similarity is given a value that is bounded, usually in the range [0, 1]. Various normalization procedures can be used, such as the Jaccard similarity or the cosine similarity:

σ_Jaccard(v_i, v_j) = |N(v_i) ∩ N(v_j)| / |N(v_i) ∪ N(v_j)|,    (3.58)

σ_Cosine(v_i, v_j) = |N(v_i) ∩ N(v_j)| / sqrt(|N(v_i)| |N(v_j)|).    (3.59)
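A small sketch of Equations 3.58 and 3.59 over neighbor sets in plain Python (the neighbor sets below are illustrative):

import math

def jaccard(nbrs_i, nbrs_j):
    """|N(vi) intersect N(vj)| / |N(vi) union N(vj)|  (Equation 3.58)."""
    return len(nbrs_i & nbrs_j) / len(nbrs_i | nbrs_j)

def cosine(nbrs_i, nbrs_j):
    """|N(vi) intersect N(vj)| / sqrt(|N(vi)| * |N(vj)|)  (Equation 3.59)."""
    return len(nbrs_i & nbrs_j) / math.sqrt(len(nbrs_i) * len(nbrs_j))

# Illustrative neighbor sets
N_v2, N_v5 = {1, 3, 4}, {3, 4, 6}
print(jaccard(N_v2, N_v5), cosine(N_v2, N_v5))   # 0.5, 0.666...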
The observed number of common neighbors can also be compared with its expected value when connections are formed at random. For nodes of degrees d_i and d_j, the expected number of common neighbors under random connection is

d_i d_j / n.    (3.62)

Subtracting this expectation from the observed number of common neighbors measures how significant the overlap is:

σ_significance(v_i, v_j) = Σ_k A_{i,k} A_{j,k} − d_i d_j / n
  = Σ_k (A_{i,k} A_{j,k} − Ā_i Ā_j)
  = Σ_k (A_{i,k} A_{j,k} − A_{i,k} Ā_j − Ā_i A_{j,k} + Ā_i Ā_j)
  = Σ_k (A_{i,k} − Ā_i)(A_{j,k} − Ā_j),    (3.63)

where Ā_i = (1/n) Σ_k A_{i,k}. The term Σ_k (A_{i,k} − Ā_i)(A_{j,k} − Ā_j) is basically the covariance between A_i and A_j. The covariance can be normalized by the multiplication of the variances,
σ_pearson(v_i, v_j) = σ_significance(v_i, v_j) / ( sqrt(Σ_k (A_{i,k} − Ā_i)²) · sqrt(Σ_k (A_{j,k} − Ā_j)²) )
  = Σ_k (A_{i,k} − Ā_i)(A_{j,k} − Ā_j) / ( sqrt(Σ_k (A_{i,k} − Ā_i)²) · sqrt(Σ_k (A_{j,k} − Ā_j)²) ),    (3.64)
which is called the Pearson correlation coefficient. Its value, unlike the other two measures, is in the range [−1, 1]. A positive correlation value denotes
that when vi befriends an individual vk , v j is also likely to befriend vk .
A negative value denotes the opposite (i.e., when vi befriends vk , it is
unlikely for v j to befriend vk ). A zero value denotes that there is no linear
relationship between the befriending behavior of vi and v j .
3.4.2
Regular Equivalence

In regular equivalence, instead of requiring two nodes to share neighbors, we consider them similar when their neighbors are similar (see Figure 3.15). One way of formalizing this is

σ_regular(v_i, v_j) = α Σ_{k,l} A_{i,k} A_{j,l} σ_regular(v_k, v_l).    (3.65)

Figure 3.15: Regular Equivalence. Solid lines denote edges, and dashed lines denote similarities between nodes. In regular equivalence, similarity between nodes v_i and v_j is replaced by similarity between (a) their neighbors v_k and v_l or between (b) neighbor v_k and node v_j.
Unfortunately, this formulation is self-referential: solving for i and j requires solving for k and l, solving for k and l requires solving for their neighbors, and so on. So, we relax this formulation and assume that node v_i is similar to node v_j when v_j is similar to v_i's neighbors v_k. This is shown in Figure 3.15(b). Formally,

σ_regular(v_i, v_j) = α Σ_k A_{i,k} σ_regular(v_k, v_j).    (3.66)
In vector format, this can be written as

σ_regular = αA σ_regular.    (3.67)

A node is highly similar to itself. To make sure that our formulation guarantees this, we can add an identity matrix to this vector format. Adding an identity matrix adds 1 to all diagonal entries, which represent the self-similarities σ_regular(v_i, v_i):

σ_regular = αA σ_regular + I.    (3.68)

By rearranging, we get

σ_regular = (I − αA)^{−1},    (3.69)

which we can use to find the regular equivalence similarity.
Note the similarity between Equation 3.69 and that of Katz centrality (Equation 3.21). As with Katz centrality, we must be careful how we choose α for convergence. A common practice is to select an α such that α < 1/λ, where λ is the largest eigenvalue of A.
Example 3.15. For the graph depicted in Figure 3.14, the adjacency matrix is

A = [ 0 1 1 0 0 0
      1 0 1 1 0 0
      1 1 0 0 1 0
      0 1 0 0 0 1
      0 0 1 0 0 1
      0 0 0 1 1 0 ].    (3.70)

The largest eigenvalue of A is 2.43. We set α = 0.4 < 1/2.43 and compute (I − 0.4A)^{−1}, which is the similarity matrix,

σ_regular = (I − 0.4A)^{−1} =
[ 1.43 0.73 0.73 0.26 0.26 0.16
  0.73 1.63 0.80 0.56 0.32 0.26
  0.73 0.80 1.63 0.32 0.56 0.26
  0.26 0.56 0.32 1.31 0.23 0.46
  0.26 0.32 0.56 0.23 1.31 0.46
  0.16 0.26 0.26 0.46 0.46 1.27 ].    (3.71)
Any row or column of this matrix shows the similarity of a node to other nodes.
We can see that node v1 is the most similar (other than itself) to nodes v2 and v3 .
Furthermore, nodes v2 and v3 have the highest similarity in this graph.
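The computation in Example 3.15 amounts to one matrix inversion. A NumPy sketch follows (the adjacency matrix is the one reconstructed above, with α = 0.4 as in the example):

import numpy as np

A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 1, 0, 0],
              [1, 1, 0, 0, 1, 0],
              [0, 1, 0, 0, 0, 1],
              [0, 0, 1, 0, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

alpha = 0.4                                        # must satisfy alpha < 1/lambda_max
assert alpha < 1 / np.max(np.linalg.eigvalsh(A))

# Equation 3.69: sigma_regular = (I - alpha * A)^{-1}
sigma = np.linalg.inv(np.eye(len(A)) - alpha * A)
print(np.round(sigma, 2))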
3.5
Summary
In this chapter, we discussed measures for a social media network. Centrality measures attempt to find the most central node within a graph.
Degree centrality assumes that the node with the maximum degree is the
most central individual. In directed graphs, prestige and gregariousness
are variants of degree centrality. Eigenvector centrality generalizes degree
centrality and considers individuals who know many important nodes
as central. Based on the Perron-Frobenius theorem, eigenvector centrality is determined by computing the eigenvector of the adjacency matrix.
Katz centrality solves some of the problems with eigenvector centrality
in directed graphs by adding a bias term. PageRank centrality defines
a normalized version of Katz centrality. The Google search engine uses
PageRank as a measure to rank webpages. Betweenness centrality assumes
that central nodes act as hubs connecting other nodes, and closeness centrality implements the intuition that central nodes are close to all other
nodes. Node centrality measures can be generalized to a group of nodes
using group degree centrality, group betweenness centrality, and group
closeness centrality.
Linking between nodes (e.g., befriending in social media) is the most
commonly observed phenomenon in social media. Linking behavior is
analyzed in terms of its transitivity and its reciprocity. Transitivity is
when a friend of my friend is my friend. The transitivity of linking
behavior is analyzed by means of the clustering coefficient. The global
clustering coefficient analyzes transitivity within a network, and the local
clustering coefficient performs that for a node. Transitivity is commonly
considered for closed triads of edges. For loops of length 2, the problem is
simplified and is called reciprocity. In other words, reciprocity is when if
you become my friend, Ill be yours.
To analyze if relationships are consistent in social media, we used various social theories to validate outcomes. Social balance and social status
are two such theories.
Finally, we analyzed node similarity measures. In structural equivalence, two nodes are considered similar when they share neighborhoods.
We discussed cosine similarity and Jaccard similarity in structural equivalence. In regular equivalence, nodes are similar when their neighborhoods
are similar.
3.6
Bibliographic Notes
3.7
Exercises
Centrality
1. Come up with an example of a directed connected graph in which
eigenvector centrality becomes zero for some nodes. Describe when
this happens.
2. Does α have any effect on the order of centralities? In other words, if for one value of α the centrality value of node v_i is greater than that of v_j, is it possible to change α in a way such that v_j's centrality becomes larger than that of v_i's?
3. In PageRank, what α values can we select to guarantee that centrality values are calculated correctly (i.e., values do not diverge)?
4. Calculate PageRank values for this graph when
α = 1, β = 0
α = 0.85, β = 1
α = 0, β = 1
Discuss the effects of different values of α and β for this particular problem.
5. Consider a full n-tree. This is a tree in which every node other than
the leaves has n children. Calculate the betweenness centrality for
the root node, internal nodes, and leaves.
Similarity
10. In Figure 3.6,
Compute node similarity using Jaccard and cosine similarity for
nodes v5 and v4 .
Find the most similar node to v7 using regular equivalence.
Chapter 4
Network Models
In May 2011, Facebook had 721 million users, represented by a graph of 721
million nodes. A Facebook user at the time had an average of 190 friends;
that is, all Facebook users, taken into account, had a total of 68.5 billion
friendships (i.e., edges). What are the principal underlying processes that
help initiate these friendships? More importantly, how can these seemingly
independent friendships form this complex friendship network?
In social media, many social networks contain millions of nodes and
billions of edges. These complex networks have billions of friendships,
the reasons for existence of most of which are obscure. Humbled by the
complexity of these networks and the difficulty of independently analyzing
each one of these friendships, we can design models that generate, on a
smaller scale, graphs similar to real-world networks. On the assumption
that these models simulate properties observed in real-world networks
well, the analysis of real-world networks boils down to a cost-efficient
measuring of different properties of simulated networks. In addition,
these models
allow for a better understanding of phenomena observed in realworld networks by providing concrete mathematical explanations
and
allow for controlled experiments on synthetic networks when realworld networks are not available.
We discuss three principal network models in this chapter: the random
graph model, the small-world model, and the preferential attachment model.
4.1
Properties of Real-World Networks
Real-world networks share common characteristics. When designing network models, we aim to devise models that can accurately describe these
networks by mimicking these common characteristics. To determine these
characteristics, a common practice is to identify their attributes and show
that measurements for these attributes are consistent across networks. In
particular, three network attributes exhibit consistent measurements across
real-world networks: degree distribution, clustering coefficient, and average
path length. As we recall, degree distribution denotes how node degrees
are distributed across a network. The clustering coefficient measures transitivity of a network. Finally, average path length denotes the average
distance (shortest path length) between pairs of nodes. We discuss how
these three attributes behave in real-world networks next.
4.1.1
Degree Distribution
Consider the distribution of wealth among individuals. Most individuals have an average amount of capital, whereas a few are considered
extremely wealthy. In fact, we observe exponentially more individuals
with an average amount of capital than wealthier ones. Similarly, consider
the population of cities. A few metropolitan areas are densely populated,
whereas other cities have an average population size. In social media, we
observe the same phenomenon regularly when measuring popularity or
interestingness for entities. For instance,
Many sites are visited less than a thousand times a month, whereas
a few are visited more than a million times daily.
Most social media users are active on a few sites, whereas a few
individuals are active on hundreds of sites.
There are exponentially more modestly priced products for sale compared to expensive ones.
This pattern is described by a power-law degree distribution: when k denotes the degree of a node and p_k the fraction of nodes with degree k,

p_k = a k^{−b},    (4.1)

where b is the power-law exponent and a is a normalization constant. Taking the logarithm of both sides,

ln p_k = ln a − b ln k,    (4.2)

so a power law appears as a straight line on a log-log plot.
Table 4.1: Clustering Coefficient in Real-World Networks
Facebook      0.14 (with 100 friends)
Flickr        0.31
LiveJournal   0.33
Orkut         0.17
YouTube       0.13
4.1.2
Clustering Coefficient
4.1.3
Average Path Length
Table 4.2: Average Path Length in Real-World Networks (from [46, 284, 198])
Web           16.12
Facebook      4.7
Flickr        5.67
LiveJournal   5.88
Orkut         4.25
YouTube       5.10
4.2
Random Graphs
We start with the most basic assumption on how friendships can be formed:
Edges (i.e., friendships) between nodes (i.e., individuals) are formed
randomly.
There are two such models. In the G(n, p) model, a graph with n nodes is constructed by connecting each pair of nodes independently with probability p. In the second model, the graph is selected uniformly at random from the set Ω of all graphs with n nodes and m edges; the number of such graphs is

|Ω| = ( (n choose 2) choose m ).    (4.3)

The uniform random graph selection probability is 1/|Ω|. One can think of the probability of uniformly selecting a graph as an analog to p, the probability of selecting an edge in G(n, p).
The second model was introduced by Paul Erdos and Alfred Renyi [83]
and is denoted as the G(n, m) model. In the limit, both models act similarly. The expected number of edges in G(n, p) is (n choose 2) p. Now, if we set (n choose 2) p = m, then in the limit both models act the same because they contain the same number of edges. Note that the G(n, m) model contains a fixed number of edges, whereas a graph generated by G(n, p) can contain anywhere from zero to all possible edges.
Mathematically, the G(n, p) model is almost always simpler to analyze;
hence the rest of this section deals with properties of this model. Note
that there exist many graphs with n nodes and m edges (i.e., generated by
G(n, m)). The same argument holds for G(n, p), and many graphs can be
generated by the model. Therefore, when measuring properties in random
graphs, the measures are calculated over all graphs that can be generated
by the model and then averaged. This is particularly useful when we are
interested in the average, and not specific, behavior of large graphs.
In G(n, p), the number of edges is not fixed; therefore, we first examine
some mathematical properties regarding the expected number of edges
that are connected to a node, the expected number of edges observed in
the graph, and the likelihood of observing m edges in a random graph
generated by the G(n, p) process.
Proposition 4.1. The expected degree of a node in a graph generated by the G(n, p) model is

c = (n − 1) p,    (4.4)

or equivalently,

p = c / (n − 1).    (4.5)

Proof. A node can be connected to at most n − 1 other nodes, and each of these edges is formed independently with probability p; hence its expected degree is (n − 1)p.

Proposition 4.2. The expected number of edges in a graph generated by the G(n, p) model is (n choose 2) p.
Proof. Following the same line of argument, because edges are selected independently and we have a maximum of (n choose 2) edges, the expected number of edges is (n choose 2) p.
Proposition 4.3. In a graph generated by the G(n, p) model, the probability of observing m edges is

P(|E| = m) = ( (n choose 2) choose m ) p^m (1 − p)^{(n choose 2) − m},    (4.6)

which is a binomial distribution.

Proof. The m edges are selected from the (n choose 2) possible edges. These edges are formed with probability p^m, and the other edges are not formed (to guarantee the existence of only m edges) with probability (1 − p)^{(n choose 2) − m}.
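A minimal sketch of generating a G(n, p) graph and checking Proposition 4.2 empirically (plain Python; n and c are illustrative):

import random

def gnp_random_graph(n, p, seed=None):
    """G(n, p): each of the C(n, 2) possible edges appears independently with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]

n, c = 1000, 4
p = c / (n - 1)                            # Equation 4.5: expected degree c
edges = gnp_random_graph(n, p, seed=0)
print(len(edges), n * (n - 1) / 2 * p)     # observed vs. expected number of edges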
Given these basic propositions, we next analyze how random graphs
evolve as we add edges to them.
4.2.1
Giant Component
In random graphs, when nodes form connections, after some time a large
fraction of nodes get connected (i.e., there is a path between any pair of
them). This large fraction forms a connected component, commonly called
the largest connected component or the giant component. We can tune the
behavior of the random graph model by selecting the appropriate p value.
In G(n, p), when p = 0, the size of the largest connected component is 0 (no
two pairs are connected), and when p = 1, the size is n (all pairs are connected). Table 4.3 provides the size of the largest connected component (slc
values in the table) for random graphs with 10 nodes and different p values.
The table also provides information on the average degree c, the diameter
size ds, the size of the largest component slc, and the average path length l
of the random graph.
As shown in Table 4.3, as p gets larger, the graph gets denser and the largest connected component grows.
Table 4.3: Random Graphs Generated for Different p Values (n = 10)
p       c     ds   slc    l
0.0     0.0    0     0    0.0
0.055   0.8    2     4    1.5
0.11    1.0    6     7    2.66
1.0     9.0    1    10    1.0
4.2.2
Degree Distribution
When computing the degree distribution, we estimate the probability P(d_v = d) of observing degree d for node v.

Proposition 4.5. For a graph generated by G(n, p), node v has degree d, d ≤ n − 1, with probability

P(d_v = d) = ( (n−1) choose d ) p^d (1 − p)^{n−1−d},    (4.8)

which is again a binomial distribution.

Proof. The proof is left to the reader.
In the limit, as n grows large while the expected degree c stays fixed, p becomes small. In this case, using Equation 4.4 and the fact that lim_{x→0} ln(1 + x) = x, we can compute the limit of each term of Equation 4.8:

lim_{n→∞} (1 − p)^{n−1−d} = e^{−c}.    (4.9)

We also have

lim_{n→∞} ( (n−1) choose d ) = lim_{n→∞} (n−1)! / ((n−1−d)! d!)
  = lim_{n→∞} ((n−1)(n−2) · · · (n−d)) (n−1−d)! / ((n−1−d)! d!)
  = lim_{n→∞} ((n−1)(n−2) · · · (n−d)) / d!
  ≈ (n−1)^d / d!.    (4.10)

We can compute the degree distribution of random graphs in the limit by substituting Equations 4.10, 4.9, and 4.4 into Equation 4.8,

lim_{n→∞} P(d_v = d) = lim_{n→∞} ( (n−1) choose d ) p^d (1 − p)^{n−1−d}
  = ((n−1)^d / d!) (c/(n−1))^d e^{−c} = c^d e^{−c} / d!,    (4.11)

which is basically the Poisson distribution with mean c. Thus, in the limit, random graphs generate a Poisson degree distribution, which differs from the power-law degree distribution observed in real-world networks.
Clustering Coefficient

Proposition 4.6. In a random graph generated by G(n, p), the expected local clustering coefficient for node v is p.

Proof. The local clustering coefficient for node v is

C(v) = (number of connected pairs of v's neighbors) / (number of pairs of v's neighbors).    (4.12)

However, v can have different degrees depending on the edges that are formed randomly. Thus, we compute the expected value of C(v),

E(C(v)) = Σ_{d=0}^{n−1} E(C(v) | d_v = d) · P(d_v = d).    (4.13)

The first term is basically the local clustering coefficient of a node given its degree. For a random graph,

E(C(v) | d_v = d) = (number of connected pairs of v's d neighbors) / (number of pairs of v's d neighbors)
  = ( (d choose 2) p ) / (d choose 2) = p.    (4.14)

Therefore,

E(C(v)) = Σ_{d=0}^{n−1} p · P(d_v = d) = p,    (4.15)

where we have used the fact that all probability distributions sum up to 1.
Proposition 4.7. The global clustering coefficient of a random graph generated
by G(n, p) is p.
Proof. The global clustering coefficient of a graph defines the probability
of two neighbors of the same node being connected. In random graphs,
for any two nodes, this probability is the same and is equal to the generation probability p that determines the probability of two nodes getting
connected. Note that in random graphs, the expected local clustering
coefficient is equivalent to the global clustering coefficient.
In random graphs, the clustering coefficient is equal to the probability
p; therefore, by appropriately selecting p, we can generate networks with
a high clustering coefficient. Note that selecting a large p is undesirable
because doing so will generate a very dense graph, which is unrealistic, as
in the real-world, networks are often sparse. Thus, random graphs are considered generally incapable of generating networks with high clustering
coefficients without compromising other required properties.
Average Path Length

In a random graph, the average path length l is

l ≈ ln |V| / ln c.    (4.16)

To see this, note that a node can reach on the order of c^l other nodes within l steps. In random graphs, the expected diameter size D tends to the average path length l in the limit. This we provide without proof; interested readers can refer to the bibliographic notes for pointers to concrete proofs. Using this fact, we have

c^D ≈ c^l ≈ |V|.    (4.18)

Taking the logarithm of both sides, we get l ≈ ln |V| / ln c. Therefore, the average path length in a random graph is equal to ln |V| / ln c.
4.2.3
Modeling Real-World Networks with Random Graphs

Table 4.4 compares real-world networks (Film Actors, Medline Coauthorship, E.Coli, C.Elegans) with random graphs generated using the same size and average degree. For example, for E.Coli (size 282, average degree 7.35), the real clustering coefficient and average path length are 0.32 and 2.9, whereas the simulated random graph yields approximately 0.026 and 3.04; for C.Elegans (size 282, average degree 14), the real values are 0.28 and 2.65 versus approximately 0.05 and 2.25 for the simulated graph. Random graphs thus reproduce the short average path lengths of real-world networks but fail to reproduce their high clustering coefficients.
4.3
Small-World Model
In the small-world model, we start from a regular ring lattice in which each node is connected to its c nearest neighbors, and then rewire each edge, with probability β, to a uniformly chosen random node. The parameter β controls how far the generated graph is from a regular lattice (β = 0) or a random graph (β = 1).
4.3.1
Degree Distribution
The degree distribution for the small-world model is as follows:

P(d_v = d) = Σ_{n=0}^{min(d−c/2, c/2)} ( (c/2) choose n ) (1−β)^n β^{c/2−n} ( (βc/2)^{d−c/2−n} / (d−c/2−n)! ) e^{−βc/2},    (4.21)
where P(dv = d) is the probability of observing degree d for node v. We provide this equation without proof due to techniques beyond the scope of this
book (see Bibliographic Notes). Note that the degree distribution is quite
similar to the Poisson degree distribution observed in random graphs (Section 4.2.2). In practice, in the graph generated by the small-world model,
most nodes have similar degrees due to the underlying lattice. In contrast,
in real-world networks, degrees are distributed based on a power-law
distribution, where most nodes have small degrees and a few have large
degrees.
Clustering Coefficient

The clustering coefficient for a regular lattice is 3(c − 2)/(4(c − 1)), and for the random graph model it is p = c/(n − 1). The clustering coefficient for a small-world network lies between these two values, depending on β. Commonly, the clustering coefficient of the regular lattice is denoted C(0), and the clustering coefficient of a small-world model with β = p is denoted C(p). The relation between the two values can be computed analytically; it has been proven that

C(p) ≈ (1 − p)³ C(0).    (4.22)

The intuition behind this relation is that because the clustering coefficient enumerates the number of closed triads in a graph, we are interested in triads that are still connected after the rewiring process. For a triad to stay connected, none of its three edges may be rewired; since each edge is rewired independently with probability p, this happens with probability (1 − p)³. Note that we also need to take into account new triads that are formed by the rewiring process; however, that probability is small and hence negligible. The graph in Figure 4.5 depicts the ratio C(p)/C(0) for different values of p.
As shown in the figure, the value of C(p) stays high until p reaches 0.1 (10% of edges rewired) and then decreases rapidly to a value around zero. Since a high clustering coefficient is desired in generated graphs, β ≤ 0.1 is preferred.
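A compact sketch of the construction described above (plain Python; n, c, and beta are illustrative, and c is assumed even):

import random

def small_world_graph(n, c, beta, seed=None):
    """Ring lattice where each node links to its c nearest neighbors, then each edge is rewired w.p. beta."""
    rng = random.Random(seed)
    edges = set()
    for i in range(n):
        for k in range(1, c // 2 + 1):           # c/2 neighbors on each side of the ring
            edges.add(frozenset((i, (i + k) % n)))
    for e in list(edges):
        if rng.random() < beta:
            u = min(e)                            # keep one endpoint, rewire the other
            w = rng.randrange(n)
            new_edge = frozenset((u, w))
            if w != u and new_edge not in edges:
                edges.remove(e)
                edges.add(new_edge)
    return edges

graph = small_world_graph(n=100, c=4, beta=0.05, seed=1)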
Average Path Length

The same procedure can be applied to the average path length. The average path length in a regular lattice is

n / (2c).    (4.23)
4.3.2
Modeling Real-World Networks with the Small-World Model

Table 4.5 compares real-world networks with graphs simulated using the small-world model. For Film Actors (real clustering coefficient 0.79), the simulated graph has clustering coefficient 0.73 and average path length 4.2; for Medline Coauthorship (real C = 0.56), the simulated values are C = 0.52 and path length 5.1; for E.Coli (size 282, average degree 7.35, real C = 0.32), they are C = 0.31 and path length 4.46; and for C.Elegans (size 282, average degree 14, real C = 0.28), they are C = 0.37 and path length 3.49. The small-world model thus matches both the high clustering coefficients and the short average path lengths of real-world networks.
4.4
Preferential Attachment

Many real-world networks have power-law degree distributions; such networks are called scale-free networks. There exist a variety of scale-free network-modeling algorithms. A well-established one is the model proposed by Barabasi and Albert [24]. The model is called preferential attachment or sometimes the Barabasi-Albert (BA) model and is as follows:
When new nodes are added to networks, they are more likely to connect
to existing nodes that many others have connected to.
This connection likelihood is proportional to the degree of the node that the new node is aiming to connect to. In other words, a rich-get-richer phenomenon or aristocrat network is observed: the higher the node's degree, the higher the probability of new nodes getting connected to it. Unlike random graphs, in which we assume friendships are formed randomly, in the preferential attachment model we assume that individuals are more likely to befriend gregarious others. The model's algorithm is provided in Algorithm 4.2.
The algorithm starts with a graph containing a small set of m_0 nodes and then adds new nodes one at a time. Each new node gets to connect to m ≤ m_0 existing nodes, and each connection to an existing node v_i depends on the degree of v_i (i.e., P(v_i) = d_i / Σ_j d_j). Intrinsically, higher-degree nodes get more attention from newly added nodes. Note that the initial m_0 nodes must each have degree at least 1 for the probability P(v_i) = d_i / Σ_j d_j to be nonzero.
The model incorporates two ingredients, (1) the growth element and (2) the preferential attachment element, to achieve a scale-free network. The growth is realized by adding nodes as time goes by. The preferential attachment is realized by connecting to node v_i based on its degree probability, P(v_i) = d_i / Σ_j d_j. Removing either of these ingredients generates networks that are not scale-free (see Exercises). Next, we show that preferential attachment models are capable of generating networks with a power-law degree distribution. They are also capable of generating small average path lengths, but unfortunately fail to generate the high clustering coefficients observed in real-world networks.
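A minimal sketch of the BA process (plain Python; the usual implementation trick of drawing targets from a list that contains each node once per unit of degree makes the sampling degree-proportional):

import random

def barabasi_albert_graph(n, m, seed=None):
    """Grow a graph to n nodes; each new node attaches to about m existing nodes w.p. proportional to degree."""
    rng = random.Random(seed)
    edges = []
    targets = list(range(m))           # start from m nodes; the first new node links to all of them
    endpoint_pool = []                 # node i appears here once per unit of degree
    for new_node in range(m, n):
        for t in set(targets):         # deduplicate repeated draws
            edges.append((new_node, t))
            endpoint_pool += [new_node, t]
        targets = [rng.choice(endpoint_pool) for _ in range(m)]   # preferential sampling
    return edges

g = barabasi_albert_graph(n=1000, m=3, seed=0)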
4.4.1
Degree Distribution
We first demonstrate that the preferential attachment model generates scale-free networks and can therefore model real-world networks. Empirical evidence found by simulating the preferential attachment model suggests that this model generates a scale-free network with exponent b = 2.9 ± 0.1 [24]. Theoretically, a mean-field [213] proof can be provided as follows.
Let d_i denote the degree of node v_i. The probability of an edge connecting from a new node to v_i is

P(v_i) = d_i / Σ_j d_j.    (4.24)
In the mean-field approach, d_i changes continuously at the rate at which new edges attach to v_i. Each new node brings m edges, and after t steps the network contains roughly 2mt edge endpoints, so

∂d_i/∂t = m · d_i / Σ_j d_j = m · d_i / (2mt) = d_i / (2t).    (4.25)

Solving this differential equation with the initial condition d_i(t_i) = m gives

d_i(t) = m (t / t_i)^{0.5}.    (4.26)

Here, t_i represents the time at which v_i was added to the network, and because we set the expected degree to m in preferential attachment, d_i(t_i) = m. The probability that d_i is less than d is

P(d_i(t) < d) = P(t_i > m² t / d²).    (4.27)

Since nodes are added at a uniform rate, the probability that a node arrived later than time m²t/d² is

P(t_i > m² t / d²) = 1 − (m² t) / (d² (t + m_0)).    (4.28)

The factor 1/(t + m_0) is the probability associated with a single time step because, at the end of the simulation, t + m_0 nodes are in the network. The probability density for P(d) is

P(d) = ∂P(d_i(t) < d) / ∂d,    (4.29)

which yields

P(d) = (2m² t) / (d³ (t + m_0)) ≈ 2m² / d³    (4.30)

for large t. Hence, in the limit, the preferential attachment model generates a power-law degree distribution with exponent b = 3.
Clustering Coefficient

In general, not many triangles are formed by the Barabasi-Albert model, because edges are created independently and one at a time. Again, using a mean-field analysis, the expected clustering coefficient can be calculated as

C = ((m_0 − 1) / 8) · (ln t)² / t,    (4.31)
where t is the time passed in the system during the simulation. We avoid
the details of this calculation due to techniques beyond the scope of this
book. Unfortunately, as time passes, the clustering coefficient gets smaller
and fails to model the high clustering coefficient observed in real-world
networks.
Average Path Length
The average path length of the preferential attachment model increases
logarithmically with the number of nodes present in the network:
l ≈ ln |V| / ln(ln |V|).    (4.32)
4.4.2
Modeling Real-World Networks with Preferential Attachment

As with random graphs, we can simulate real-world networks with the preferential attachment model by setting the expected degree m (see Algorithm 4.2). Table 4.6 demonstrates the simulation results for various real-world networks. The preferential attachment model generates a realistic degree distribution and, as observed in the table, small average path lengths; however, the generated networks fail to exhibit the high clustering coefficients observed in real-world networks.
Table 4.6 compares real-world networks with graphs simulated using preferential attachment. For Film Actors (real clustering coefficient 0.79), the simulated graph has C = 0.005 and average path length 4.90; for Medline Coauthorship (real C = 0.56), C = 0.0002 and path length 5.36; for E.Coli (size 282, average degree 7.35, real C = 0.32), C = 0.03 and path length 2.37; and for C.Elegans (size 282, average degree 14, real C = 0.28), C = 0.05 and path length 1.99.
4.5
Summary
In this chapter, we discussed three well-established models that generate networks with commonly observed characteristics of real-world networks: random graphs, the small-world model, and preferential attachment. Random graphs assume that connections are completely random.
We discussed two variants of random graphs: G(n, p) and G(n, m). Random
graphs exhibit a Poisson degree distribution, a small clustering coefficient p, and a realistic average path length ln |V| / ln c.
The small-world model assumes that individuals have a fixed number
of connections in addition to random connections. This model generates
networks with high transitivity and short path lengths, both commonly
observed in real-world networks. Small-world models are created through
a process where a parameter β controls how edges are randomly rewired from an initial regular ring lattice. The clustering coefficient of the model is approximately (1 − p)³ times the clustering coefficient of a regular lattice. No analytical solution for the average path length with respect to a regular ring lattice has been found. Empirically, when between 1% and 10% of edges are rewired (0.01 ≤ β ≤ 0.1), the model resembles many real-world networks. Unfortunately, the small-world model generates a degree distribution similar to the Poisson degree distribution observed in random graphs.
Finally, the preferential attachment model assumes that friendship formation likelihood depends on the number of friends individuals have. The
model generates a scale-free network; that is, a network with a power-law
degree distribution. When k denotes the degree of a node, and pk the fraction of nodes having degree k, then in a power-law degree distribution,
p_k = a k^{−b}.    (4.33)
4.6
Bibliographic Notes
General reviews of the topics in this chapter can be found in [213, 212, 28,
134].
Initial random graph papers can be found in the works of Paul Erdos
and Alfred Renyi [83, 84, 85] as well as Edgar Gilbert [100] and Solomonoff
and Rapoport [262]. As a general reference, readers can refer to [41, 217,
210]. Random graphs described in this chapter did not have any specific
degree distribution; however, random graphs can be generated with a
specific degree distribution. For more on this refer to [212, 216].
Small-worlds were first noticed in a short story by Hungarian writer F.
Karinthy in 1929. Works of Milgram in 1969 and Kochen and Pool in 1978
treated the subject more systematically. Milgram designed an experiment
in which he asked random participants in Omaha, Nebraska, or Wichita,
Kansas, to help send letters to a target person in Boston. Individuals were
only allowed to send the letter directly to the target person if they knew
the person on a first-name basis. Otherwise, they had to forward it to
someone who was more likely to know the target. The results showed that
the letters were on average forwarded 5.5 to 6 times until they reached the
target in Boston. Other recent research on small-world model dynamics
can be found in [295, 296].
Price [1965, 1976] was among the first who described power laws
observed in citation networks and models capable of generating them.
Power-law distributions are commonly found in social networks and the
web [87, 198]. The first developers of preferential attachment models were
Yule [308], who described these models for generating power-law distributions in plants, and Herbert A. Simon [260], who developed these models
for describing power laws observed in various phenomena: distribution
of words in prose, scientists by citations, and cities by population, among
others. Simon used what is known as the master equation to prove that
preferential attachment models generate power-law degree distributions.
A more rigorous proof for estimating the power-law exponent of the preferential attachment model using the master equation method can be found
in [212]. The preferential attachment model introduced in this chapter has
a fixed exponent b = 3, but, as mentioned, real-world networks have exponents in the range [2, 3]. To solve this issue, extensions have been proposed
in [155, 9].
4.7
Exercises
Random Graphs
2. Assuming that we are interested in a sparse random graph, what
should we choose as our p value?
3. Construct a random graph as follows. Start with n nodes and a
given k. Generate all the possible combinations of k nodes. For
Small-World Model
5. Show that in a regular lattice the number of connections between neighbors is given by (3/8) c(c − 2), where c is the average degree.
6. Show how the clustering coefficient can be computed in a regular
lattice of degree k.
Chapter 5
Data Mining Essentials
In social media mining, the raw data is the content generated by individuals, and the knowledge encompasses the interesting patterns observed
in this data. For example, for an online book seller, the raw data is the list
of books individuals buy, and an interesting pattern could describe books
that individuals often buy.
To analyze social media, we can either collect this raw data or use
available repositories that host collected data from social media sites.1
When collecting data, we can either use APIs provided by social media
sites for data collection or scrape the information from those sites. In
either case, these sites are often networks of individuals where one can
perform graph traversal algorithms to collect information from them. In
other words, we can start collecting information from a subset of nodes on
a social network, subsequently collect information from their neighbors,
and so on. The data collected this way needs to be represented in a unified
format for analysis. For instance, consider a set of tweets in which we are
looking for common patterns. To find patterns in these tweets, they need
to be first represented using a consistent data format. In the next section,
we discuss data, its representation, and its types.
5.1
Data

Data is typically represented in a tabular format. Consider, for example, the following data collected about customers of an online bookseller:

Name   Money Spent   Bought Similar   Visits       Will Buy (class)
John   High          Yes              Frequently   ?
Mary   High          Yes              Rarely       Yes
A dataset is represented using a set of features, and an instance is represented using values assigned to these features. Features are also known
as measurements or attributes. In this example, the features are Name, Money
Spent, Bought Similar, and Visits; feature values for the first instance
are John, High, Yes, and Frequently. Given the feature values for one
instance, one tries to predict its class (or class attribute) value. In our example, the class attribute is Will Buy, and our class value prediction for first
instance is Yes. An instance such as John in which the class attribute value
is unknown is called an unlabeled instance. Similarly, a labeled instance
is an instance in which the class attribute value is known. Mary in this
dataset represents a labeled instance. The class attribute is optional in a
dataset and is only necessary for prediction or classification purposes. One
can have a dataset in which no class attribute is present, such as a list of
customers and their characteristics.
There are different types of features based on the characteristics of the
feature and the values they can take. For instance, Money Spent can be
represented using numeric values, such as $25. In that case, we have a
continuous feature, whereas in our example it is a discrete feature, which can
take a number of ordered values: {High, Normal, Low}.
Different types of features were first introduced by psychologist Stanley
Smith Stevens [265] as levels of measurement in the theory of scales. He
claimed that there are four types of features. For each feature type, there
exists a set of permissible operations (statistics) using the feature values
and transformations that are allowed.
Nominal (categorical). These features take values that are often
represented as strings. For instance, a customers name is a nominal
feature. In general, a few statistics can be computed on nominal
features. Examples are the chi-square statistic (χ²) and the mode (most
common feature value). For example, one can find the most common
first name among customers. The only possible transformation on
the data is comparison. For example, we can check whether our
customers name is John or not. Nominal feature values are often
presented in a set format.
Ordinal. Ordinal features lay data on an ordinal scale. In other
words, the feature values have an intrinsic order to them. In our
example, Money Spent is an ordinal feature because a High value for
Money Spent is more than a Low one.
Interval. In interval features, in addition to their intrinsic ordering, differences are meaningful whereas ratios are meaningless. For
interval features, addition and subtraction are allowed, whereas multiplication and division are not. Consider two time readings: 6:16 PM and 3:08 PM. The difference between these two time readings is meaningful (3 hours and 8 minutes); however, there is no meaning to the ratio 6:16 PM / 3:08 PM ≈ 2.
Ratio. Ratio features, as the name suggests, add the additional properties of multiplication and division. An individuals income is an
example of a ratio feature where not only differences and additions
are meaningful but ratios also have meaning (e.g., an individuals
income can be twice as much as Johns income).
In social media, individuals generate many types of nontabular data,
such as text, voice, or video. These types of data are first converted to tabular data and then processed using data mining algorithms. For instance,
voice can be converted to feature values using approximation techniques
such as the fast Fourier transform (FFT) and then processed using data
mining algorithms. To convert text into a tabular format, we can use the vector-space model, in which each document i is represented as a vector of word weights,

d_i = (w_{1,i}, w_{2,i}, . . . , w_{N,i}),    (5.1)

where w_{j,i} represents the weight of word j in document i and N is the number of words used for vectorization.² To compute w_{j,i}, we can set it to 1 when the word j exists in document i and 0 when it does not.
We can also set it to the number of times the word j is observed in document i. A more generalized approach is to use the term frequency-inverse document frequency (TF-IDF) weighting scheme. In the TF-IDF scheme, w_{j,i} is calculated as

w_{j,i} = tf_{j,i} × idf_j,    (5.2)

where tf_{j,i} is the frequency of word j in document i, and idf_j is the inverse document frequency of word j across all documents,

idf_j = log₂ ( |D| / |{document d ∈ D | j ∈ d}| ).    (5.3)

²One can use all unique words in all documents (D) or a more frequent subset of words in the documents for vectorization.
Example 5.1. Consider a set D of 20 documents in which the word apple appears 10 times in document d_1 and occurs only in that document, and the word orange appears 20 times in d_1 and appears in all 20 documents. Then, the TF-IDF values for apple and orange in document d_1 are

tfidf(apple, d_1) = 10 × log₂(20/1) = 43.22,    (5.4)
tfidf(orange, d_1) = 20 × log₂(20/20) = 0.    (5.5)

Example 5.2. Consider the following three documents:

d_1 = "social media mining",    (5.6)
d_2 = "social media data",    (5.7)
d_3 = "financial market data".    (5.8)
The term frequencies are

        social  media  mining  data  financial  market
d_1       1       1       1     0       0         0
d_2       1       1       0     1       0         0
d_3       0       0       0     1       1         1

and the inverse document frequencies are

idf_social = log₂(3/2) = 0.584,    (5.9)
idf_media = log₂(3/2) = 0.584,    (5.10)
idf_mining = log₂(3/1) = 1.584,    (5.11)
idf_data = log₂(3/2) = 0.584,    (5.12)
idf_financial = log₂(3/1) = 1.584,    (5.13)
idf_market = log₂(3/1) = 1.584.    (5.14)
The resulting TF-IDF vectors for the three documents are

        social  media  mining  data   financial  market
d_1     0.584   0.584  1.584   0      0          0
d_2     0.584   0.584  0       0.584  0          0
d_3     0       0      0       0.584  1.584      1.584
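A short sketch in plain Python that reproduces the TF-IDF vectors of Example 5.2:

import math
from collections import Counter

docs = {"d1": "social media mining".split(),
        "d2": "social media data".split(),
        "d3": "financial market data".split()}
vocab = sorted({w for words in docs.values() for w in words})
N = len(docs)

def tfidf(doc_words):
    tf = Counter(doc_words)
    vec = {}
    for w in vocab:
        df = sum(1 for words in docs.values() if w in words)   # document frequency
        vec[w] = tf[w] * math.log2(N / df)                      # Equations 5.2 and 5.3
    return vec

for name, words in docs.items():
    print(name, {w: round(v, 3) for w, v in tfidf(words).items()})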
5.1.1
Data Quality
When preparing data for use in data mining algorithms, the following four
data quality aspects need to be verified:
1. Noise is the distortion of the data. This distortion needs to be removed or its adverse effect alleviated before running data mining
algorithms because it may adversely affect the performance of the algorithms. Many filtering algorithms are effective in combating noise
effects.
2. Outliers are instances that are considerably different from other instances in the dataset. Consider an experiment that measures the
average number of followers of users on Twitter. A celebrity with
many followers can easily distort the average number of followers
per individual. Since the celebrities are outliers, they need to be removed from the set of individuals to accurately measure the average number of followers. Note that in special cases, outliers represent useful patterns, and the decision to remove them depends on the
context of the data mining problem.
3. Missing Values are feature values that are missing in instances. For
example, individuals may avoid reporting profile information on social media sites, such as their age, location, or hobbies. To solve
this problem, we can (1) remove instances that have missing values,
(2) estimate missing values (e.g., replacing them with the most common value), or (3) ignore missing values when running data mining
algorithms.
4. Duplicate data occurs when there are multiple instances with the
exact same feature values. Duplicate blog posts, duplicate tweets,
or profiles on social media sites with duplicate information are all
instances of this phenomenon. Depending on the context, these instances can either be removed or kept. For example, when instances
need to be unique, duplicate instances should be removed.
After these quality checks are performed, the next step is preprocessing
or transformation to prepare the data for mining.
5.2
Data Preprocessing
Often, the data provided for data mining is not immediately ready. Data
preprocessing (and transformation in Figure 5.1) prepares the data for
mining. Typical data preprocessing tasks are as follows:
1. Aggregation. This task is performed when multiple features need
to be combined into a single one or when the scale of the features
change. For instance, when storing image dimensions for a social
media website, one can store by image width and height or equivalently store by image area (width × height). Storing image area saves
storage space and tends to reduce data variance; hence, the data has
higher resistance to distortion and noise.
2. Discretization. Consider a continuous feature such as money spent
in our previous example. This feature can be converted into discrete
values High, Normal, and Low by mapping different ranges to
different discrete values. The process of converting continuous features to discrete ones and deciding the continuous range that is being
assigned to a discrete value is called discretization.
3. Feature Selection. Often, not all features gathered are useful. Some
may be irrelevant, or there may be a lack of computational power
to make use of all the features, among many other reasons. In these
cases, a subset of features are selected that could ideally enhance the
performance of the selected data mining algorithm. In our example,
customers name is an irrelevant feature to the value of the class
attribute and the task of predicting whether the individual will buy
the given book or not.
4. Feature Extraction. In contrast to feature selection, feature extraction
converts the current set of features to a new set of features that can
perform the data mining task better. A transformation is performed
on the data, and a new set of features is extracted. The example
we provided for aggregation is also an example of feature extraction
where a new feature (area) is constructed from two other features
(width and height).
5. Sampling. Often, processing the whole dataset is expensive. With
the massive growth of social media, processing large streams of data
is nearly impossible. This motivates the need for sampling. In sampling, a small random subset of instances are selected and processed
instead of the whole data. The selection process should guarantee
that the sample is representative of the distribution that governs the
data, thereby ensuring that results obtained on the sample are close
to ones obtained on the whole dataset. The following are three major
sampling techniques:
Random sampling. In random sampling, instances are selected
uniformly from the dataset. In other words, in a dataset of size
n, all instances have an equal probability 1/n of being selected. Note
that other probability distributions can also be used to sample
the dataset, and the distribution can be different from uniform.
Sampling with or without replacement. In sampling with replacement, an instance can be selected multiple times in the sample. In sampling without replacement, instances are removed
from the selection pool once selected.
Stratified sampling. In stratified sampling, the dataset is first
partitioned into multiple bins; then a fixed number of instances
are selected from each bin using random sampling. This technique is particularly useful when the dataset does not have a
uniform distribution for class attribute values (i.e., class imbalance). For instance, consider a set of 10 females and 5 males. A
sample of 5 females and 5 males can be selected using stratified
sampling from this set.
In social media, a large amount of information is represented in
network form. These networks can be sampled by selecting a subset
of their nodes and edges. These nodes and edges can be selected
using the aforementioned sampling methods. We can also sample
these networks by starting with a small set of nodes (seed nodes) and
sample
(a) the connected components they belong to;
(b) the set of nodes (and edges) connected to them directly; or
(c) the set of nodes and edges that are within n-hop distance from
them.
5.3
Data Mining Algorithms

5.4
Supervised Learning

5.4.1
Decision Tree Learning
Consider the dataset shown in Table 5.1. The last attribute represents the
class attribute, and the other attributes represent the features. In decision
tree classification, a decision tree is learned from the training dataset, and
Figure 5.3: Decision Trees Learned from the Data Provided in Table 5.1.

Each nonleaf node in a decision tree represents a feature, and each branch represents a value that the feature can take. Instances are classified by following
a path that starts at the root node and ends at a leaf by following branches
based on instance feature values. The value of the leaf determines the class
attribute value predicted for the instance (see Figure 5.3).
Decision trees are constructed recursively from training data using a
top-down greedy approach in which features are sequentially selected. In
Figure 5.3(a), the feature selected for the root node is Celebrity. After
selecting a feature for each node, based on its values, different branches
are created: For Figure 5.3(a), since the Celebrity feature can only take
either Yes or No, two branches are created: one labeled Yes and one labeled
No. The training set is then partitioned into subsets based on the feature
values, each of which fall under the respective feature value branch; the
process is continued for these subsets and other nodes. In Figure 5.3(a),
instances 1, 4, and 9 from Table 5.1 represent the subset that falls under the
Celebrity=Yes branch, and the other instances represent the subset that
falls under the Celebrity=No branch.
When selecting features, we prefer features that partition the set of
instances into subsets that are more pure. A pure subset has instances
that all have the same class attribute value. In Figure 5.3(a), the instances
that fall under the left branch of the root node (Celebrity=Yes) form a
pure subset in which all instances have the same class attribute value
Influential?=No. When reaching pure subsets under a branch, the decision tree construction process no longer partitions the subset, creates a leaf
under the branch, and assigns the class attribute value for subset instances
as the leafs predicted class attribute value. In Figure 5.3(a), the instances
that fall under the right branch of the root node form an impure dataset;
therefore, further branching is required to reach pure subsets. Purity of
subsets can be determined with different measures. A common measure
of purity is entropy. Over a subset of training instances, T, with a binary
class attribute (values {+, −}), the entropy of T is defined as

entropy(T) = −p₊ log p₊ − p₋ log p₋,    (5.15)

where p₊ and p₋ are the proportions of instances in T with positive and negative class attribute values, respectively.
For example, in a subset of 10 instances in which 7 are positive and 3 are negative, the entropy is

entropy(T) = −(7/10) log (7/10) − (3/10) log (3/10) = 0.881.    (5.16)
Note that if the numbers of positive and negative instances in the set are equal (p₊ = p₋ = 0.5), then the entropy is 1.
In a pure subset, all instances have the same class attribute value and
the entropy is 0. If the subset being measured contains an unequal number
of positive and negative instances, the entropy is between 0 and 1.
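A tiny sketch of the entropy computation used to compare candidate splits (plain Python; the label list reproduces the 7-positive/3-negative example above):

import math

def entropy(labels):
    """Entropy of a list of class labels (Equation 5.15, generalized to any number of classes)."""
    n = len(labels)
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        result -= p * math.log2(p)
    return result

print(entropy(["+"] * 7 + ["-"] * 3))   # 0.881...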
5.4.2
Naive Bayes Classifier
Among many methods that use the Bayes theorem, the naive Bayes classifier (NBC) is the simplest. Given two random variables X and Y, Bayes
theorem states that
P(Y|X) = P(X|Y) P(Y) / P(X).    (5.17)
In NBC, Y represents the class variable and X represents the instance
features. Let X be (x1 , x2 , x3 , . . . , xm ), where xi represents the value of feature
i. Let {y1 , y2 , . . . , yn } represent the values the class attribute Y can take.
Then, the class attribute value of instance X can be calculated by finding

arg max_{y_i} P(y_i | X).    (5.18)

Using the Bayes theorem,

P(y_i | X) = P(X | y_i) P(y_i) / P(X),    (5.19)

and under the naive assumption that features are conditionally independent given the class,

P(X | y_i) = Π_{j=1}^{m} P(x_j | y_i).    (5.20)

Since P(X) does not depend on y_i, it can be ignored when maximizing over y_i.
Example 5.4. Consider a dataset of eight instances with features Outlook (O), Temperature (T), and Humidity (H) and the class attribute Play Golf (PG). Instance i_8, with O = Sunny, T = mild, and H = high, is unlabeled; of the seven labeled instances, four have PG = Y and three have PG = N. To label i_8 with NBC, we compute

P(PG = Y | i_8) = P(O = Sunny | PG = Y) P(T = mild | PG = Y) P(H = high | PG = Y) P(PG = Y) / P(i_8)
  = (1/4)(2/4)(2/4)(4/7) / P(i_8) = 1 / (28 P(i_8)).    (5.21)

Similarly,

P(PG = N | i_8) = P(O = Sunny | PG = N) P(T = mild | PG = N) P(H = high | PG = N) P(PG = N) / P(i_8)
  = (2/3)(1/3)(2/3)(3/7) / P(i_8) = 4 / (63 P(i_8)).    (5.22)

Since 4/(63 P(i_8)) > 1/(28 P(i_8)), NBC predicts Play Golf = N for instance i_8.    (5.23)

5.4.3
k-Nearest Neighbor Classifier
As the name suggests, k-nearest neighbor or kNN uses the k nearest instances, called neighbors, to perform classification. The instance being
classified is assigned the label (class attribute value) that the majority of
its k neighbors are assigned. The algorithm is outlined in Algorithm 5.1.
When k = 1, the closest neighbors label is used as the predicted label for
the instance being classified. To determine the neighbors of an instance,
we need to measure its distance to all other instances based on some distance metric. Commonly, Euclidean distance is employed; however, for
higher dimensional spaces, Euclidean distance becomes less meaningful
and other distance measures can be used.
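A compact kNN sketch with Euclidean distance (plain Python; the training data and query point are illustrative):

from collections import Counter
import math

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs; returns the majority label of the k nearest neighbors."""
    nearest = sorted(train, key=lambda xy: math.dist(xy[0], query))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((3.0, 3.2), "-"), ((2.9, 3.1), "-")]
print(knn_predict(train, (1.1, 1.0), k=3))   # "+"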
Example 5.5. Consider the example depicted in Figure 5.4. As shown, depending on the value of k, different labels can be predicted for the instance being classified (shown as a circle).
5.4.4
Classification with Network Information
Consider a friendship network on social media and a product being marketed to this network. The product seller wants to know who the potential
buyers are for this product. Assume we are given the network with the
list of individuals who decided to buy or not buy the product. Our goal
is to predict the decision for the undecided individuals. This problem
can be formulated as a classification problem based on features gathered
from individuals. However, in this case, we have additional friendship
information that may be helpful in building more accurate classification
models. This is an example of classification with network information.
Assume we are not given any profile information, but only connections
and class labels (i.e., the individual bought/will not buy). By using the
rows of the adjacency matrix of the friendship network for each node as
features and the decision (e.g., buy/not buy) as a class label, we can predict
the label for any unlabeled node using its connections; that is, its row in the
adjacency matrix. Let P(y_i = 1 | N(v_i)) denote the probability of node v_i having class attribute value 1 given its neighbors. Individuals' decisions are often highly influenced by their immediate neighbors. Thus, we can approximate P(y_i = 1) using the neighbors of the individual by assuming that

P(y_i = 1) ≈ P(y_i = 1 | N(v_i)).    (5.24)

We can estimate P(y_i = 1 | N(v_i)) via different approaches. The weighted-vote relational-neighbor (wvRN) classifier is one such approach. It estimates P(y_i = 1 | N(v_i)) as

P(y_i = 1 | N(v_i)) = (1 / |N(v_i)|) Σ_{v_j ∈ N(v_i)} P(y_j = 1 | N(v_j)).    (5.25)
Note that in our example, the class attribute can take two values; therefore, the initial guess of P(y_i = 1 | N(v_i)) = 1/2 = 0.5 is reasonable. When a class attribute takes n values, we can set our initial guess to P(y_i = 1 | N(v_i)) = 1/n.
Example 5.6. Consider a graph with six nodes in which v_1, v_2, and v_5 are labeled and v_3, v_4, and v_6 are unlabeled. From the labels,

P(y_1 = 1 | N(v_1)) = 1,    (5.26)
P(y_2 = 1 | N(v_2)) = 1,    (5.27)
P(y_5 = 1 | N(v_5)) = 0,    (5.28)

and the unlabeled nodes start with the initial guess 0.5. Node v_3's neighbors are v_1, v_2, and v_5; hence

P(y_3 | N(v_3)) = (1/3) (P(y_1 = 1 | N(v_1)) + P(y_2 = 1 | N(v_2)) + P(y_5 = 1 | N(v_5))) = (1/3)(1 + 1 + 0) = 0.67.    (5.29)
P(y3 |N(v3 )) does not need to be computed again because its neighbors are all
labeled (thus, this probability estimation has converged). Similarly,
P(y4 | N(v4)) = (1/2) (1 + 0.5) = 0.75,   (5.30)
P(y6 | N(v6)) = (1/2) (0.75 + 0) = 0.38.   (5.31)
We need to recompute both P(y4 |N(v4 )) and P(y6 |N(v6 )) until convergence.
Let P^(t)(yi | N(vi)) denote the estimated probability after t computations. Then,

P^(1)(y4 | N(v4)) = (1/2)(1 + 0.38) = 0.69,   (5.32)
P^(1)(y6 | N(v6)) = (1/2)(0.69 + 0) = 0.35,   (5.33)
P^(2)(y4 | N(v4)) = (1/2)(1 + 0.35) = 0.68,   (5.34)
P^(2)(y6 | N(v6)) = (1/2)(0.68 + 0) = 0.34,   (5.35)
P^(3)(y4 | N(v4)) = (1/2)(1 + 0.34) = 0.67,   (5.36)
P^(3)(y6 | N(v6)) = (1/2)(0.67 + 0) = 0.34,   (5.37)
P^(4)(y4 | N(v4)) = (1/2)(1 + 0.34) = 0.67,   (5.38)
P^(4)(y6 | N(v6)) = (1/2)(0.67 + 0) = 0.34.   (5.39)
After four iterations, both probabilities converge. So, from these probabilities
(Equations 5.29, 5.38, and 5.39), we can tell that nodes v3 and v4 will likely have
class attribute value 1 and node v6 will likely have class attribute value 0.
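A short Python sketch of the wvRN iteration is given below. The adjacency structure used here is an assumption consistent with the computations above (v3 adjacent to v1, v2, v5; v4 adjacent to v1 and v6; v6 adjacent to v4 and v5); the book's Figure 5.5 is not reproduced in this excerpt, so treat the structure as illustrative.

# Neighbor lists for the unlabeled nodes (assumed structure, see lead-in above)
neighbors = {
    "v3": ["v1", "v2", "v5"],
    "v4": ["v1", "v6"],
    "v6": ["v4", "v5"],
}
# P(y = 1): 1 or 0 for labeled nodes, the initial guess 0.5 otherwise
p = {"v1": 1.0, "v2": 1.0, "v5": 0.0, "v3": 0.5, "v4": 0.5, "v6": 0.5}
labeled = {"v1", "v2", "v5"}

for _ in range(20):                      # iterate until (approximate) convergence
    for node, nbrs in neighbors.items():
        if node not in labeled:
            p[node] = sum(p[n] for n in nbrs) / len(nbrs)   # Equation 5.25

for node in ("v3", "v4", "v6"):
    print(node, round(p[node], 2))       # roughly 0.67, 0.67, 0.33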
5.4.5
Regression
Linear Regression
In linear regression, we assume that the class attribute Y has a linear relation
with the regressors (feature set) X by considering a linear error . In other
words,
Y = XW + ,
(5.40)
where W represents the vector of regression coefficients. The problem of
regression can be solved by estimating W using the training dataset and
its labels Y such that fitting error is minimized. A variety of methods
have been introduced to solve the linear regression problem, most of which
use least squares or maximum-likelihood estimation. We employ the least
squares technique here. Interested readers can refer to the bibliographic
notes for more detailed analyses. In the least squares method, we find W using regressors X and labels Y such that the square of the fitting error ε is minimized:

ε² = ||ε||² = ||Y − XW||².   (5.41)
To minimize ε, we compute the gradient and set it to zero to find the optimal W:

∂||Y − XW||² / ∂W = 0.   (5.42)

We know that for any X, ||X||² = X^T X; therefore,

∂||Y − XW||² / ∂W = ∂[(Y − XW)^T (Y − XW)] / ∂W
                  = ∂[(Y^T − W^T X^T)(Y − XW)] / ∂W
                  = ∂[Y^T Y − Y^T X W − W^T X^T Y + W^T X^T X W] / ∂W
                  = −2 X^T Y + 2 X^T X W = 0.   (5.43)
Therefore,

X^T Y = X^T X W,   (5.44)

and the optimal W is

W = (X^T X)^{−1} X^T Y   (5.45)
  = (V Σ U^T U Σ V^T)^{−1} V Σ U^T Y   (5.46)
  = (V Σ² V^T)^{−1} V Σ U^T Y   (5.47)
  = V Σ^{−2} V^T V Σ U^T Y = V Σ^{−1} U^T Y,   (5.48)

where X = U Σ V^T is the singular value decomposition of X.
where X is the vector of features and Y is the class attribute. We can use linear regression to approximate p; that is, we can assume that the probability p depends linearly on X,

p = βX.   (5.49)

However, p is a probability and must lie in [0, 1], whereas βX is unbounded. We can instead assume a linear relation for the log odds,

ln ( p / (1 − p) ) = βX,   (5.51)

which, solved for p, gives

p = e^{βX} / (e^{βX} + 1) = 1 / (1 + e^{−βX}).   (5.52)
This function is known as the logistic function and is plotted in Figure 5.6. An interesting property of this function is that, for any real value
(negative to positive infinity), it will generate values between 0 and 1. In
other words, it acts as a probability function.
Our task is to find the βs such that P(Y|X) is maximized. Unlike linear regression models, there is no closed-form solution to this problem, and it is usually solved using iterative maximum likelihood methods (see Bibliographic Notes).
After the βs are found, similar to the naive Bayes classifier (NBC), we compute the probability P(Y|X) using Equation 5.52. When the class attribute takes two values and this probability is larger than 0.5, the class attribute is predicted 1; otherwise, 0 is predicted.
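As a minimal sketch, the logistic function of Equation 5.52 and one simple iterative fitting scheme (plain gradient ascent on the log-likelihood, used here only as an illustration of the iterative methods mentioned above) can be written as follows; the toy dataset is assumed, not taken from the text.

import numpy as np

def logistic(z):
    # Logistic function of Equation 5.52: maps any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Toy one-feature dataset with an intercept column; labels are 0/1
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]])
Y = np.array([0, 0, 0, 1, 1, 1])

beta = np.zeros(X.shape[1])
for _ in range(5000):                       # simple gradient ascent on the log-likelihood
    p = logistic(X @ beta)                  # P(Y = 1 | X) under the current betas
    beta += 0.1 * X.T @ (Y - p)             # gradient of the log-likelihood

pred = (logistic(X @ beta) > 0.5).astype(int)   # predict 1 when the probability exceeds 0.5
print(beta, pred)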
5.4.6 Supervised Learning Evaluation

To evaluate supervised learning, the labeled dataset is partitioned into training and testing parts; common partitioning strategies are leave-one-out and k-fold cross validation. A simple evaluation measure is accuracy, the fraction of test instances predicted correctly,

accuracy = c / n,   (5.53)

where c is the number of correctly predicted instances and n is the total number of test instances.
Measure Name    | Formula                                        | Description
Mahalanobis     | d(X, Y) = √((X − Y)^T Σ^{−1} (X − Y))          | X, Y are feature vectors and Σ is the covariance matrix of the dataset
Manhattan (L1)  | d(X, Y) = ∑_i |xi − yi|                        | X, Y are feature vectors
Minkowski (Lp)  | d(X, Y) = (∑_i |xi − yi|^n)^{1/n}              | X, Y are feature vectors
are close. The smaller the distance between these lines, the more accurate
the models learned from the data.
5.5
Unsupervised Learning
5.5.1
Clustering Algorithms
Figure 5.7: k-Means Output on a Sample Dataset. Instances are two-dimensional vectors shown in the 2-D space. k-means is run with k = 6, and the clusters found are visualized using different symbols.
assignments of the data instances stabilizing. In practice, the algorithm execution can be stopped when the Euclidean distance between the centroids in two consecutive steps is bounded above by some small positive ε. As an alternative, k-means implementations try to minimize an objective function. A well-known objective function in these implementations is the squared distance error,

∑_{i=1}^{k} ∑_{j=1}^{n(i)} ||x_i^j − c_i||²,   (5.55)

where x_i^j is the jth instance of cluster i, n(i) is the number of instances in cluster i, and c_i is the centroid of cluster i. The process stops when the difference between the objective function values of two consecutive iterations of the k-means algorithm is bounded by some small value ε.
Note that k-means is highly sensitive to the initial k centroids, and different clustering results can be obtained on a single dataset depending on the initial centroids selected.
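A minimal Python sketch of this procedure, with the squared-distance-error stopping criterion, is given below; the sample points and parameter values are illustrative.

import random

def kmeans(points, k, eps=1e-6, max_iter=100):
    # points: list of coordinate tuples; start from k randomly chosen centroids
    centroids = random.sample(points, k)
    prev_sse = float("inf")
    for _ in range(max_iter):
        # Assignment step: attach each instance to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: recompute each centroid as the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(v) / len(cl) for v in zip(*cl))
        # Squared distance error (Equation 5.55); stop when its change is below eps
        sse = sum(sum((a - b) ** 2 for a, b in zip(p, centroids[i]))
                  for i, cl in enumerate(clusters) for p in cl)
        if prev_sse - sse < eps:
            break
        prev_sse = sse
    return clusters, centroids

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
clusters, cents = kmeans(pts, k=2)
print(cents)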
5.5.2 Unsupervised Learning Evaluation
When clusters are found, there is a need to evaluate how accurately the task has been performed. When ground truth is available, we have prior knowledge of which instances should belong to which cluster, as discussed in detail in Chapter 6. However, evaluating clustering is a challenge because ground truth is often not available. When ground truth is unavailable, we employ techniques that analyze the discovered clusters and describe their quality. In particular, we can use techniques that measure the cohesiveness or separateness of clusters.
Cohesiveness measures how close the instances within each cluster are to their cluster centroid,

cohesiveness = ∑_{i=1}^{k} ∑_{j=1}^{n(i)} ||x_i^j − c_i||²,   (5.56)

which is the squared distance error (also known as SSE) discussed previously. Small values of cohesiveness denote highly cohesive clusters in which all instances are close to the centroid of the cluster.
Example 5.7. Figure 5.8 shows a dataset of four one-dimensional instances: −10, −5, 5, and 10. The instances are clustered into two clusters. Instances in cluster 1 are x_1^1 = −10 and x_1^2 = −5, and instances in cluster 2 are x_2^1 = 5 and x_2^2 = 10. The centroids of these two clusters are c_1 = −7.5 and c_2 = 7.5. For these two clusters, the cohesiveness is

cohesiveness = |−10 − (−7.5)|² + |−5 − (−7.5)|² + |5 − 7.5|² + |10 − 7.5|² = 25.   (5.57)
Separateness

We are also interested in clusterings that generate clusters that are well separated from one another. To measure this, we can use the separateness measure, which computes how far the cluster centroids are from the centroid of the whole dataset,

separateness = ∑_{i=1}^{k} ||c − c_i||²,   (5.58)

where c = (1/n) ∑_{i=1}^{n} x_i is the centroid of all instances and c_i is the centroid of cluster i. Large values of separateness denote clusters that are far apart.
Example 5.8. For the dataset shown in Figure 5.8, the centroid for all instances
is denoted as c. For this dataset, the separateness is
separateness = |−7.5 − 0|² + |7.5 − 0|² = 112.5.   (5.59)
The silhouette index combines both ideas. For an instance x in cluster C_x, let a(x) denote the average squared distance between x and the other instances in its own cluster, and let b(x) denote the minimum, over all other clusters G, of the average squared distance between x and the members of G,

b(x) = min_{G ≠ C_x} (1 / |G|) ∑_{y ∈ G} ||x − y||².   (5.61)

The silhouette of x is

s(x) = (b(x) − a(x)) / max(a(x), b(x)),   (5.62)

and the silhouette index of a clustering is the average of s(x) over all instances.   (5.63)
The silhouette index takes values in [−1, 1]. The best clustering is achieved when, for every instance x, a(x) ≪ b(x); in this case, the silhouette is close to 1. Conversely, a silhouette value below 0 indicates that many instances are closer to other clusters than to their assigned cluster, which signals low-quality clustering.
Example 5.9. In Figure 5.8, the a(·), b(·), and s(·) values are

a(x_1^1) = |−10 − (−5)|² = 25,   (5.64)
b(x_1^1) = (1/2)(|−10 − 5|² + |−10 − 10|²) = 312.5,   (5.65)
s(x_1^1) = (312.5 − 25) / 312.5 = 0.92,   (5.66)
a(x_1^2) = |−5 − (−10)|² = 25,   (5.67)
b(x_1^2) = (1/2)(|−5 − 5|² + |−5 − 10|²) = 162.5,   (5.68)
s(x_1^2) = (162.5 − 25) / 162.5 = 0.84,   (5.69)
a(x_2^1) = |5 − 10|² = 25,   (5.70)
b(x_2^1) = (1/2)(|5 − (−10)|² + |5 − (−5)|²) = 162.5,   (5.71)
s(x_2^1) = (162.5 − 25) / 162.5 = 0.84,   (5.72)
a(x_2^2) = |10 − 5|² = 25,   (5.73)
b(x_2^2) = (1/2)(|10 − (−5)|² + |10 − (−10)|²) = 312.5,   (5.74)
s(x_2^2) = (312.5 − 25) / 312.5 = 0.92.   (5.75)

Given the s(·) values, the silhouette index is

silhouette = (1/4)(0.92 + 0.84 + 0.84 + 0.92) = 0.88.   (5.76)
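The cohesiveness, separateness, and silhouette values of Examples 5.7-5.9 can be reproduced with a short Python sketch such as the one below; the helper names are illustrative.

clusters = {1: [-10.0, -5.0], 2: [5.0, 10.0]}           # the dataset of Figure 5.8

def centroid(xs):
    return sum(xs) / len(xs)

c = {i: centroid(xs) for i, xs in clusters.items()}      # per-cluster centroids
c_all = centroid([x for xs in clusters.values() for x in xs])

cohesiveness = sum((x - c[i]) ** 2 for i, xs in clusters.items() for x in xs)   # Eq. 5.56
separateness = sum((c_all - ci) ** 2 for ci in c.values())                      # Eq. 5.58

def silhouette(x, i):
    own = [y for y in clusters[i] if y != x]
    a = sum((x - y) ** 2 for y in own) / len(own)                       # within-cluster distance
    b = min(sum((x - y) ** 2 for y in clusters[j]) / len(clusters[j])   # closest other cluster
            for j in clusters if j != i)
    return (b - a) / max(a, b)                                          # Eq. 5.62

s_values = [silhouette(x, i) for i, xs in clusters.items() for x in xs]
print(cohesiveness, separateness, round(sum(s_values) / len(s_values), 2))   # 25.0 112.5 0.88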
5.6
Summary
This chapter covered data mining essentials. The general process for analyzing data is known as knowledge discovery in databases (KDD). The first
step in the KDD process is data representation. Data instances are represented in tabular format using features. These instances can be labeled or
unlabeled. There exist different feature types: nominal, ordinal, interval,
and ratio. Data representation for text data can be performed using the
vector space model. After having a representation, quality measures need
to be addressed and preprocessing steps completed before processing the
data. Quality measures include noise removal, outlier detection, missing
values handling, and duplicate data removal. Preprocessing techniques
commonly performed are aggregation, discretization, feature selection,
feature extraction, and sampling.
We covered two categories of data mining algorithms: supervised and
unsupervised learning. Supervised learning deals with mapping feature
values to class labels, and unsupervised learning is the unsupervised division of instances into groups of similar objects.
When labels are discrete, supervised learning is called classification, and when labels are real numbers, it is called regression. We covered these classification methods: decision tree learning, naive Bayes classifier (NBC), nearest neighbor classifier, and classifiers that use network information.
We also discussed linear and logistic regression.
To evaluate supervised learning, a training-testing framework is used
in which the labeled dataset is partitioned into two parts, one for training
and the other for testing. Different approaches for evaluating supervised
learning such as leave-one-out or k-fold cross validation were discussed.
Any clustering algorithm requires the selection of a distance measure.
We discussed partitional clustering algorithms and k-means from these
algorithms, as well as methods of evaluating clustering algorithms. To
evaluate clustering algorithms, one can use clustering quality measures
such as cohesiveness, which measures how close instances are inside clusters, or separateness, which measures how far different clusters are from one another. The silhouette index combines cohesiveness and separateness into one measure.
5.7
Bibliographic Notes
5.8
Exercises
Data
3. Describe methods that can be used to deal with missing data.
4. Given a continuous attribute, how can we convert it to a discrete
attribute? How can we convert discrete attributes to continuous
ones?
5. If you had the chance of choosing either instance selection or feature
selection, which one would you choose? Please justify.
6. Given two text documents that are vectorized, how can we measure
document similarity?
7. In the example provided for TF-IDF (Example 5.1), the word orange
received zero score. Is this desirable? What does a high TF-IDF value
show?
Supervised Learning
8. Provide a pseudocode for decision tree induction.
9. How many decision trees containing n attributes and a binary class
can be generated?
10. What does zero entropy mean?
11.
Unsupervised Learning
13. (a) Given k clusters and their respective cluster sizes s1 , s2 , . . . , sk ,
what is the probability that two random (with replacement) data
vectors (from the clustered dataset) belong to the same cluster?
(b) Now, assume you are given this probability (you do not have the s_i's and k) and the fact that the clusters are equally sized. Can you find k? This gives you an idea of how to predict the number of clusters in a dataset.
15. What is the usual shape of clusters generated by k-means? Give
an example of cases where k-means has limitations in detecting the
patterns formed by the instances.
17. Describe a preprocessing strategy that can help detect nonspherical
clusters using k-means.
Part II
Communities and Interactions
Chapter 6
Community Analysis
In November 2010, a team of Dutch law enforcement agents dismantled
a community of 30 million infected computers across the globe that were
sending more than 3.6 billion daily spam mails. These distributed networks
of infected computers are called botnets. The community of computers in
a botnet transmits spam or viruses across the web without their owners' permission. The members of a botnet are rarely known; however, it is
vital to identify these botnet communities and analyze their behavior to
enhance internet security. This is an example of community analysis. In this
chapter, we discuss community analysis in social media.
Also known as groups, clusters, or cohesive subgroups, communities have
been studied extensively in many fields and, in particular, the social sciences. In social media mining, analyzing communities is essential. Studying communities in social media is important for many reasons. First,
individuals often form groups based on their interests, and when studying individuals, we are interested in identifying these groups. Consider
the importance of finding groups with similar reading tastes by an online book seller for recommendation purposes. Second, groups provide a
clear global view of user interactions, whereas a local view of individual behavior is often noisy and ad hoc. Finally, some behaviors are only observable in a group setting and not on an individual level. This is because an individual's behavior can fluctuate, but collective group behavior is
more robust to change. Consider the interactions between two opposing
political groups on social media. Two individuals, one from each group,
can hold similar opinions on a subject, but what is important is that their
Social Communities
Broadly speaking, a real-world community is a body of individuals with
common economic, social, or political interests/characteristics, often living
in relatively close proximity. A virtual community comes into existence
when like-minded users on social media form a link and start interacting
with each other. In other words, formation of any community requires (1)
a set of at least two nodes sharing some interest and (2) interactions with
respect to that interest.
Figure 6.1: Zachary's Karate Club. Nodes represent karate club members and edges represent friendships. A conflict in the club divided the members into two groups. The color of a node denotes which of the two groups the node belongs to.
As a real-world community example, consider the interactions of a college karate club collected by Wayne Zachary in 1977. The example is often referred to in the literature as Zachary's Karate Club [309]. Figure 6.1 depicts the interactions in a college karate club over two years. The links show friendships between members. During the observation period, individuals split into two communities due to a disagreement between the club administrator and the karate instructor, and members of one community left to start their own club. In this figure, node colors demonstrate the communities to which individuals belong. As observed in this figure, using graphs is a convenient way to depict communities because color-coded nodes can denote memberships and edges can be used to denote relations. Furthermore, we can observe that individuals are more likely to be friends with members of their own group, hence creating tightly knit components in the graph.
Zachary's Karate Club is an example of two explicit communities. An explicit community, also known as an emic community, satisfies the following three criteria:
6.1
Community Detection
6.1.1
6.1.2 Member-Based Community Detection
The intuition behind member-based community detection is that members with the same (or similar) characteristics are more often in the same
community. Therefore, a community detection algorithm following this
approach should assign members with similar characteristics to the same
community. Let us consider a simple example. We can assume that nodes
that belong to a cycle form a community. This is because they share the
same characteristic: being in the cycle. Figure 6.3 depicts a 4-cycle. For
instance, we can search for all n-cycles in the graph and assume that they represent a community. The choice of n can be based on empirical evidence or heuristics, or n can lie in a range [α1, α2] for which all cycles are found. A well-known example is the search for 3-cycles (triads) in graphs.
In theory, any subgraph can be searched for and assumed to be a community. In practice, only subgraphs that have nodes with specific characteristics are considered as communities. Three general node characteristics
that are frequently used are node similarity, node degree (familiarity), and node
reachability.
When employing node degrees, we seek subgraphs, which are often
connected, such that each node (or a subset of nodes) has a certain node
degree (number of incoming or outgoing edges). Our 4-cycle example
follows this property, the degree of each node being two. In reachability, we
seek subgraphs with specific properties related to paths existing between
nodes. For instance, our 4-cycle instance also follows the reachability
characteristic where all pairs of nodes can be reached via two independent
paths. In node similarity, we assume nodes that are highly similar belong
Relaxing cliques. A well-known clique relaxation that comes from sociology is the k-plex concept. In a clique of size k, all nodes have degree k − 1; however, in a k-plex, all nodes have a minimum degree that is not necessarily k − 1 (as opposed to cliques of size k). For a set of vertices V, the structure is called a k-plex if we have

d_v ≥ |V| − k,  ∀ v ∈ V,   (6.1)
In node similarity, a simple way to measure the similarity between two nodes v_i and v_j is the number of neighbors they share,

σ(v_i, v_j) = |N(v_i) ∩ N(v_j)|.   (6.2)
For large networks, this value can increase rapidly, because nodes may share many neighbors. Generally, similarity is attributed to a value that is bounded, usually in the range [0, 1]. For that to happen, various normalization procedures such as the Jaccard similarity or the cosine similarity can be used:

Jaccard(v_i, v_j) = |N(v_i) ∩ N(v_j)| / |N(v_i) ∪ N(v_j)|,   (6.3)
Cosine(v_i, v_j) = |N(v_i) ∩ N(v_j)| / √(|N(v_i)| |N(v_j)|).   (6.4)
Example 6.2. Consider the graph in Figure 6.7. The similarity values between nodes v_2 and v_5 are

Jaccard(v_2, v_5) = |{v_1, v_3, v_4} ∩ {v_3, v_6}| / |{v_1, v_3, v_4, v_6}| = 1/4 = 0.25,   (6.5)
Cosine(v_2, v_5) = |{v_1, v_3, v_4} ∩ {v_3, v_6}| / √(|{v_1, v_3, v_4}| |{v_3, v_6}|) = 1/√6 ≈ 0.40.   (6.6)
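Both normalized similarities can be computed directly from neighborhood sets, as in the following sketch, which reproduces the values of Example 6.2 (the dictionary of neighborhoods is the only input assumed).

from math import sqrt

# Neighborhoods of v2 and v5 as in Example 6.2
N = {"v2": {"v1", "v3", "v4"}, "v5": {"v3", "v6"}}

def jaccard(a, b):
    # |N(a) ∩ N(b)| / |N(a) ∪ N(b)|   (Equation 6.3)
    return len(N[a] & N[b]) / len(N[a] | N[b])

def cosine(a, b):
    # |N(a) ∩ N(b)| / sqrt(|N(a)| |N(b)|)   (Equation 6.4)
    return len(N[a] & N[b]) / sqrt(len(N[a]) * len(N[b]))

print(round(jaccard("v2", "v5"), 2), round(cosine("v2", "v5"), 2))   # 0.25 and about 0.40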
6.1.3 Group-Based Community Detection
Figure 6.8: Minimum Cut (A) and Two More Balanced Cuts (B and C) in a
Graph.
Clustering techniques have proven to be useful in identifying communities in social networks. In graph-based clustering, we cut the graph into
several partitions and assume these partitions represent communities.
Formally, a cut in a graph is a partitioning (cut) of the graph into two
(or more) sets (cutsets). The size of the cut is the number of edges that are
being cut and the summation of weights of edges that are being cut in a
weighted graph. A minimum cut (min-cut) is a cut such that the size of the
cut is minimized. Figure 6.8 depicts several cuts in a graph. For example,
cut B has size 4, and A is the minimum cut.
Based on the well-known max-flow min-cut theorem, the minimum cut
of a graph can be computed efficiently. However, minimum cuts are not always preferred for community detection. Often, they result in cuts where
a partition is only one node (singleton), and the rest of the graph is in
the other. Typically, communities with balanced sizes are preferred. Figure 6.8 depicts an example where the minimum cut (A) creates unbalanced
partitions, whereas, cut C is a more balanced cut.
To solve this problem, variants of minimum cut define an objective function such that minimizing (or maximizing) it during the cut-finding procedure results in a more balanced and natural partitioning of the data.
Consider a graph G(V, E). A partitioning of G into k partitions is a tuple P = (P_1, P_2, P_3, . . . , P_k), such that P_i ⊆ V, P_i ∩ P_j = ∅, and ∪_{i=1}^{k} P_i = V. Then, the objective functions for the ratio cut and normalized cut are defined as follows:

Ratio Cut(P) = (1/k) ∑_{i=1}^{k} cut(P_i, P̄_i) / |P_i|,   (6.7)

Normalized Cut(P) = (1/k) ∑_{i=1}^{k} cut(P_i, P̄_i) / vol(P_i),   (6.8)
where P̄_i = V − P_i is the complement of the cut set, cut(P_i, P̄_i) is the size of the cut, and the volume vol(P_i) = ∑_{v ∈ P_i} d_v. Both objective functions provide a more balanced community size by normalizing the cut size either by the number of vertices in the cut set or by its volume (total degree).
Both the ratio cut and normalized cut can be formulated in matrix format. Let matrix X ∈ {0, 1}^{|V|×k} denote the community membership matrix, where X_{i,j} = 1 if node i is in community j; otherwise, X_{i,j} = 0. Let D = diag(d_1, d_2, . . . , d_n) represent the diagonal degree matrix. Then the ith entry on the diagonal of X^T A X represents the number of edges that are inside community i. Similarly, the ith element on the diagonal of X^T D X represents the number of edges that are connected to members of community i. Hence, the ith element on the diagonal of X^T (D − A) X represents the number of edges that are in the cut that separates community i from all other nodes. In fact, the ith diagonal element of X^T (D − A) X is equivalent to the summation term cut(P_i, P̄_i) in both the ratio and normalized cut. Thus, for the ratio cut, we
have

Ratio Cut(P) = (1/k) ∑_{i=1}^{k} cut(P_i, P̄_i) / |P_i|   (6.9)
             = (1/k) ∑_{i=1}^{k} X_i^T (D − A) X_i / (X_i^T X_i)   (6.10)
             = (1/k) ∑_{i=1}^{k} X̂_i^T (D − A) X̂_i,   (6.11)

where X_i is the ith column of X and X̂_i = X_i / (X_i^T X_i)^{1/2}. Recalling that the trace of a matrix is the sum of its diagonal elements, Tr(X) = ∑_i X_{ii}, the objectives for both the ratio and normalized cut can be formulated as trace-minimization problems,

Ratio Cut(P) = (1/k) Tr(X̂^T L X̂),   (6.12)
Normalized Cut(P) = (1/k) Tr(X̂^T L̂ X̂),   (6.13)

where L = D − A is the (unnormalized) graph Laplacian and L̂ = I − D^{−1/2} A D^{−1/2} is the normalized graph Laplacian. It has been shown that both ratio cut and normalized cut minimization are NP-hard; therefore, approximation algorithms using relaxations are desired. Spectral clustering is one such relaxation:

min_{X̂}  Tr(X̂^T L X̂),   (6.14)
s.t.  X̂^T X̂ = I_k.   (6.15)
The solution to this problem consists of the eigenvectors of L. Given L, the k eigenvectors corresponding to the smallest eigenvalues are computed and used as X̂; k-means is then run on X̂ to extract community memberships (X). The first eigenvector is meaningless (why?); hence, the remaining k − 1 eigenvectors are used as the k-means input.
Example 6.3. Consider the graph in Figure 6.8. We find two communities in this graph using spectral clustering (i.e., k = 2). Then, we have

D = diag(2, 2, 4, 4, 4, 4, 4, 3, 1).   (6.16)

The adjacency matrix A and the unnormalized Laplacian L are

A = [ 0 1 1 0 0 0 0 0 0
      1 0 1 0 0 0 0 0 0
      1 1 0 1 1 0 0 0 0
      0 0 1 0 1 1 1 0 0
      0 0 1 1 0 1 1 0 0
      0 0 0 1 1 0 1 1 0
      0 0 0 1 1 1 0 1 0
      0 0 0 0 0 1 1 0 1
      0 0 0 0 0 0 0 1 0 ],   (6.17)

L = D − A = [  2 −1 −1  0  0  0  0  0  0
              −1  2 −1  0  0  0  0  0  0
              −1 −1  4 −1 −1  0  0  0  0
               0  0 −1  4 −1 −1 −1  0  0
               0  0 −1 −1  4 −1 −1  0  0
               0  0  0 −1 −1  4 −1 −1  0
               0  0  0 −1 −1 −1  4 −1  0
               0  0  0  0  0 −1 −1  3 −1
               0  0  0  0  0  0  0 −1  1 ].   (6.18)

We aim to find two communities; therefore, we take the two eigenvectors corresponding to the two smallest eigenvalues of L:

X̂ = [ 0.33 −0.46
      0.33 −0.46
      0.33 −0.26
      0.33 −0.01
      0.33 −0.01
      0.33  0.13
      0.33  0.13
      0.33  0.33
      0.33  0.59 ],   (6.19)

where the rows correspond to nodes 1 through 9.
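The example can be reproduced with a short numpy sketch. For k = 2 the final k-means step reduces to splitting on the sign of the second eigenvector, which is the simplification used below; the eigenvectors are determined only up to sign, so compare with Equation 6.19 accordingly.

import numpy as np

# Adjacency matrix of the graph in Figure 6.8 (Equation 6.17)
A = np.array([
    [0,1,1,0,0,0,0,0,0],
    [1,0,1,0,0,0,0,0,0],
    [1,1,0,1,1,0,0,0,0],
    [0,0,1,0,1,1,1,0,0],
    [0,0,1,1,0,1,1,0,0],
    [0,0,0,1,1,0,1,1,0],
    [0,0,0,1,1,1,0,1,0],
    [0,0,0,0,0,1,1,0,1],
    [0,0,0,0,0,0,0,1,0],
], dtype=float)

D = np.diag(A.sum(axis=1))          # degree matrix, diag(2,2,4,4,4,4,4,3,1)
L = D - A                           # unnormalized Laplacian (Equation 6.18)

vals, vecs = np.linalg.eigh(L)      # eigenvalues returned in ascending order
fiedler = vecs[:, 1]                # second-smallest eigenvector (the first is constant)
communities = fiedler > 0           # for k = 2, split on the sign instead of running k-means
print(np.round(vecs[:, :2], 2))     # compare with Equation 6.19 (up to sign)
print(communities.astype(int))      # one community for nodes 1-5, the other for nodes 6-9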
In a random network with the same degree distribution, the expected number of edges between v_i and v_j is d_i d_j / 2m. So, given a degree distribution, the expected number of edges between any pair of vertices can be computed. Real-world communities are far from random; therefore, the more distant they are from randomly generated communities, the more structure they exhibit. Modularity defines this distance, and modularity maximization tries to maximize it. Consider a partitioning of the graph G into k partitions, P = (P_1, P_2, P_3, . . . , P_k). For partition P_x, this distance can be defined as

∑_{v_i, v_j ∈ P_x} ( A_ij − d_i d_j / 2m ).   (6.20)

This distance can be generalized over the whole partitioning P,

∑_{x=1}^{k} ∑_{v_i, v_j ∈ P_x} ( A_ij − d_i d_j / 2m ).   (6.21)
The summation is over all edges (m), and because all edges are counted
twice (Ai j = A ji ), the normalized version of this distance is defined as
modularity [211]:

Q = (1/2m) ∑_{x=1}^{k} ∑_{v_i, v_j ∈ P_x} ( A_ij − d_i d_j / 2m ).   (6.22)
In matrix form, modularity can be written as

Q = (1/2m) Tr(X^T B X),   (6.23)

where X is the community membership matrix and B = A − d d^T / 2m is the modularity matrix, with d the vector of node degrees.
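Equation 6.22 can be computed directly from an adjacency matrix and a candidate partition, as in the sketch below; the small two-triangle graph is illustrative and is not one of the book's figures.

import numpy as np

def modularity(A, communities):
    # A: symmetric adjacency matrix; communities: list of lists of node indices
    d = A.sum(axis=1)                       # node degrees
    m = A.sum() / 2.0                       # number of edges
    Q = 0.0
    for P in communities:
        for i in P:
            for j in P:
                Q += A[i, j] - d[i] * d[j] / (2 * m)   # Equation 6.22, inner term
    return Q / (2 * m)

# Two triangles joined by a single edge; the "natural" partition scores higher
A = np.zeros((6, 6))
for i, j in [(0,1),(0,2),(1,2),(3,4),(3,5),(4,5),(2,3)]:
    A[i, j] = A[j, i] = 1
print(round(modularity(A, [[0,1,2],[3,4,5]]), 3))   # about 0.357
print(round(modularity(A, [[0,1,2,3,4,5]]), 3))     # 0.0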
Graph Density
Often, we are interested in dense communities, which have sufficiently frequent interactions. These communities are of particular interest in social
media where we would like to have enough interactions for analysis to
make statistical sense. When we are measuring density in communities,
the community may or may not be connected as long as it satisfies the properties required, assuming connectivity is not one such property. Cliques,
clubs, and clans are examples of connected dense communities. Here,
we focus on subgraphs that have the possibility of being disconnected.
Density-based community detection has been extensively discussed in the
field of clustering (see Chapter 5, Bibliographic Notes).
The density of a graph defines how close a graph is to a clique. In other words, the density γ is the ratio of the number of edges |E| that graph G has over the maximum number it can have, (|V| choose 2):

γ = |E| / (|V| choose 2).   (6.24)
A graph G = (V, E) is γ-dense if |E| ≥ γ (|V| choose 2). Note that a 1-dense graph is a clique. Here, we discuss the interesting scenario of connected dense graphs (i.e., quasi-cliques). A quasi-clique (or γ-clique) is a connected γ-dense graph. Quasi-cliques can be searched for using approaches previously discussed for finding cliques. We can utilize the brute-force clique
identification algorithm (Algorithm 6.1) for finding quasi-cliques as well.
The only part of the algorithm that needs to be changed is the part where
the clique condition is checked (Line 8). This can be replaced with a quasiclique checking condition. In general, because there is less regularity in
quasi-cliques, searching for them becomes harder. Interested readers can
refer to the bibliographic notes for faster algorithms.
Hierarchical Communities
All previously discussed methods have considered communities at a single
level. In reality, it is common to have hierarchies of communities, in which
each community can have sub/super communities. Hierarchical clustering
deals with this scenario and generates community hierarchies. Initially, n
nodes are considered as either 1 or n communities in hierarchical clustering.
These communities are gradually merged or split (agglomerative or divisive
hierarchical clustering algorithms), depending on the type of algorithm,
until the desired number of communities are reached. A dendrogram
is a visual demonstration of how communities are merged or split using
hierarchical clustering. The Girvan-Newman [101] algorithm is specifically
designed for finding communities using divisive hierarchical clustering.
The assumption underlying this algorithm is that, if a network has a
set of communities and these communities are connected to one another
with a few edges, then all shortest paths between members of different
communities should pass through these edges. By removing these edges
(at times referred to as weak ties), we can recover (i.e., disconnect) communities in a network. To find these edges, the Girvan-Newman algorithm
uses a measure called edge betweenness and removes edges with higher edge
betweenness. For an edge e, edge betweenness is defined as the number of shortest paths between node pairs (v_i, v_j) such that the shortest path between v_i and v_j passes through e. For instance, in Figure 6.9(a), the edge betweenness of edge e(1, 2) is 6/2 + 1 = 4, because all the shortest paths from 2 to {4, 5, 6, 7, 8, 9} have to pass through either e(1, 2) or e(2, 3), and e(1, 2) is the shortest path between 1 and 2. Formally, the Girvan-Newman algorithm
is as follows:
1. Calculate edge betweenness for all edges in the graph.
2. Remove the edge with the highest betweenness.
3. Recalculate betweenness for all edges affected by the edge removal.
4. Repeat until all edges are removed.
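A compact sketch of these four steps is given below, using the networkx library (assumed to be available) for the edge-betweenness computation and stopping once a desired number of components is reached; the small graph at the end is illustrative and is not the graph of Figure 6.9.

import networkx as nx

def girvan_newman(G, target_communities=2):
    # Repeatedly remove the edge with the highest edge betweenness and
    # recompute betweenness, until the graph splits into enough components
    g = G.copy()
    while nx.number_connected_components(g) < target_communities and g.number_of_edges() > 0:
        betweenness = nx.edge_betweenness_centrality(g, normalized=False)
        edge = max(betweenness, key=betweenness.get)
        g.remove_edge(*edge)
    return list(nx.connected_components(g))

# Illustrative graph: two triangles connected by a single bridge edge
G = nx.Graph([(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)])
print(girvan_newman(G, target_communities=2))   # [{1, 2, 3}, {4, 5, 6}]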
Example 6.4. Consider the graph depicted in Figure 6.9(a). For this graph, the edge-betweenness values are as follows:

      1   2   3   4   5   6   7   8   9
 1  [ 0   4   1   9   0   0   0   0   0 ]
 2  [ 4   0   4   0   0   0   0   0   0 ]
 3  [ 1   4   0   9   0   0   0   0   0 ]
 4  [ 9   0   9   0  10  10   0   0   0 ]
 5  [ 0   0   0  10   0   1   6   3   0 ]   (6.25)
 6  [ 0   0   0  10   1   0   6   3   0 ]
 7  [ 0   0   0   0   6   6   0   2   8 ]
 8  [ 0   0   0   0   3   3   2   0   0 ]
 9  [ 0   0   0   0   0   0   8   0   0 ]

Therefore, by following the algorithm, the first edge that needs to be removed is e(4, 5) (or e(4, 6)). By removing e(4, 5), we compute the edge betweenness once again.
6.2
Community Evolution
6.2.1 How Networks Evolve
Large social networks are highly dynamic, where nodes and links appear
or disappear over time. In these evolving networks, many interesting patterns are observed; for instance, when distances (in terms of shortest path
distance) between two nodes increase, their probability of getting connected decreases.2 We discuss three common patterns that are observed in
evolving networks: segmentation, densification, and diameter shrinkage.
Network Segmentation
Often, in evolving networks, segmentation takes place, where the large
network is decomposed over time into three parts:
1. Giant Component: As network connections stabilize, a giant component of nodes is formed, with a large proportion of network nodes
and edges falling into this component.
2. Stars: These are isolated parts of the network that form star structures. A star is a tree with one internal node and n leaves.
3. Singletons: These are orphan nodes disconnected from all nodes in
the network.
Figure 6.11 depicts a segmented network and these three components.
Graph Densification
It is observed in evolving graphs that the density of the graph increases as
the network grows. In other words, the number of edges increases faster
than the number of nodes. This phenomenon is called densification. Let
V(t) denote the nodes at time t and E(t) the edges at time t; then

|E(t)| ∝ |V(t)|^α,   (6.26)

where α is the densification factor.
Figure 6.13: Diameter Shrinkage over Time for a Patent Citation Network
(from [167]).
Diameter Shrinkage
Another property observed in large networks is that the network diameter
shrinks in time. This property has been observed in random graphs as well
(see Chapter 4). Figure 6.13 depicts the diameter shrinkage for the same
patent network discussed in Figure 6.12.
In this section we discussed three phenomena that are observed in
evolving networks. Communities in evolving networks also evolve. They
appear, grow, shrink, split, merge, or even dissolve over time. Figure 6.14
depicts different situations that can happen during community evolution.
Both networks and their internal communities evolve over time. Given
evolution information (e.g., when edges or nodes are added), how can
we study evolving communities? And can we adapt static (nontemporal)
methods to use this temporal information? We discuss these questions
next.
6.2.2 Community Detection in Evolving Networks

In evolutionary clustering, it is assumed that communities do not change dramatically between consecutive time steps. Hence, in addition to the usual clustering cost at time t, we penalize clusterings X_t that deviate from the clustering X_{t−1} found at time t − 1. This temporal cost can be written as

(1/2) ||X_t X_t^T − X_{t−1} X_{t−1}^T||²   (6.30)
= (1/2) Tr( (X_t X_t^T − X_{t−1} X_{t−1}^T)^T (X_t X_t^T − X_{t−1} X_{t−1}^T) )   (6.31)
= (1/2) Tr( X_t X_t^T X_t X_t^T − 2 X_t X_t^T X_{t−1} X_{t−1}^T + X_{t−1} X_{t−1}^T X_{t−1} X_{t−1}^T )   (6.32)
= Tr( I − X_t^T X_{t−1} X_{t−1}^T X_t ),   (6.33)

where the last step uses X_t^T X_t = X_{t−1}^T X_{t−1} = I. Combining this temporal cost with the snapshot (normalized cut) cost, the overall objective can again be written as a trace minimization over X_t, now with the modified Laplacian

L̂ = I − D_t^{−1/2} A_t D_t^{−1/2} − (1 − α) X_{t−1} X_{t−1}^T,

where A_t and D_t are the adjacency and degree matrices at time t and α controls the trade-off between the snapshot and temporal costs. Similar to spectral clustering, X_t is then obtained from the eigenvectors of L̂ corresponding to its smallest eigenvalues.
Figure 6.15: Community Evaluation Example. Circles represent communities, and items inside the circles represent members. Each item is represented using a symbol (+, ×, or △) that denotes the item's true label.
Note that at time t, we can obtain X_t directly by solving spectral clustering for the Laplacian of the graph at time t, but then we would not be employing any temporal information. Using evolutionary clustering and the new Laplacian L̂, we incorporate temporal information into our community detection algorithm and prevent user memberships in communities at time t (X_t) from changing dramatically from those at time t − 1 (X_{t−1}).
6.3
Community Evaluation
When communities are found, one must evaluate how accurately the detection task has been performed. In terms of evaluating communities, the
task is similar to evaluating clustering methods in data mining. Evaluating
clustering is a challenge because ground truth may not be available. We
consider two scenarios: when ground truth is available and when it is not.
6.3.1 Evaluation with Ground Truth

When ground truth is available, we can evaluate the detected communities using pairwise measures such as precision and recall,

P = TP / (TP + FP),   (6.34)
R = TP / (TP + FN),   (6.35)

where TP, FP, and FN are the numbers of true positive, false positive, and false negative pairs of instances, respectively. Precision defines the fraction of pairs assigned to the same community that have been correctly assigned (i.e., share the same label). Recall defines the fraction, of all the pairs that should have been in the same community, that the community detection algorithm actually assigned to the same community.
Example 6.5. We compute these values for Figure 6.15. For TP, we need to count the pairs with the same label that are in the same community. For instance, for the majority label of community 1, we have (5 choose 2) = 10 such pairs. Therefore,

TP = (5 choose 2) + (6 choose 2) + ( (4 choose 2) + (2 choose 2) ) = 10 + 15 + (6 + 1) = 32,   (6.36)

where the terms correspond to communities 1, 2, and 3, respectively.
For FP, we need to count the dissimilar pairs that are in the same community. For instance, for community 1, this is (5 × 1 + 5 × 1 + 1 × 1). Therefore,

FP = (5 × 1 + 5 × 1 + 1 × 1) + (6 × 1) + (4 × 2) = 11 + 6 + 8 = 25,   (6.37)

where, again, the terms correspond to communities 1, 2, and 3.
(6.38)
,+
4,+
4,
,+
+,4
4,+
,+
,4
(6.39)
Hence,

P = 32 / (32 + 25) = 0.56,   (6.40)
R = 32 / (32 + 29) = 0.52.   (6.41)
F-Measure
To consolidate precision and recall into one measure, we can use the harmonic mean of precision and recall:
F = 2 · (P · R) / (P + R).   (6.42)
Purity = (1/N) ∑_{i=1}^{k} max_j |C_i ∩ L_j|,   (6.43)

where N is the total number of instances, C_i is the set of members of community i, and L_j is the set of instances with true label j.
Unfortunately, mutual information (MI) is unbounded; however, it is common for evaluation measures to take values in the range [0, 1]. To address this issue, we can normalize mutual information. We provide the following bound, without proof, which will help us normalize mutual information,

MI ≤ min( H(L), H(H) ),   (6.45)

where H(L) and H(H) are the entropies of the true labels L and the found communities H,

H(L) = − ∑_{l ∈ L} (n_l / n) log (n_l / n),   (6.46)
H(H) = − ∑_{h ∈ H} (n_h / n) log (n_h / n),   (6.47)

with n_l (n_h) the number of instances having label l (assigned to community h) and n the total number of instances. Equivalently,

MI ≤ √( H(H) H(L) ).   (6.49)

Normalizing MI by this bound gives the normalized mutual information (NMI),

NMI = MI / √( H(H) H(L) ).   (6.51)

An NMI value close to one indicates high similarity between the communities found and the labels. A value close to zero indicates a long distance between them.
6.3.2 Evaluation without Ground Truth
This approach is commonly used when two or more community detection algorithms are available. Each algorithm is run on the target network, and the quality measure is computed for the identified communities. The algorithm that yields a more desirable quality-measure value is considered the better algorithm. SSE (sum of squared errors) and intercluster distance are examples of such quality measures. For other measures, refer to Chapter 5.
We can also follow this approach for evaluating a single community
detection algorithm; however, we must ensure that the clustering quality
measure used to evaluate community detection is different from the measure used to find communities. For instance, when using node similarity
to group individuals, a measure other than node similarity should be used
to evaluate the effectiveness of community detection.
6.4
Summary
In this chapter, we discussed community analysis in social media, answering three general questions: (1) how can we detect communities, (2) how
do communities evolve and how can we study evolving communities, and
(3) how can we evaluate detected communities? We started with a description of communities and how they are formed. Communities in social
media are either explicit (emic) or implicit (etic). Community detection
finds implicit communities in social media.
We reviewed member-based and group-based community detection
algorithms. In member-based community detection, members can be
grouped based on their degree, reachability, and similarity. For example, when using degrees, cliques are often considered as communities.
Brute-force clique identification is used to identify cliques. In practice, due
to the computational complexity of clique identifications, cliques are either
relaxed or used as seeds of communities. k-Plex is an example of relaxed
cliques, and the clique percolation algorithm is an example of methods that
use cliques as community seeds. When performing member-based community detection based on reachability, three frequently used subgraphs
are the k-clique, k-club, and k-clan. Finally, in member-based community
detection based on node similarity, methods such as Jaccard and Cosine
similarity help compute node similarity. In group-based community detection, we described methods that find balanced, robust, modular, dense,
or hierarchical communities. When finding balanced communities, one
can employ spectral clustering. Spectral clustering provides a relaxed
solution to the normalized cut and ratio cut in graphs. For finding robust communities, we search for subgraphs that are hard to disconnect.
k-edge connected and k-vertex connected graphs are two examples of these robust subgraphs.
To find modular communities, one can use modularity maximization and
for dense communities, we discussed quasi-cliques. Finally, we provided
hierarchical clustering as a solution to finding hierarchical communities,
with the Girvan-Newman algorithm as an example.
In community evolution, we discussed how networks and, on a lower level, communities evolve. We also discussed how communities can be
detected in evolving networks using evolutionary clustering. Finally, we
presented how communities are evaluated when ground truth exists and
when it does not.
6.5
Bibliographic Notes
Community detection can also be performed for networks with multiple types of interaction (edges) [279, 280].
We also restricted our discussion to community detection algorithms that
use graph information. One can also perform community detection based
on the content that individuals share on social media. For instance, using
tagging relations (i.e., individuals who shared the same tag) [292], instead
of connections between users, one can discover overlapping communities,
which provides a natural summarization of the interests of the identified
communities.
In network evolution analysis, network segmentation is discussed in
[157]. Segment-based clustering [269] is another method not covered in
this chapter.
NMI was first introduced in [267]. In terms of clustering quality measures, the Davies-Bouldin measure [67], Rand index [236], C-index [76], Silhouette index [241], and Goodman-Kruskal index [106] can be used.
6.6
Exercises
Community Detection
2. Given a complete graph Kn , how many nodes will the clique percolation method generate for the clique graph for value k? How many
edges will it generate?
3. Find all k-cliques, k-clubs, and k-clans in a complete graph of size 4.
4. For a complete graph of size n, is it m-connected? What possible
values can m take?
5. Why is the smallest eigenvector meaningless when using an unnormalized laplacian matrix?
6. Modularity can be defined as

Q = (1/2m) ∑_{ij} [ A_ij − d_i d_j / 2m ] δ(c_i, c_j),   (6.52)

where δ(c_i, c_j) = 1 when v_i and v_j belong to the same community and 0 otherwise.
Community Evolution
8. What is the upper bound on the densification factor α? Explain.
Community Evaluation
9. Normalized mutual information (NMI) is used to evaluate community detection results when the actual communities (labels) are
known beforehand.
What are the maximum and minimum values for the NMI?
Provide details.
Explain how NMI works (describe the intuition behind it).
10. Compute NMI for Figure 6.15.
11. Why is high precision not enough? Provide an example to show that
both precision and recall are important.
12. Discuss situations where purity does not make sense.
13. Compute the following for Figure 6.17:
Chapter 7
Information Diffusion in Social Media

1. Sender(s). A sender or a small set of senders initiates the information diffusion process.
2. Receiver(s). A receiver or a set of receivers receive diffused information. Commonly, the set of receivers is much larger than the set of
senders and can overlap with the set of senders.
3. Medium. This is the medium through which the diffusion takes
place. For example, when a rumor is spreading, the medium can be
the personal communication between individuals.
This definition can be generalized to other domains. In a diseasespreading process, the disease is the analog to the information, and infection can be considered a diffusing process. The medium in this case is the
air shared by the infecter and the infectee. An information diffusion can
be interrupted. We define the process of interfering with information diffusion by expediting, delaying, or even stopping diffusion as intervention.
Individuals in online social networks are situated in a network where
they interact with others. Although this network is at times unavailable
or unobservable, the information diffusion process takes place in it. Individuals facilitate information diffusion by making individual decisions
that allow information to flow. For instance, when a rumor is spreading,
individuals decide if they are interested in spreading it to their neighbors.
They can make this decision either dependently (i.e., depending on the
information they receive from others) or independently. When they make
dependent decisions, it is important to gauge the level of dependence that individuals have on others. It could be local dependence, where an individual's decision depends on all of his or her immediate neighbors (friends), or global dependence, where all individuals in the network are observed before making decisions.
In this chapter, we present in detail four general types of information
diffusion: herd behavior, information cascades, diffusion of innovation, and
epidemics.
Herd behavior takes place when individuals observe the actions of all others and act in alignment with them. An information cascade describes the process of diffusion when individuals merely observe their immediate neighbors. In both information cascades and herd behavior, the network of individuals is observable; however, in herding, individuals decide based on global information (global dependence), whereas in information cascades they decide based on local information from their immediate neighbors (local dependence).
7.1
Herd Behavior
Consider people participating in an online auction. Individuals are connected via the auction's site, where they can not only observe the bidding behaviors of others but also often view profiles of others to get a feel for their reputation and expertise. Individuals often participate actively in
online auctions, even bidding on items that might otherwise be considered
unpopular. This is because they trust others and assume that the high
number of bids that the item has received is a strong signal of its value. In
this case, herd behavior has taken place.
Herd behavior, a term first coined by British surgeon Wilfred Trotter [283], describes a group of individuals performing actions that are aligned without previous planning. It has been observed in flocks, herds of animals, and in humans during sporting events, demonstrations, and religious gatherings, to name a few examples. In general, any herd behavior
requires two components:
1. connections between individuals
2. a method to transfer behavior among individuals or to observe their
behavior
Individuals can also make decisions that are aligned with others (mindless decisions) when they conform to social or peer pressure. A well-known
example is the set of experiments performed by Solomon Asch during the
1950s [17]. In one experiment, he asked groups of students to participate
in a vision test where they were shown two cards (Figure 7.2), one with
a single line segment and one with three lines, and told to match the line
segments with the same length.
Each participant was put into a group where all the other group members were actually collaborators with Asch, although they were introduced
as participants to the subject. Asch found that in control groups with no
pressure to conform, in which the collaborators gave the correct answer,
only 3% of the subjects provided an incorrect answer. However, when participants were surrounded by individuals providing an incorrect answer,
up to 32% of the responses were incorrect.
In contrast to this experiment, we refer to the process in which individuals consciously make decisions aligned with others, by observing the decisions of other individuals, as herding or herd behavior. In theory, herding arises under a set of conditions, including the following:
3. Decisions are not mindless, and people have private information that
helps them decide.
4. No message passing is possible. Individuals do not know the private
information of others, but can infer what others know from what
they observe from their behavior.
Anderson and Holt [11, 12] designed an experiment satisfying these
four conditions, in which students guess whether an urn containing red
and blue marbles is majority red or majority blue. Each student had access
to the guesses of students beforehand. Anderson and Holt observed a herd
behavior where students reached a consensus regarding the majority color
over time. It has been shown [78] that Bayesian modeling is an effective
technique for demonstrating why this herd behavior occurs. Simply put,
computing conditional probabilities and selecting the most probable majority color result in herding over time. We detail this experiment and how
conditional probabilities can explain why herding takes place next.
7.1.1 Bayesian Modeling of Herd Behavior
We start with the first student. If the marble selected is red, the prediction will be majority red; if blue, it will be majority blue. Assuming it was blue, on the board we have BOARD: {B}.
The second student can pick a blue or a red marble. If blue, he also predicts majority blue, because he knows that the previous student must have picked blue. If red, he knows that because he has picked red and the first student has picked blue, he can randomly assume majority red or majority blue. So, after the second student, we either have BOARD: {B, B} or BOARD: {B, R}.
Assume we end up with BOARD: {B, B}. In this case, if the third student
takes out a red ball, the conditional probability is higher for majority blue,
although she observed a red marble. Hence, a herd behavior takes place,
and on the board, we will have BOARD: {B,B,B}. From this student and
onward, independent of what is being observed, everyone will predict
majority blue. Let us demonstrate why this happens based on conditional
probabilities and our problem setting. In our problem, we know that the
first student predicts majority blue if P(majority blue | student's observation) > 1/2 and majority red otherwise. We also know from the experiment's setup that

P(majority blue) = P(majority red) = 1/2,   (7.1)
P(blue | majority blue) = P(red | majority red) = 2/3.   (7.2)
Using Bayes' theorem,

P(majority blue | blue) = P(blue | majority blue) P(majority blue) / P(blue),   (7.3)

where

P(blue) = P(blue | majority blue) P(majority blue) + P(blue | majority red) P(majority red) = (2/3)(1/2) + (1/3)(1/2) = 1/2;   (7.4)

therefore,

P(majority blue | blue) = ((2/3)(1/2)) / (1/2) = 2/3 > 1/2,   (7.5)

and the first student predicts majority blue after observing a blue marble.
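The same Bayesian update can be written as a small sketch, which also shows why the third student herds: two blue predictions outweigh her own red observation. The function name is illustrative.

def posterior_majority_blue(n_blue, n_red):
    # Bayes' rule with the priors and likelihoods of Equations 7.1 and 7.2:
    # P(majority blue) = 1/2 and P(blue | majority blue) = 2/3 per draw
    like_blue = (2/3) ** n_blue * (1/3) ** n_red      # P(observations | majority blue)
    like_red  = (1/3) ** n_blue * (2/3) ** n_red      # P(observations | majority red)
    return (like_blue * 0.5) / (like_blue * 0.5 + like_red * 0.5)

print(posterior_majority_blue(1, 0))   # 2/3, as in Equation 7.5
print(posterior_majority_blue(2, 1))   # 2/3: two blue signals outweigh one red observation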
7.1.2
Intervention
7.2
Information Cascades
7.2.1 Independent Cascade Model (ICM)
Considering nodes that are active as senders and nodes that are being
activated as receivers, in the independent cascade model (ICM) senders
activate receivers. Therefore, ICM is denoted as a sender-centric model. In
this model, the node that becomes active at time t has, in the next time step
t + 1, one chance of activating each of its neighbors. Let v be an active node
at time t. Then, for any neighbor w, there is a probability pv,w that node w
gets activated at t + 1. A node v that has been activated at time t has a single
chance of activating its neighbor w and that activation can only happen at
t + 1. We start with a set of active nodes and we continue until no further
activation is possible. Algorithm 7.1 details the process of the ICM model.
Algorithm 7.1 Independent Cascade Model (ICM)
Require: activation probabilities p_{v,w}; set of initially activated nodes A_0
  i = 0;
  while A_i ≠ ∅ do
    i = i + 1;
    A_i = {};
    for all v ∈ A_{i−1} do
      for all w neighbor of v, w ∉ ∪_{j=0}^{i} A_j do
        rand = generate a random number in [0, 1];
        if rand < p_{v,w} then
          activate w;
          A_i = A_i ∪ {w};
        end if
      end for
    end for
  end while
  A_∞ = ∪_{j=0}^{i} A_j;
  Return A_∞;
Example 7.2. Consider the network in Figure 7.4 as an example. The network is
undirected; therefore, we assume pv,w = pw,v . Since it is undirected, for any two
vertices connected via an edge, there is an equal chance of one activating the other.
Consider the network in step 1. The values on the edges denote pv,w s. The ICM
procedure starts with a set of nodes activated. In our case, it is node v1 . Each
activated node gets one chance of activating its neighbors. The activated node
generates a random number for each neighbor. If the random number is less than the respective p_{v,w} for that neighbor (see Algorithm 7.1, lines 9-11), the neighbor gets activated. The random numbers generated are shown in Figure 7.4 in the
form of inequalities, where the left-hand side is the random number generated and
the right-hand side is the pv,w . As depicted, by following the procedure after five
steps, five nodes get activated and the ICM procedure converges.
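A Python rendering of the ICM procedure is sketched below; the small undirected network and the uniform activation probabilities are illustrative and do not correspond to Figure 7.4.

import random

def independent_cascade(neighbors, p, seeds, rng=random.random):
    # neighbors: dict node -> list of neighbors; p: dict (v, w) -> activation probability
    active = set(seeds)            # all nodes activated so far
    frontier = set(seeds)          # nodes activated in the previous step
    while frontier:
        new = set()
        for v in frontier:
            for w in neighbors.get(v, []):
                if w not in active and rng() < p[(v, w)]:
                    new.add(w)     # v gets a single chance to activate w, at the next step
        active |= new
        frontier = new
    return active

# Illustrative undirected network with symmetric probabilities p_{v,w} = p_{w,v}
neighbors = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
p = {(v, w): 0.5 for v in neighbors for w in neighbors[v]}
print(independent_cascade(neighbors, p, seeds={1}))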
7.2.2 Maximizing the Spread of Cascades
Consider a network of users and a company that is marketing a product. The company is trying to advertise its product in the network. The
company has a limited budget; therefore, not all users can be targeted.
However, when users find the product interesting, they can talk with their
friends (immediate neighbors) and market the product. Their neighbors,
in turn, will talk about it with their neighbors, and as this process progresses, the news about the product is spread to a population of nodes in
the network. The company plans on selecting a set of initial users such
that the size of the final population talking about the product is maximized.
Formally, let S denote a set of initially activated nodes (seed set) in ICM.
Let f (S) denote the number of nodes that get ultimately activated in the
network if nodes in S are initially activated. For our ICM example depicted
in Figure 7.4, |S| = 1 and f (S) = 5. Given a budget k, our goal is to find a
set S such that its size is equal to our budget |S| = k and f (S) is maximized.
Since the activations in ICM depend on the random number generated
for each node (see line 9, Algorithm 7.1), it is challenging to determine
the number of nodes that ultimately get activated f (S) for a given set S.
In other words, the number of ultimately activated individuals can be
different depending on the random numbers generated. ICM can be made
deterministic (nonrandom) by generating these random numbers in the
beginning of the ICM process for the whole network. In other words, we
can generate a random number ru,w for any connected pair of nodes. Then,
whenever node v has a chance of activating u, instead of generating the
random number, it can compare ru,w with pv,w . Following this approach,
ICM becomes deterministic, and given any set of initially activated nodes
S, we can compute the number of ultimately activated nodes f (S).
Before finding S, we detail properties of f (S). The function f (S) is nonnegative because for any set of nodes S, in the worst case, no node gets
activated. It is also monotone:
f (S {v}) f (S).
(7.9)
This is because when a node is added to the set of initially activated nodes,
it either increases the number of ultimately activated nodes or keeps them
the same. Finally, f (S) is submodular. A set function f is submodular if Submodular
function
for any finite set N,
S T N, v N \ T, f (S {v}) f (S) f (T {v}) f (T).
229
(7.10)
To find the first node v, we compute f({v}) for all v. We start with node 1. At time 0, node 1 can only activate node 6, because

|1 − 6| ≡ 2 (mod 3).   (7.11)

(Formally, assuming P ≠ NP, there is no polynomial-time algorithm for this problem.)
At time 1, node 1 can no longer activate others, but node 6 is active and can
activate others. Node 6 has outgoing edges to nodes 4 and 5. From 4 and 5, node
6 can only activate 4:
|6 − 4| ≡ 2 (mod 3),   (7.13)
|6 − 5| ≢ 2 (mod 3).   (7.14)
7.2.3
Intervention
7.3
Diffusion of Innovations
Diffusion of innovations is a phenomenon observed regularly in social media. A music video going viral or a piece of news being retweeted many
times are examples of innovations diffusing across social networks. As
defined by Rogers [239], an innovation is an idea, practice, or object that
is perceived as new by an individual or other unit of adoption. Innovations are created regularly; however, not all innovations spread through
populations. The theory of diffusion of innovations aims to answer why
and how these innovations spread. It also describes the reasons behind
the diffusion process, the individuals involved, and the rate at which ideas
spread. In this section, we review characteristics of innovations that are
likely to be diffused through populations and detail well-known models
in the diffusion of innovations. Finally, we provide mathematical models
that can model the process of diffusion of innovations and describe how
we can intervene with these models.
7.3.1
Innovation Characteristics
paradigm to which it is being presented, should be observable under various trials (trialability), and should not be highly complex.
In terms of individual characteristics, many researchers [239, 127] claim
that the adopter should adopt the innovation earlier than other members
of his or her social circle (innovativeness).
7.3.2 Diffusion of Innovations Models
Elihu Katz, a professor of communication at the University of Pennsylvania, is a well-known figure in the study of the flow of information. In
addition to a study similar to the adoption of hybrid corn seed on how
physicians adopted the new tetracycline drug [59], Katz also developed a
two-step flow model (also known as the multistep flow model) [143] that describes how information is delivered through mass communication. The
basic idea is depicted in Figure 7.6. Most information comes from mass
media and is then directed toward influential figures called opinion leaders.
These leaders then convey the information (or form opinions) and act as
hubs for other members of the society.
7.3.3 Modeling Diffusion of Innovations
To effectively make use of the theories regarding the diffusion of innovations, we demonstrate a mathematical model for it in this section. The
model incorporates basic elements discussed so far and can be used to
effectively model a diffusion of innovations process. It can be concretely
described as
dA(t)/dt = i(t) [P − A(t)].   (7.15)
Here, A(t) denotes the total population that adopted the innovation
until time t. i(t) denotes the coefficient of diffusion, which describes the
innovativeness of the product being adopted, and P denotes the total number of potential adopters (until time t). This equation shows that the rate at
which the number of adopters changes throughout time depends on how
innovative is the product being adopted. The adoption rate only affects
the potential adopters who have not yet adopted the product. Since A(t)
is the total population of adopters until time t, it is a cumulative sum and
can be computed as follows:
A(t) = ∫_{t_0}^{t} a(t) dt,   (7.16)
where a(t) defines the adopters at time t. Let A_0 denote the number of adopters at time t_0. There are various methods of defining the diffusion coefficient [185]. One way is to define i(t) as a linear combination of the cumulative number of adopters at different times,

i(t) = α + β_{t_0} A(t_0) + · · · + β_t A(t) = α + ∑_{i=t_0}^{t} β_i A(i),   (7.17)

where the β_i's are the weights for each time step. Often a simplified version of this linear combination is used. In particular, the following three models for computing i(t) are considered in the literature:

i(t) = α,            External-Influence Model,   (7.18)
i(t) = β A(t),       Internal-Influence Model,   (7.19)
i(t) = α + β A(t),   Mixed-Influence Model,   (7.20)

where α represents the external influence and β the internal (imitation) influence. For the mixed-influence model, solving the resulting differential equation gives

A(t) = [ P − (α (P − A_0) / (α + β A_0)) e^{−(α + β P)(t − t_0)} ] / [ 1 + (β (P − A_0) / (α + β A_0)) e^{−(α + β P)(t − t_0)} ].   (7.26)
The internal-influence model is similar to the SI model discussed later in the section
on epidemics. For the sake of completeness, we provide solutions to both. Readers are
encouraged to refer to that model in Section 7.4 for further insight.
7.3.4
Intervention
Consider a faulty product being adopted. The product company is planning to stop or delay adoptions until the product is fixed and re-released.
This intervention can be performed by doing the following:
Limiting the distribution of the product or the audience that can adopt the
product. In our mathematical model, this is equivalent to reducing
the population P that can potentially adopt the product.
Reducing interest in the product being sold. For instance, the company can inform adopters of the faulty status of the product. In our models, this can be achieved by tampering with β: setting β to a very small value in Equation 7.22 results in a slow adoption rate.
7.4
Epidemics
A generalization of these techniques over networks can be found in [126, 125, 212].
7.4.1 Definitions
7.4.2
SI Model
We start with the most basic model. In this model, susceptible individuals get infected, and once infected, they never get cured. Denote β as the contact probability; in other words, the probability of a pair of individuals meeting in any time step is β. So, if β = 1, everyone comes into contact with everyone else, and if β = 0, no one meets another individual. Assume that when an infected individual meets a susceptible individual, the disease is spread with probability 1 (this can be generalized to other values). Figure 7.10 demonstrates the SI model and the transition between states that happens in this model for individuals. The value βI over the arrow shows that each susceptible individual meets, on average, βI infected individuals during the next time step.
Given this situation, each infected individual will meet βN people on average. We know that, from this set, only the fraction S/N will be susceptible and that the rest are already infected. So, each infected individual will infect βN S/N = βS others. Since I individuals are infected, βIS individuals will be infected in the next time step. This means that the number of susceptible individuals will be reduced by this quantity as well. So, to get the values of S and I at different times, we can solve the following differential equations:

dS/dt = −βIS,   (7.27)
dI/dt = βIS.   (7.28)

Since S + I = N at all times, we can eliminate one equation by replacing S with N − I:

dI/dt = βI(N − I).   (7.29)
Solving this differential equation gives the logistic growth function

I(t) = N I_0 e^{βNt} / ( N + I_0 (e^{βNt} − 1) ),   (7.30)

where I_0 is the number of individuals infected at time 0. In general, analyzing epidemics in terms of the number of infected individuals has nominal generalization power. To address this limitation, we can consider infected fractions. We therefore substitute i_0 = I_0/N in the previous equation,

i(t) = i_0 e^{βNt} / ( 1 + i_0 (e^{βNt} − 1) ).   (7.31)
Note that in the limit, the SI model infects all the susceptible population
because there is no recovery in the model. Figure 7.11(a) depicts the
logistic growth function (infected individuals) and susceptible individuals
for N = 100, I0 = 1, and = 0.003. Figure 7.11(b) depicts the infected
population for HIV/AIDS for the past 20 years. As observed, the infected
population can be approximated well with the logistic growth function
and follows the SI model. Note that in the HIV/AIDS graph, not everyone
is getting infected. This is because not everyone in the United States is in
the susceptible population, so not everyone will get infected in the end.
Moreover, there are other factors that are far more complex than the details
of the SI model that determine how people get infected with HIV/AIDS.
7.4.3
SIR Model
The SIR model, first introduced by Kermack, and McKendrick [148], adds
more detail to the standard SI model. In the SIR model, in addition to the
I and S states, a recovery state R is present. Figure 7.12 depicts the model.
In the SIR model, hosts get infected, remain infected for a while, and then
recover. Once hosts recover (or are removed), they can no longer get
infected and are no longer susceptible. The process by which susceptible
individuals get infected is similar to the SI model, where a parameter
defines the probability of contacting others. Similarly, a parameter
in the SIR model defines how infected people recover, or the recovering
probability of an infected individual in a time period t.
In terms of differential equations, the SIR model is
dS
= IS,
dt
dI
= IS I,
dt
dR
= I.
dt
(7.32)
(7.33)
(7.34)
Equation 7.32 is identical to that of the SI model (Equation 7.27). Equation 7.33 is different from Equation 7.28 of the SI model by the addition
of the term I, which defines the number of infected individuals who recovered. These are removed from the infected set and are added to the
recovered ones in Equation 7.34. Dividing Equation 7.32 by Equation 7.34,
we get
dS
= S,
dR
(7.35)
S0
= R.
S
245
(7.36)
S0 = Se R
S = S0 e
(7.37)
(7.38)
(7.39)
dR
= (N S0 e R R).
dt
(7.40)
If we solve this equation for R, then we can determine S from 7.38 and
I from I = N R S. The solution for R can be computed by solving the
following integration:
Z
1 R
dx
.
(7.41)
t=
0 N S e x x
0
However, there is no closed-form solution to this integration, and only
numerical approximation is possible. Figure 7.13 depicts the behavior of
the SIR model for a set of initial parameters.
The two models in the next two subsections are generalized versions
of the two models discussed thus far: SI and SIR. These models allow
individuals to have temporary immunity and to get reinfected.
7.4.4
SIS Model
The SIS model is the same as the SI model, with the addition of infected
nodes recovering and becoming susceptible again (see Figure 7.14). The
differential equations describing the model are
dS
= I IS,
dt
dI
= IS I.
dt
(7.42)
(7.43)
(7.44)
Figure 7.15: SIS Model Simulated with S0 = 99, I0 = 1, = 0.01, and = 0.1.
7.4.5
SIRS Model
The final model analyzed in this section is the SIRS model. Just as the
SIS model extends the SI, the SIRS model extends the SIR, as shown in
Figure 7.16. In this model, the assumption is that individuals who have
recovered will lose immunity after a certain period of time and will become
susceptible again. A new parameter has been added to the model that
defines the probability of losing immunity for a recovered individual. The
(7.45)
(7.46)
(7.47)
Like the SIR model, this model has no closed-form solution, so numerical integration can be used. Figure 7.17 demonstrates a simulation of the
SIRS model with given parameters of choice. As observed, the simulation
outcome is similar to the SIR model simulation (see Figure 7.13). The major
difference is that in the SIRS, the number of susceptible and recovered individuals changes non-monotonically over time. For example, in SIRS, the
number of susceptible individuals decreases over time, but after reaching
the minimum count, starts increasing again. On the contrary, in the SIR,
both susceptible individuals and recovered individuals change monotonically, with the number of susceptible individuals decreasing over time and
that of recovered individuals increasing over time. In both SIR and SIRS,
the infected population changes non-monotonically.
7.4.6
Intervention
250
7.5
Summary
251
7.6
Bibliographic Notes
The concept of the herd has been well studied in psychology by Freud
(crowd psychology), Carl Gustav Jung (the collective unconscious), and
Gustave Le Bon (the popular mind). It has also been observed in economics
by Veblen [288] and in studies related to the bandwagon effect [240, 259, 165].
The behavior is also discussed in terms of sociability [258] in sociology.
Herding, first coined by Banerjee [23], at times refers to a slightly different concept. In herd behaviour discussed in this chapter, the crowd does
not necessarily start with the same decision, but will eventually reach one,
whereas in herding the same behavior is usually observed. Moreover, in
herd behavior, individuals decide whether the action they are taking has
some benefits to themselves or is rational, and based on that, they will
align with the population. In herding, some level of uncertainty is associated with the decision, and the individual does not know why he or she is
following the crowd.
Another confusion is that the terms herd behavior/herding is often
used interchangebly with information cascades [37, 299]. To avoid this
problem, we clearly define both in the chapter and assume that in herd
behavior, decisions are taken based on global information, whereas in
information cascades, local information is utilized.
Herd behavior has been studied in the context of financial markets
[60, 74, 38, 69] and investment [250]. Gale analyzes the robustness of
different herd models in terms of different constraints and externalities [93],
and Shiller discusses the relation between information, conversation, and
herd behavior [256]. Another well-known social conformity experiment
was conducted in Manhattan by Milgram et al. [195].
Other recent applications of threshold models can be found in [307,
295, 296, 285, 286, 252, 232, 202, 184, 183, 108, 34]. Bikhchandani et al.
[1998] review conformity, fads, and information cascades and describe
how observing past human decisions can help explain human behavior.
Hirshleifer [128] provides information cascade examples in many fields,
including zoology and finance.
In terms of diffusion models, Robertson [238] describes the process and
Hagerstrand et al. [118] introduce a model based on the spatial stages
of the diffusion of innovations and Monte Carlo simulation models for
diffusion of innovations. Bass [30] discusses a model based on differential
equations. Mahajan and Peterson [187] extend the Bass model.
252
253
7.7
Exercises
1. Discuss how different information diffusion modeling techniques differ. Name applications on social media that can make use of methods
in each area.
Herd Effect
2. What are the minimum requirements for a herd behavior experiment?
Design an experiment of your own.
Diffusion of Innovation
3. Simulate internal-, external-, and mixed-influence models in a program. How are the saturation levels different for each model?
4. Provide a simple example of diffusion of innovations and suggest a
specific way of intervention to expedite the diffusion.
Information Cascades
5. Briefly describe the independent cascade model (ICM).
6. What is the objective of cascade maximization? What are the usual
constraints?
7. Follow the ICM procedure until it converges for the following graph.
Assume that node i activates node j when i j 1 (mod 3) and node
5 is activated at time 0.
254
Epidemics
8. Discuss the mathematical relationship between the SIR and the SIS
models.
9. Based on our assumptions in the SIR model, the probability that an
individual remains infected follows a standard exponential distribution. Describe why this happens.
10. In the SIR model, what is the most likely time to recover based on the
value of ?
11. In the SIRS model, compute the length of time that an infected individual is likely to remain infected before he or she recovers.
12. After the model saturates, how many are infected in the SIS model?
255
256
Part III
Applications
257
Chapter
Assortativity
Influence,
Homophily, and
Confounding
8.1
Measuring Assortativity
Figure 8.2: A U.S. High School Friendship Network in 1994 between Races.
Eighty percent of the links exist between members of the same race (from
[62]).
To measure assortativity, we measure the number of edges that fall in
between the nodes of the same race. This technique works for nominal
attributes, such as race, but does not work for ordinal ones such as age.
Consider a network where individuals are friends with people of different
ages. Unlike races, individuals are more likely to be friends with others
close in age, but not necessarily with ones of the exact same age. Hence,
1
261
we discuss two techniques: one for nominal attributes and one for ordinal
attributes.
8.1.1
(8.1)
Assortativity
Significance
This measure has its limitations. Consider a school of Hispanic students. Obviously, all connections will be between Hispanics, and assortativity value 1 is not a significant finding. However, consider a school where
half the population is white and half the population is Hispanic. It is statistically expected that 50% of the connections will be between members of
different race. If connections in this school were only between whites and
Hispanics and not within groups, then our observation is significant. To
account for this limitation, we can employ a common technique where we
measure the assortativity significance by subtracting the measured assortativity by the statistically expected assortativity. The higher this value, the
more significant the assortativity observed.
Consider a graph G(V, E), |E| = m, where the degrees are known beforehand (how many friends an individual has), but the edges are not.
Consider two nodes vi and v j , with degrees di and d j , respectively. What is
the expected number of edges between these two nodes? Consider node
2
262
vi . For any edge going out of vi randomly, the probability of this edge
d
d
getting connected to node v j is P jdi = 2mj . Since the degree for vi is di , we
i
have di such edges; hence, the expected number of edges between vi and
di d j
v j is 2m
. Now, the expected number of edges between vi and v j that are of
dd
i j
the same type is 2m
( t(vi ), t(v j ) ) and the expected number of edges of the
same type in the whole graph is
1 X di d j
1 X di d j
( t(vi ), t(v j ) ) =
( t(vi ), t(v j ) ).
m (v ,v )E 2m
2m i j 2m
i
(8.3)
1 X
1 X di d j
Ai j ( t(vi ), t(v j ) )
( t(vi ), t(v j ) )
2m i j
2m i j 2m
(8.4)
di d j
1 X
) ( t(vi ), t(v j ) ).
( Ai j
2m i j
2m
(8.5)
Q
Qmax
(8.6)
dd
i j
Ai j 2m
) ( t(vi ), t(v j ) )
=
P
P di d j
1
1
max[ 2m
i j Ai j ( t(vi ), t(v j ) ) 2m
i j 2m ( t(vi ), t(v j ) )]
1
2m
i j(
(8.7)
=
di d j
i j ( Ai j 2m )
P di d j
1
1
2m 2m
i j 2m
2m
1
2m
( t(vi ), t(v j ) )
( t(vi ), t(v j ) )
i j(
(8.8)
dd
i j
Ai j 2m
) ( t(vi ), t(v j ) )
.
P di d j
2m i j 2m
( t(vi ), t(v j ) )
P
=
263
(8.9)
Therefore, (T )i,j = (t(vi ), t(v j )). Let B = A ddT /2m denote the
modularity matrix where d Rn1 is the degree vector for all nodes. Given
that the trace of multiplication of two matrices X and YT is Tr(XYT ) =
P
i,j Xi,j Yi,j and Tr(XY) = Tr(YX), modularity can be reformulated as
Q =
di d j
1
1 X
( Ai j
) ( t(vi ), t(v j ) ) =
Tr(BT )
2m i j
2m | {z } 2m
| {z }
(T )i,j
Bij
1
Tr(T B).
(8.12)
2m
Example 8.1. Consider the bipartite graph in Figure 8.3. For this bipartite graph,
=
0
0
A =
1
0
0
1
1
1
1
0
0
1
1
0
0
1
1
=
0
0
264
0
0
1
1
2
2
d = , m = 4.
2
2
(8.13)
0.5
0.5
B = A ddT /2m =
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5 0.5 0.5
0.5 0.5 0.5
(8.14)
(8.15)
In this example, all edges are between nodes of different color. In other words,
the number of edges between nodes of the same color is less than the expected
number of edges between them. Therefore, the modularity value is negative.
8.1.2
18
21
XL =
,
21
20
21
18
XR =
.
20
21
(8.16)
In other words, XL represents the ordinal values associated with the left
node of the edges, and XR represents the values associated with the right
node of the edges. Our problem is therefore reduced to computing the covariance between variables XL and XR . Note that since we are considering
an undirected graph, both edges (vi , v j ) and (v j , vi ) exist; therefore, xi and
x j are observed in both XL and XR . Thus, XL and XR include the same set of
values but in a different order. This implies that XL and XR have the same
mean and standard deviation.
E(XL ) = E(XR ),
(XL ) = (XR ).
(8.17)
(8.18)
Since we have m edges and each edge appears twice for the undirected
graph, then XL and XR have 2m elements. Each value xi appears di times
since it appears as endpoints of di edges. The covariance between XL and
XR is
(XL , XR ) =
=
=
=
P
=
ij
Ai j xi x j
ij
di d j xi x j
2m
(2m)2
X
di d j
1
=
( Ai j
)xi x j .
2m i j
2m
(8.22)
Correlation
(8.23)
(XL , XR )
,
(XL )2
P
1
i j ( Ai j
2m
di d j
2m
)xi x j
(8.24)
Note the similarity between Equations 8.9 and 8.24. Although modularity is used for nominal attributes and correlation for ordinal attributes,
the major difference between the two equations is that the function in
modularity is replaced by xi x j in the correlation equation.
Example 8.2. Consider Figure 8.4 with values demonstrating the attributes associated with each node. Since this graph is undirected, we have the following
edges:
E = {(a, c), (c, a), (c, b), (b, c)}.
(8.25)
The correlation is between the values associated with the endpoints of the edges.
Consider XL as the value of the left end of an edge and XR as the value of the right
end of an edge:
18
21
21
18
XL =
(8.26)
, XR =
20
21
20
21
The correlation between these two variables is (XL , XR ) = 0.67.
267
8.2
Influence
8.2.1
Prediction-Based
Influence Measures
Measuring Influence
Observation-based
Influence Measures
Observation-Based Measures. In observation-based measures, we quantify the influence of an individual by measuring the amount of influence
attributed to him. An individual can influence differently in diverse settings, and so, depending on the context, the observation-based measuring
of influence changes. We next describe three different settings and how
influence can be measured in each.
1. When an individual is the role model. This happens in the case of
individuals in the fashion industry, teachers, and celebrities. In this
case, the size of the audience that has been influenced due to that
fashion, charisma, or the like could act as an accurate measure. A
3
268
Influence Flow
|Ip |
X
m=1
I(Pm ) wout
|Op |
X
I(Pn ),
(8.27)
n=1
where I(.) denotes the influence of a blogpost and win and wout are the
weights that adjust the contribution of in- and out-links, respectively. In
this equation, Pm s are blogposts that point to post p, and Pn s are blogposts
270
that are referred to in post p. Influence flow describes a measure that only
accounts for in-links (recognition) and out-links (novelty). To account for
the other two factors, we design the influence of a blogpost p as
I(p) = wlength lp (wcomment cp + InfluenceFlow(p)).
(8.28)
Here, wlength is the weight for the length of blogpost4 . wcomment describes
how the number of comments is weighted. Note that the four weights win ,
wout , wcomments , and wlength need to be tuned to make the model more accurate. This tuning can be done by a variety of techniques. For instance, we
can use a test system where the influential posts are already known (labeled
data) to tune them.5 Finally, a bloggers influence index (iIndex) can be defined as the maximum influence value among all his or her N blogposts,
iIndex = max I(pn ).
pn N
(8.29)
Computing iIndex for a set of bloggers over all their blogposts can help
identify and rank influential bloggers in a system.
Measuring Social Influence on Twitter. On Twitter, a microblogging platform, users receive tweets from other users by following them. Intuitively,
we can think of the number of followers as a measure of influence (indegree centrality). In particular, three measures are frequently used to
quantify influence in Twitter,
1. In-degree: the number of users following a person on Twitter. As
discussed, the number of individuals who are interested in someones
tweets (i.e., followers) is commonly used as an influence measure on
Twitter. In-degree denotes the audience size of an individual.
2. Number of mentions: the number of times an individual is mentioned in tweets. Mentioning an individual with a username handle
is performed by including @username in a tweet. The number of
times an individual is mentioned can be used as an influence measure. The number of mentions denotes the ability in engaging others
in conversation [50].
4
In the original paper, the authors utilize a weight function instead. Here, for clarity,
we use coefficients for all parameters.
5
Note that Equation 8.28 is defined recursively, because I(p) depends on InfluenceFlow
and that, in turn, depends on I(p) (Equation 8.27). Therefore, to estimate I(p), we can
use iterative methods where we start with an initial value for I(p) and compute until
convergence.
271
Table 8.1: Rank Correlation between Top 10% of Influentials for Different
Measures on Twitter
Measures
Correlation Value
In-degree vs. retweets
0.122
In-degree vs. mentions
0.286
Retweets vs. mentions
0.638
3. Number of retweets: the number of times tweets of a user are
retweeted. Individuals on Twitter have the opportunity to forward
tweets to a broader audience via the retweet capability. Clearly, the
more ones tweets are retweeted, the more likely one is influential.
The number of retweets indicates an individuals ability to generate
content that is worth being passed along.
Spearmans Rank
Correlation
8.2.2
Modeling Influence
In influence modeling, the goal is to design models that can explain how
individuals influence one another. Given the nature of social media, it is
safe to assume that influence takes place among connected individuals.
At times, this network is observable (explicit networks), and at others
times, it is unobservable (implicit networks). For instance, in referral
networks, where people refer others to join an online service on social
media, the network of referrals is often observable. In contrast, people are
influenced to buy products, and in most cases, the seller has no information
on who referred the buyer, but does have approximate estimates on the
number of products sold over time. In the observable (explicit) network,
we resort to threshold models such as the linear threshold model (LTM) Linear
to model influence; in implicit networks, we can employ methods such as Threshold
the linear influence model (LIM) that take the number of individuals who Model (LTM)
get influenced at different times as input (e.g., the number of buyers per
week).
Modeling Influence in Explicit Networks
Threshold models are simple yet effective methods for modeling influence
in explicit networks. In these models, nodes make decision based on the
number or the fraction (the threshold) of their neighbors (or incoming
neighbors in a directed graph) who have already decided to make the
same decision. Threshold models were employed in the literature as early
as the 1970s in the works of Granovetter [109] and Schelling [251]. Using
a threshold model, Schelling demonstrated that minor local preferences in
having neighbors of the same color leads to global racial segregation.
A linear threshold model (LTM) is an example of a threshold model.
Assume a weighted directed graph where nodes v j and vi are connected
with weight w j,i 0. This weight denotes how much node v j can affect
node vi s decision. We also assume
X
w j,i 1,
(8.31)
v j Nin (vi )
where Nin (vi ) denotes the incoming neighbors of node vi . In a linear threshold model, each node vi is assigned a threshold i such that when the
amount of influence exerted toward vi by its active incoming neighbors is
273
where At1 denotes the set of active nodes at the end of time t 1. The
threshold values are generally assigned uniformly at random to nodes
from the interval [0,1]. Note that the threshold i defines how resistant to
change node vi is: a very small i value might indicate that a small change
in the activity of vi s neighborhood results in vi becoming active and a large
i shows that vi resists changes.
Provided a set of initial active nodes A0 and a graph, the LTM algorithm
is shown in Algorithm 8.1. In each step, for all inactive nodes, the condition
in Equation 8.32 is checked, and if it is satisfied, the node becomes active.
The process ends when no more nodes can be activated. Once thresholds
are fixed, the process is deterministic and will always converge to the same
state.
274
Figure 8.5: Linear Threshold Model (LTM) Simulation. The values attached to nodes denote thresholds i , and the values on the edges represent
weights wi, j .
Example 8.3. Consider the graph in Figure 8.5. Values attached to nodes represent the LTM thresholds, and edge values represent the weights. At time 0, node
v1 is activated. At time 2, both nodes v2 and v3 receive influence from node v1 .
Node v2 is not activated since 0.5 < 0.8 and node v3 is activated since 0.8 > 0.7.
Similarly, the process continues and then stops with five activated nodes.
Figure 8.6: The Size of the Influenced Population as a Summation of Individuals Influenced by Activated Individuals (from [306]).
is the set of influenced population P(t) at any time and the time tu , when
each individual u gets initially influenced (activated). We assume that any
influenced individual u can influence I(u, t) number of non-influenced (inactive) individuals after t time steps. We call I(., .) the influence function.
Assuming discrete time steps, we can formulate the size of the influenced
population |P(t)|:
X
|P(t)| =
I(u, t tu ).
(8.33)
uP(t)
Figure 8.6 shows how the model performs. Individuals u, v, and w are
activated at time steps tu , tv , and tw , respectively. At time t, the total number
of influenced individuals is a summation of influence functions Iu , Iv , and
Iw at time steps t tu , t tv , and t tw , respectively. Our goal is to estimate
I(., .) given activation times and the number of influenced individuals at all
times. A simple approach is to utilize a probability distribution to estimate
I function. For instance, we can employ the power-law distribution to
276
|V| X
T
X
(8.34)
u=1 t=1
(8.35)
It is common to assume that individuals can only activate other individuals and cannot stop others from becoming activated. Hence, negative
values for influence do not make sense; therefore, we would like measured
influence values to be positive I 0,
minimize ||P AI||22
subject to
I 0.
(8.36)
(8.37)
This formulation is similar to regression coefficients computation outlined in Chapter 5, where we compute a least square estimate of I; however,
this formulation cannot be solved using regression techniques studied earlier because, in regression, computed I values can become negative. In
practice, this formulation can be solved using non-negative least square
methods (see [164] for details).
277
8.3
Homophily
8.3.1
Measuring Homophily
8.3.2
Modeling Homophily
Note that we have assumed that homophily is the leading social force in the network
and that it leads to its assortativity change. This assumption is often strong for social
networks because other social forces act in these networks.
278
8.4
8.4.1
Social
Correlation
Shuffle Test
280
or equivalently,
p(a)
) = a + ,
(8.41)
1 p(a)
where measures the social correlation and denotes the activation bias.
For computing the number of already active nodes of an individual, we
need to know the activation time stamps of the nodes.
Let ya,t denote the number of individuals who became activated at time
t and had a active friends and let na,t denote the ones
active
P who had a P
friends but did not get activated at time t. Let ya = t ya,t and na = t na,t .
We define the likelihood function as
Y
p(a) ya (1 p(a))na .
(8.42)
ln(
To estimate and , we find their values such that the likelihood function denoted in Equation 8.42 is maximized. Unfortunately, there is no
closed-form solution, but there exist software packages that can efficiently
compute the solution to this optimization.8
Let tu denote the activation time (when a node is first influenced) of
node u. When activated node u influences nonactivated node v, and v
is activated, then we have tu < tv . Hence, when temporal information is
available about who activated whom, we see that influenced nodes are
activated at a later time than those who influenced them. Now, if there is
no influence in the network, we can randomly shuffle the activation time
stamps, and the predicted should not change drastically. So, if we shuffle
activation time stamps and compute the correlation coefficient 0 and its
value is close to the computed in the original unshuffled dataset (i.e.,
| 0 | is small), then the network does not exhibit signs of social influence.
8.4.2
Edge-Reversal Test
Note that maximizing this term is equivalent to maximizing the logarithm; this is
where Equation 8.41 comes into play.
281
Figure 8.7: The Effect of Influence and Homophily on Attributes and Links
over Time (reproduced from [161]).
8.4.3
Randomization Test
Unlike the other two tests, the randomization test [161] is capable of detecting both influence and homophily in networks. Let X denote the attributes
associated with nodes (age, gender, location, etc.) and Xt denote the attributes at time t. Let Xi denote attributes of node vi . As mentioned before,
in influence, individuals already linked to one another change their attributes (e.g., a user changes habits), whereas in homophily, attributes do
not change but connections are formed due to similarity. Figure 8.7 demonstrates the effect of influence and homophily in a network over time.
The assumption is that, if influence or homophily happens in a network, then networks become more assortative. Let A(Gt , Xt ) denote the
assortativity of network G and attributes X at time t. Then, the network
becomes more assortative at time t + 1 if
A(Gt+1 , Xt+1 ) A(Gt , Xt ) > 0.
(8.43)
(8.44)
(8.45)
Homophily
Significance Test
then used to compute influence gains {gi }ni=1 . Obviously, the more distant
g0 is from these gains, the more significant influence is. We can assume
that whenever g0 is smaller than /2% (or larger than 1 /2%) of {gi }ni=1
values, it is significant. The value of is set empirically.
Similarly, in the homophily significance test, we compute the original
homophily gain and construct random graph links GRit+1 at time t + 1, such
that no homophily effect is exhibited in how links are formed. To perform
this for any two (randomly selected) links ei j and ekl formed in the original
Gt+1 graph, we form edges eil and ek j in GRit+1 . This is to make sure that the
homophily effect is removed and that the degrees in GRit+1 are equal to that
of Gt+1 .
284
8.5
Summary
Individuals are driven by different social forces across social media. Two
such important forces are influence and homophily.
In influence, an individuals actions induce her friends to act in a similar
fashion. In other words, influence makes friends more similar. Homophily
is the tendency for similar individuals to befriend each other. Both influence and homophily result in networks where similar individuals are
connected to each other. These are assortative networks. To estimate the
assortativity of networks, we use different measures depending on the
attribute type that is tested for similarity. We discussed modularity for
nominal attributes and correlation for ordinal ones.
Influence can be quantified via different measures. Some are predictionbased, where the measure assumes that some attributes can accurately predict how influential an individual will be, such as with in-degree. Others
are observation-based, where the influence score is assigned to an individual based on some history, such as how many individuals he or she has
influenced. We also presented case studies for measuring influence in the
blogosphere and on Twitter.
Influence is modeled differently depending on the visibility of the network. When network information is available, we employ threshold models such as the linear threshold model (LTM), and when network information is not available, we estimate influence rates using the linear influence
model (LIM). Similarly, homophily can be measured by computing the assortativity difference in time and modeled using a variant of independent
cascade models.
Finally, to determine the source of assortativity in social networks, we
described three statistical tests: the shuffle test, the edge-reversal test, and
the randomization test. The first two can determine if influence is present
in the data, and the last one can determine both influence and homophily.
All tests require temporal data, where activation times and changes in
attributes and links are available.
285
8.6
Bibliographic Notes
286
8.7
Exercises
1. State two common factors that explain why connected people are
similar or vice versa.
Measuring Assortativity
2.
Influence
3. Does the linear threshold model (LTM) converge? Why?
4. Follow the LTM procedure until convergence for the following graph.
Assume all the thresholds are 0.5 and node v1 is activated at time 0.
287
Homophily
6. Design a measure for homophily that takes into account assortativity
changes due to influence.
288
Chapter
(9.1)
In other words, the algorithm learns a function that assigns a real value
to each user-item pair (u, i), where this value indicates how interested user
u is in item i. This value denotes the rating given by user u to item i. The
recommendation algorithm is not limited to item recommendation and
289
9.1
Challenges
Recommendation systems face many challenges, some of which are presented next:
Cold-Start Problem. Many recommendation systems use historical data or information provided by the user to recommend items,
products, and the like. However, when individuals first join sites,
they have not yet bought any product: they have no history. This
makes it hard to infer what they are going to like when they start
on a site. The problem is referred to as the cold-start problem. As
an example, consider an online movie rental store. This store has no
idea what recently joined users prefer to watch and therefore cannot
recommend something close to their tastes. To address this issue,
these sites often ask users to rate a couple of movies before they begin recommend others to them. Other sites ask users to fill in profile
290
several items are bought together by many users, the system recommends these to new users items together. However, the system does
not know why these items are bought together. Individuals may
prefer some reasons for buying items; therefore, recommendation
algorithms should provide explanation when possible.
9.2
9.2.1
Content-Based Methods
ui,l i j,l
q
Pk
2
l=1
ui,l
l=1 i j,l
(9.2)
2
John
Joe
Jill
Jane
Jorge
9.2.2
Lion King
3
5
1
3
2
Aladdin
0
4
2
?
2
Mulan
3
0
4
1
0
Anastasia
3
2
2
0
1
Collaborative filtering is another set of classical recommendation techniques. In collaborative filtering, one is commonly given a user-item matrix where each entry is either unknown or is the rating assigned by the
user to an item. Table 9.1 is an user-item matrix where ratings for some cartoons are known and unknown for others (question marks). For instance,
on a review scale of 5, where 5 is the best and 0 is the worst, if an entry (i, j)
in the user-item matrix is 4, that means that user i liked item j.
In collaborative filtering, one aims to predict the missing ratings and
possibly recommend the cartoon with the highest predicted rating to the
user. This prediction can be performed directly by using previous ratings
in the matrix. This approach is called memory-based collaborative filtering
because it employs historical data available in the matrix. Alternatively,
one can assume that an underlying model (hypothesis) governs the way
users rate items. This model can be approximated and learned. After
the model is learned, one can use it to predict other ratings. The second
approach is called model-based collaborative filtering.
293
Neighborhood
(9.4)
Example 9.1. In Table 9.1, rJane,Aladdin is missing. The average ratings are the
following:
rJohn =
rJoe =
rJill =
rJane =
rJorge =
3+3+0+3
= 2.25
4
5+4+0+2
= 2.75
4
1+2+4+2
= 2.25
4
3+1+0
= 1.33
3
2+2+0+1
= 1.25.
4
(9.6)
(9.7)
(9.8)
(9.9)
(9.10)
Using cosine similarity (or Pearson correlation), the similarity between Jane
and others can be computed:
33+13+03
10 27
35+10+02
sim(Jane, Joe) =
10 29
31+14+02
sim(Jane, Jill) =
10 21
32+10+01
sim(Jane, Jorge) =
10 5
sim(Jane, John) =
= 0.73
(9.11)
= 0.88
(9.12)
= 0.48
(9.13)
= 0.84.
(9.14)
Now, assuming that the neighborhood size is 2, then Jorge and Joe are the two
most similar neighbors. Then, Janes rating for Aladdin computed from user-based
collaborative filtering is
rJane,Aladdin = rJane +
+
= 1.33 +
(9.15)
sim(Aladdin, Mulan) =
sim(Aladdin, Anastasia) =
Now, assuming that the neighborhood size is 2, then Lion King and Anastasia
are the two most similar neighbors. Then, Janes rating for Aladdin computed
from item-based collaborative filtering is
sim(Aladdin, Lion King)(rJane,Lion King rLion King )
sim(Aladdin, Lion King) + sim(Aladdin, Anastasia)
sim(Aladdin, Anastasia)(rJane,Anastasia rAnastasia )
+
sim(Aladdin, Lion King) + sim(Aladdin, Anastasia)
0.84(3 2.8) + 0.67(0 1.6)
= 2+
= 1.40.
(9.24)
0.84 + 0.67
rJane,Aladdin = rAladdin +
(9.25)
(9.26)
U =
0.4110 0.6626
0.6207
0.3251
0.2373
0.1572
0
0
8.0265
0
4.3886
0
0
0
2.0777
0
0
0
T
0.2872 0.9290
V = 0.2335
0.6181
0.7814
0.0863
0.1093
0.4099
0.0820
0.9018
(9.27)
(9.28)
(9.29)
Uk
k
VkT
"
=
"
=
0.4151 0.4754
0.7437
0.5278
0.4110 0.6626
0.3251
0.2373
#
8.0265
0
0
4.3886
#
0.7506 0.5540 0.3600
.
0.2335
0.2872 0.9290
(9.30)
(9.31)
(9.32)
The rows of Uk represent users. Similarly the columns of VkT (or rows of Vk )
represent items. Thus, we can plot users and items in a 2-D figure. By plotting
1
299
9.2.3
All methods discussed thus far are used to predict a rating for item i
for an individual u. Advertisements that individuals receive via email
marketing are examples of this type of recommendation on social media.
However, consider ads displayed on the starting page of a social media site.
These ads are shown to a large population of individuals. The goal when
showing these ads is to ensure that they are interesting to the individuals
300
1X
ru,i .
n uG
(9.33)
301
(9.34)
(9.35)
uG
Since we recommend items that have the highest Ri values, this strategy
guarantees that the items that are being recommended to the group are
enjoyed the most by at least one member of the group.
Example 9.4. Consider the user-item matrix in Table 9.3. Consider group G =
{John, Jill, Juan}. For this group, the aggregated ratings for all products using
average satisfaction, least misery, and most pleasure are as follows.
Table 9.3: User-Item Matrix
Soda Water Tea Coffee
John
1
3
1
1
Joe
4
3
1
2
Jill
2
2
4
2
Jorge
1
1
3
5
Juan
3
3
4
5
Average Satisfaction:
1+2+3
3
3+2+3
=
3
1+4+4
=
3
1+2+5
=
3
RSoda =
= 2.
(9.36)
= 2.66.
(9.37)
= 3.
(9.38)
= 2.66.
(9.39)
RSoda = min{1, 2, 3} = 1.
(9.40)
RWater
RTea
RCoffee
Least Misery:
302
RWater = min{3, 2, 3} = 2.
RTea = min{1, 4, 4} = 1.
RCoffee = min{1, 2, 5} = 1.
(9.41)
(9.42)
(9.43)
Most Pleasure:
RSoda
RWater
RTea
RCoffee
=
=
=
=
max{1, 2, 3} = 3.
max{3, 2, 3} = 3.
max{1, 4, 4} = 4.
max{1, 2, 5} = 5.
(9.44)
(9.45)
(9.46)
(9.47)
Thus, the first recommended items are tea, water, and coffee based on average
satisfaction, least misery, and most pleasure, respectively.
9.3
In social media, in addition to ratings of products, there is additional information available, such as the friendship network among individuals.
This information can be used to improve recommendations, based on the
assumption that an individuals friends have an impact on the ratings ascribed to the individual. This impact can be due to homophily, influence,
or confounding, discussed in Chapter 8. When utilizing this social information (i.e., social context) we can (1) use friendship information alone, (2)
use social information in addition to ratings, or (3) constrain recommendations using social information. Figure 9.2 compactly represents these three
approaches.
9.3.1
9.3.2
Social information can also be used in addition to a user-item rating matrix to improve recommendation. Addition of social information can be
performed by assuming that users that are connected (i.e., friends) have
similar tastes in rating items. We can model the taste of user Ui using
a k-dimensional vector Ui Rk1 . We can also model items in the kdimensional space. Let V j Rk1 denote the item representation in kdimensional space. We can assume that rating Ri j given by user i to item j
304
can be computed as
Ri j = UiT Vi .
(9.48)
(9.49)
(9.50)
Users often have only a few ratings for items; therefore, the R matrix
is very sparse and has many missing values. Since we compute U and V
only for nonmissing ratings, we can change Equation 9.50 to
n
1 XX
Ii j (Ri j UiT V j )2 ,
min
U,V 2
i=1 j=1
(9.51)
where Ii j {0, 1} and Ii j = 1 when user i has rated item j and is equal to
0 otherwise. This ensures that nonrated items do not contribute to the
summations being minimized in Equation 9.51. Often, when solving this
optimization problem, the computed U and V can estimate ratings for the
already rated items accurately, but fail at predicting ratings for unrated
items. This is known as the overfitting problem. The overfitting problem can Overfitting
be mitigated by allowing both U and V to only consider important features
required to represent the data. In mathematical terms, this is equivalent to
both U and V having small matrix norms. Thus, we can change Equation
9.51 to
n
m
1
2
1 XX
Ii j (Ri j UiT V j )2 + ||U||2F + ||V||2F ,
(9.52)
2 i=1 j=1
2
2
where 1 , 2 > 0 are predetermined constants that control the effects of
matrix norms. The terms 21 ||U||2F and 22 ||V||2F are denoted as regularization
305
terms. Note that to minimize Equation 9.52, we need to minimize all terms
in the equation, including the regularization terms. Thus, whenever one Regularization
Term
needs to minimize some other constraint, it can be introduced as a new
additive term in Equation 9.52. Equation 9.52 lacks a term that incorporates
the social network of users. For that, we can add another regularization
term,
n X
X
sim(i, j)||Ui U j ||2F ,
(9.53)
i=1 jF(i)
where sim(i, j) denotes the similarity between user i and j (e.g., cosine
similarity or Pearson correlation between their ratings) and F(i) denotes
the friends of i. When this term is minimized, it ensures that the taste for
user i is close to that of all his friends j F(i). As we did with previous
regularization terms, we can add this term to Equation 9.51. Hence, our
final goal is to solve the following optimization problem:
n
XX
1 XX
Ii j (Ri j UiT V j )2 +
sim(i, j)||Ui U j ||2F
min
U,V 2
i=1 j=1
i=1 jF(i)
+
2
1
||U||2F + ||V||2F ,
2
2
(9.54)
where is the constant that controls the effect of social network regularization. A local minimum for this optimization problem can be obtained
using gradient-descent-based approaches. To solve this problem, we can
compute the gradient with respect to Ui s and Vi s and perform a gradientdescent-based method.
9.3.3
John
0
1
0
0
1
Joe
1
0
1
0
0
,
A =
(9.57)
0
1
0
1
1
Jill
Jane
0
0
1
0
0
Jorge
1
0
1
0
0
We wish to predict rJill,Mulan . We compute the average ratings and similarity
between Jill and other individuals using cosine similarity:
rJohn =
4+3+2+2
= 2.75.
4
307
(9.58)
5+2+1+5
= 3.25.
4
2+5+0
=
= 2.33.
3
1+3+4+3
=
= 2.75.
4
3+1+1+2
=
= 1.75.
4
rJoe =
(9.59)
rJill
(9.60)
rJane
rJorge
(9.61)
(9.62)
= 0.79.
(9.63)
= 0.50.
(9.64)
= 0.72.
(9.65)
= 0.54.
(9.66)
Considering a neighborhood of size 2, the most similar users to Jill are John
and Jane:
N(Jill) = {John, Jane}.
(9.67)
We also know that friends of Jill are
F(Jill) = {Joe, Jane, Jorge}.
(9.68)
We can use Equation 9.55 to predict the missing rating by taking the intersection of friends and neighbors:
sim(Jill, Jane)(rJane,Mulan rJane )
sim(Jill, Jane)
= 2.33 + (4 2.75) = 3.58.
rJill,Mulan = rJill +
(9.69)
Similarly, we can utilize Equation 9.56 to compute the missing rating. Here,
we take Jills two most similar neighbors: Jane and Jorge.
rJill,Mulan = rJill +
9.4
(9.70)
Evaluating Recommendations
9.4.1
When evaluating the accuracy of predictions, we measure how close predicted ratings are to the true ratings. Similar to the evaluation of supervised
learning, we often predict the ratings of some items with known ratings
(i.e., true ratings) and compute how close the predictions are to the true
ratings. One of the simplest methods, mean absolute error (MAE), computes the average absolute difference between the predicted ratings and
true ratings,
P
ri j ri j |
i j |
,
(9.71)
MAE =
n
where n is the number of predicted ratings, ri j is the predicted rating, and
ri j is the true rating. Normalized mean absolute error (NMAE) normalizes
MAE by dividing it by the range ratings can take,
NMAE =
MAE
,
rmax rmin
(9.72)
where rmax is the maximum rating items can take and rmin is the minimum.
In MAE, error linearly contributes to the MAE value. We can increase this
contribution by considering the summation of squared errors in the root
mean squared error (RMSE):
s
1X
RMSE =
(ri j ri j )2 .
(9.73)
n i, j
309
Example 9.6. Consider the following table with both the predicted ratings and
true ratings of five items:
Item
1
2
3
4
5
Predicted Rating
1
2
3
4
4
True Rating
3
5
3
2
1
1
r
(1 3)2 + (2 5)2 + (3 3)2 + (4 2)2 + (4 1)2
RMSE =
.
5
= 2.28.
MAE =
9.4.2
(9.74)
(9.75)
(9.76)
Nrs
.
Ns
(9.77)
Nrs
.
Nr
(9.78)
We can also combine both precision and recall by taking their harmonic
mean in the F-measure:
2PR
.
(9.79)
F=
P+R
Example 9.7. Consider the following recommendation relevancy matrix for a set
of 40 items. For this table, the precision, recall, and F-measure values are
Relevant
Irrelevant
Total
Selected
9
3
12
Not Selected
15
13
28
Total
24
16
40
9
= 0.75.
12
9
R =
= 0.375.
24
2 0.75 0.375
F =
= 0.5.
0.75 + 0.375
P =
9.4.3
(9.80)
(9.81)
(9.82)
Often, we predict ratings for multiple products for a user. Based on the
predicted ratings, we can rank products based on their levels of interestingness to the user and then evaluate this ranking. Given the true ranking
of interestingness of items, we can compare this ranking with it and report
a value. Rank correlation measures the correlation between the predicted
ranking and the true ranking. One such technique is the Spearmans rank
311
Kendalls Tau
yi < y j
or xi < x j ,
yi > y j .
(9.85)
cd
n .
(9.86)
Kendalls tau takes value in range [1, 1]. When the ranks completely
agree, all pairs are concordant and Kendalls tau takes value 1, and when
the ranks completely disagree, all pairs are discordant and Kendalls tau
takes value 1.
Example 9.8. Consider a set of four items I = {i1 , i2 , i3 , i4 } for which the predicted
and true rankings are as follows:
i1
i2
i3
i4
Predicted Rank
1
2
3
4
312
True Rank
1
4
2
3
:
:
:
:
:
:
concordant
concordant
concordant
discordant
discordant
concordant
(9.87)
(9.88)
(9.89)
(9.90)
(9.91)
(9.92)
42
= 0.33.
6
313
(9.93)
9.5
Summary
314
9.6
Bibliographic Notes
General references for the content provided in this chapter can be found in
[138, 237, 249, 5]. In social media, recommendation is utilized for various
items, including blogs [16], news [177, 63], videos [66], and tags [257]. For
example, YouTube video recommendation system employs co-visitation
counts to compute the similarity between videos (items). To perform
recommendations, videos with high similarity to a seed set of videos are
recommended to the user. The seed set consists of the videos that users
watched on YouTube (beyond a certain threshold), as well as videos that
are explicitly favorited, liked, rated, or added to playlists.
Among classical techniques, more on content-based recommendation
can be found in [226], and more on collaborative filtering can be found
in [268, 246, 248]. Content-based and CF methods can be combined into
hybrid methods, which are not discussed in this chapter. A survey of hybrid
methods is available in [48]. More details on extending classical techniques
to groups are provided in [137].
When making recommendations using social context, we can use additional information such as tags [116, 254] or trust [102, 222, 190, 180]. For
instance, in [272], the authors discern multiple facets of trust and apply
multifaceted trust in social recommendation. In another work, Tang et
al. [273] exploit the evolution of both rating and trust relations for social
recommendation. Users in the physical world are likely to ask for suggestions from their local friends while they also tend to seek suggestions from
users with high global reputations (e.g., reviews by vine voice reviewers
of Amazon.com). Therefore, in addition to friends, one can also use global
network information for better recommendations. In [274], the authors
exploit both local and global social relations for recommendation.
When recommending people (potential friends), we can use all these
types of information. A comparison of different people recommendation
techniques can be found in the work of Chen et al. [52]. Methods that
extend classical techniques with social context are discussed in [181, 182,
152].
315
9.7
Exercises
Newton
Einstein
Gauss
Aristotle
Euclid
God
3
5
1
3
2
Le Cercle
Rouge
0
4
2
?
2
Cidade
de Deu
3
0
4
1
0
La vita
e bella
2
3
0
2
5
Rashomon
3
2
2
0
1
ru
1.5
Newton
0.76
Einstein
?
Gauss
0.40
Euclid
0.78
P P
5. In Equation 9.54, the term ni=1 jF(i) sim(i, j)||Ui U j ||2F is added to
model the similarity between friends tastes. Let T Rnn denote
the pairwise trust matrix, in which 0 Ti j 1 denotes how much
user i trusts user j. Using your intuition on how trustworthiness
of individuals should affect recommendations received from them,
modify Equation 9.54 using trust matrix T.
317
318
Chapter
10
Behavior Analytics
What motivates individuals to join an online group? When individuals
abandon social media sites, where do they migrate to? Can we predict
box office revenues for movies from tweets posted by individuals? These
questions are a few of many whose answers require us to analyze or predict
behaviors on social media.
Individuals exhibit different behaviors in social media: as individuals
or as part of a broader collective behavior. When discussing individual behavior, our focus is on one individual. Collective behavior emerges when a
population of individuals behave in a similar way with or without coordination or planning.
In this chapter we provide examples of individual and collective behaviors and elaborate techniques used to analyze, model, and predict these
behaviors.
10.1
Individual Behavior
We read online news; comment on posts, blogs, and videos; write reviews
for products; post; like; share; tweet; rate; recommend; listen to music;
and watch videos, among many other daily behaviors that we exhibit on
social media. What are the types of individual behavior that leave a trace
on social media?
We can generally categorize individual online behavior into three categories (shown in Figure 10.1):
319
10.1.1
Diminishing
Returns
Causality
Testing
Granger
Causality
t
X
i=1
t
X
ai Yi + 1 ,
ai Yi +
i=1
t
X
(10.1)
bi Xi + 2 ,
(10.2)
i=1
10.1.2
Similar to network models, models of individual behavior can help concretely describe why specific individual behaviors are observed in social
media. In addition, they allow for controlled experiments and simulations
that can help study individuals in social media.
As with other modeling approaches (see Chapter 4), in behavior modeling, one must make a set of assumptions. Behavior modeling can be
performed via a variety of techniques, including those from economics,
game theory, or network science. We discussed some of these techniques
in earlier chapters. We review them briefly here, and refer interested readers to the respective chapters for more details.
Threshold models (Chapter 8). When a behavior diffuses in a network, such as the behavior of individuals buying a product and
referring it to others, one can use threshold models. In threshold
models, the parameters that need to be learned are the node activation threshold i and the influence probabilities wi j . Consider
the following methodology for learning these values. Consider a
merchandise store where the store knows the connections between
individuals and their transaction history (e.g., the items that they
have bought). Then, wi j can be defined as the
fraction of times user i buys a product and
user j buys the same product soon after that
326
10.1.3
like new edges, new nodes can be introduced in social networks; therefore,
G[t2 , t02 ] may contain nodes not present in G[t1 , t01 ]. Hence, a link prediction
algorithm is generally constrained to predict edges only for pairs of nodes
that are present during the training period. One can add extra constraints
such as predicting links only for nodes that are incident to at least k edges
(i.e., have degree greater or equal to k) during both testing and training
intervals.
Let G(Vtrain , Etrain ) be our training graph. Then, a link prediction algorithm generates a sorted list of most probable edges in Vtrain Vtrain Etrain .
The first edge in this list is the one the algorithm considers the most likely
to soon appear in the graph. The link prediction algorithm assigns a score
(x, y) to every edge e(x, y) in Vtrain Vtrain Etrain . Edges sorted by this value
in decreasing order will create our ranked list of predictions. (x, y) can be
predicted based on different techniques. Note that any similarity measure
between two nodes can be used for link prediction; therefore, methods
discussed in Chapter 3 are of practical use here. We outline some of the
most well-established techniques for computing (x, y) here.
Node Neighborhood-Based Methods
The following methods take advantage of neighborhood information to
compute the similarity between two nodes.
Common Neighbors. In this method, one assumes that the more
common neighbors that two nodes share, the more similar they are.
Let N(x) denote the set of neighbors of node x. This method is
formulated as
(x, y) = |N(x) N(y)|.
(10.3)
Jaccard Similarity. This commonly used measure calculates the likelihood of a node that is a neighbor of either x or y to be a common
neighbor. It can be formulated as the number of common neighbors
divided by the total number of neighbors of either x or y:
(x, y) =
|N(x) N(y)|
.
|N(x) N(y)|
(10.4)
Adamic and Adar Measure. A similar measure to Jaccard, this measure was introduced by Lada Adamic and Eytan Adar [2003]. The
328
(10.6)
Example 10.1. For the graph depicted in Figure 10.5, the similarity between
nodes 5 and 7 based on different neighborhood-based techniques is
(Common Neighbor) (5, 7) = |{4, 6} {4}| = 1
|{4, 6} {4}| 1
(Jaccard) (5, 7) =
=
|{4, 6} {4}| 2
1
1
=
(Adamic and Adar) (5, 7) =
log |{5, 6, 7}| log 3
(Preferential Attachment) (5, 7) = |{4}| |{4, 6}| = 1 2 = 2
329
(10.7)
(10.8)
(10.9)
(10.10)
l |paths<l>
x,y |,
(10.11)
l=1
where |paths<l>
x,y | denotes the number of paths of length l between x and
y. is a constant that exponentially damps longer paths. Note that a
very small results in a common neighbor measure (see Exercises).
Similar to our finding in Chapter 3, one can find the Katz similarity
measure in a closed form by (I A)1 I. The Katz measure can also
330
(10.13)
Hitting time is not symmetric, and in general, Hx,y , H y,x . Thus, one
can introduce the commute time to mitigate this issue:
(x, y) = (Hx,y + H y,x ).
(10.14)
(10.15)
10.2
Collective Behavior
10.2.1
The result, however, when all these analyses are put together would be an
expected behavior for a large population. The user migration behavior we
discuss in this section is an example of this type of analysis of collective
behavior.
One can also analyze the population as a whole. In this case, an individuals opinion or behavior is rarely important. In general, the approach
is the same as analyzing an individual, with the difference that the content
and links are now considered for a large community. For instance, if we are
analyzing 1,000 nodes, one can combine these nodes and edges into one
hyper-node, where the hyper-node is connected to all other nodes in the
graph to which its members are connected and has an internal structure
(subgraph) that details the interaction among its members. This approach
is unpopular for analyzing collective behavior because it does not consider
specific individuals and at times, interactions within the population. Interested readers can refer to the bibliographic notes for further references
that use this approach to analyze collective behavior. On the contrary, this
approach is often considered when predicting collective behavior, which
is discussed later in this chapter.
User Migration in Social Media
Users often migrate from one site to another for different reasons. The main
rationale behind it is that users have to select some sites over others due
to their limited time and resources. Moreover, social medias networking
often dictates that one cannot freely choose a site to join or stay. An
individuals decision is heavily influenced by his or her friends, and vice
versa. Sites are often interested in keeping their users, because they are
valuable assets that help contribute to their growth and generate revenue
by increased traffic. There are two types of migration that take place in
social media sites: site migration and attention migration.
1. Site Migration. For any user who is a member of two sites s1 and s2
at time ti , and is only a member of s2 at time t j > ti , then the user is
said to have migrated from site s1 to site s2 .
2. Attention Migration. For any user who is a member of two sites s1
and s2 and is active at both at time ti , if the user becomes inactive on
s1 and remains active on s2 at time t j > ti , then the users attention is
said to have migrated away from site s1 and toward site s2 .
333
user has uploaded. One can normalize this value by its maximum in the
site (e.g., the maximum number of videos any user has uploaded) to get
an activity measure in the range [0,1]. If a user is allowed to have multiple
activities on a site, as in posting comments and liking videos, then a linear
com- bination of these measures can be used to describe user activity on a
site.
User network size can be easily measured by taking the number of
friends a user has on the site. It is common for social media sites to
facilitate the addition of friends. The number of friends can be normalized
in the range [0,1] by the maximum number of friends one can have on the
site.
Finally, user rank is how important a user is on the site. Some sites
explicitly provide their users prestige rank list (e.g., top 100 bloggers),
whereas for others, one needs to approximate a users rank. One way
of approximating it is to count the number of citations (in-links) an individual is receiving from others. A practical technique is to perform this
via web search engines. For instance, user test on StumbleUpon has
http://test.stumpleupon.com as his profile page. A Google search for
link:http://test.stumbleupon.com provides us with the number of inlinks to the profile on StumbleUpon and can be considered as a ranking
measure for user test.
These three features are correlated with the site attention migration
behavior and one expects changes in them when migrations happen.
Feature-Behavior Association
Given two snapshots of a network, we know if users migrated or not. We
can also compute the values for the aforementioned features. Hence, we
can determine the correlation between features and migration behavior.
Let vector Y Rn indicate whether any of our n users have migrated or
not. Let Xt R3n be the features collected (activity, friends, rank) for any
one of these users at time stamp t. Then, the correlation between features
Xt and labels Y can be computed via logistic regression. How can we verify
that this correlation is not random? Next, we discuss how we verify that
this correlation is statistically significant.
336
Evaluation Strategy
To verify if the correlation between features and the migration behavior is
not random, we can construct a random set of migrating users and compute
$X_{\text{Random}}$ and $Y_{\text{Random}}$ for them as well. These can be
obtained by shuffling the rows of the original $X_t$ and $Y$. Then, we perform logistic regression
on these new variables. This approach is very similar to the shuffle test
presented in Chapter 8. The idea is that if some behavior creates a change
in features, then other random behaviors should not create that drastic a
change. So, the observed correlation between features and the behavior
should be significantly different in both cases. The correlation can be
described in terms of logistic regression coefficients, and the significance
can be measured via any significance testing methodology. For instance,
we can employ the 2 -statistic,
2 -statistic
n
X
(Ai Ri )2
,
=
R
i
i=1
2
(10.17)
where n is the number of logistic regression coefficients, Ai s are the coefficients determined using the original dataset, and Ri s are the coefficients
obtained from the random dataset.
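Continuing the synthetic setup above, a minimal sketch of this check is shown below: permute the labels (equivalently, shuffle the rows of $X_t$ relative to $Y$), refit the logistic regression, and compare the two coefficient vectors with the statistic of Equation 10.17. The data, the small epsilon guard, and the absolute value in the denominator are illustrative assumptions, not part of the original formulation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def chi_square(A, R, eps=1e-12):
    # Equation 10.17; |R_i| + eps in the denominator is only a numerical
    # guard against division by zero, not part of the original statistic.
    A, R = np.asarray(A, float), np.asarray(R, float)
    return np.sum((A - R) ** 2 / (np.abs(R) + eps))

rng = np.random.default_rng(1)
n = 1000
X_t = rng.random((n, 3))                               # (activity, friends, rank)
Y = (X_t[:, 0] + 0.3 * rng.random(n) > 0.8).astype(int)

A = LogisticRegression().fit(X_t, Y).coef_[0]          # original labels
Y_random = rng.permutation(Y)                          # shuffled labels
R = LogisticRegression().fit(X_t, Y_random).coef_[0]   # random baseline

print("chi-square between original and shuffled coefficients:", chi_square(A, R))
```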
10.2.2 Collective Behavior Modeling
Consider a hypothetical model that can simulate voters who cast ballots in
elections. Such a model can help predict an election's turnout rate as an
outcome of the collective behavior of voting and help governments prepare
logistics accordingly. This is an example of collective behavior modeling,
which improves our understanding of the collective behaviors that take place
by providing concrete explanations for them.
Collective behavior can be conveniently modeled using some of the techniques
discussed in Chapter 4, Network Models. As with collective behavior, network
models are expressed in terms of characteristics observable in the population.
For instance, when a power-law degree distribution is required, the
preferential attachment model is preferred, and when a small average
shortest-path length is desired, the small-world model is the method of
choice. Because node properties rarely play a role in network models, these
models are reasonable choices for modeling collective behavior.
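As a quick illustration, the sketch below generates both kinds of models with the networkx library; the sizes and parameters are arbitrary choices made only for demonstration.

```python
import networkx as nx

# Preferential attachment: produces a heavy-tailed (power-law-like) degree
# distribution, the property one usually wants it to reproduce.
ba = nx.barabasi_albert_graph(n=2000, m=3, seed=42)

# Small-world model: short average shortest paths with high clustering.
ws = nx.watts_strogatz_graph(n=2000, k=10, p=0.1, seed=42)

print("BA maximum degree:", max(d for _, d in ba.degree()))
print("WS average clustering:", nx.average_clustering(ws))
print("WS average path length:", nx.average_shortest_path_length(ws))
```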
10.2.3 Collective Behavior Prediction
Collective behavior can be predicted using methods we discussed in Chapters 7
and 8. For instance, epidemic models can predict the effect of a disease on
a population and the behavior that the population will exhibit over time.
Similarly, implicit influence models such as the LIM model discussed in
Chapter 8 can estimate the influence of individuals based on collective behavior attributes, such as the size of the population adopting an innovation
at any time.
As noted earlier, collective behavior can be analyzed either in terms of
individuals performing the collective behavior or based on the population
as a whole. When predicting collective behavior, it is more common to
consider the population as a whole and aim to predict some phenomenon.
This simplifies the challenges and reduces the computation dramatically,
since the number of individuals who perform a collective behavior is often
large and analyzing them one at a time is cumbersome.
In general, when predicting collective behavior, we are interested in
predicting the intensity of a phenomenon that arises from the collective
behavior of the population (e.g., how many people will vote?). To perform this
prediction, we adopt a data mining approach in which features that describe
the population well are used to predict a response variable (i.e., the
intensity of the phenomenon). A training-testing framework or correlation
analysis is then used to assess how well the predictions generalize and how
accurate they are. We discuss this collective behavior prediction strategy
through the following example, which demonstrates how
the collective behavior of individuals on social media can be utilized to
predict real-world outcomes.
Predicting Box Office Revenue for Movies
Can we predict opening-weekend revenue for a movie from its prerelease
chatter among fans? This tempting goal of predicting the future has been
around for many years. The goal is to predict the collective behavior
of watching a movie by a large population, which in turn determines the
revenue for the movie. One can design a methodology to predict box office
revenue for movies that uses Twitter and the aforementioned collective
behavior prediction strategy. To summarize, the strategy is as follows:
1. Set the target variable that is being predicted. In this case, it is the
revenue that a movie produces. Note that the revenue is the direct
result of the collective behavior of going to the theater to watch the
movie.
2. Determine the features in the population that may affect the target
variable.
3. Predict the target variable using a supervised learning approach,
utilizing the features determined in step 2.
4. Measure performance using supervised learning evaluation.
One can use the population that is discussing the movie on Twitter
before its release to predict its opening-weekend revenue. The target variable is the amount of revenue. In fact, utilizing only eight features, one
can predict the revenue with high accuracy. These features are the average
hourly number of tweets related to the movie for each of the seven days
prior to the movie opening (seven features) and the number of opening
theaters for the movie (one feature). Using only these eight features, training data for some movies (their seven-day tweet rates, their number of
opening theaters, and their revenue), and a linear regression model, one
can predict the movie's opening-weekend revenue with high correlation.
Researchers have shown (see Bibliographic Notes) that predictions made with
this approach are closer to reality than those of the Hollywood Stock
Exchange (HSX), the gold standard for predicting movie revenues.
This simple model for predicting movie revenue can be easily extended
to other domains. For instance, assume we are planning to predict another
collective behavior outcome, such as the number of individuals who aim
to buy a product. In this case, the target variable y is the number of
individuals who will buy the product. Similar to tweet rate, we require
some feature A that denotes the attention the product is receiving. We also
need to model the publicity of the product P. In our example, this was
the number of theaters for the movie; for a product, it could represent the
number of stores that sell it. A simple linear regression model can help
learn the relation between these features and the target variable:
$$y = w_1 A + w_2 P + \epsilon, \qquad (10.18)$$
where $\epsilon$ is the regression error. As in our movie example, one
attempts to extract the values of $A$ and $P$ from social media.
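A least-squares fit of Equation 10.18 can be sketched as follows; the tweet-rate, theater, and revenue numbers are fabricated placeholders, not data from the studies cited in the Bibliographic Notes.

```python
import numpy as np

# One row per item. A = attention (e.g., average hourly tweet rate before
# release), P = publicity (e.g., number of opening theaters), y = outcome
# (e.g., opening-weekend revenue). All values are fabricated placeholders.
A = np.array([120.0, 40.0, 300.0, 15.0, 80.0])
P = np.array([2500.0, 900.0, 3400.0, 300.0, 1800.0])
y = np.array([35e6, 8e6, 90e6, 1.5e6, 20e6])

# Fit y = w1*A + w2*P (Equation 10.18, dropping the error term) by least squares.
X = np.column_stack([A, P])
(w1, w2), *_ = np.linalg.lstsq(X, y, rcond=None)
print("w1 (attention weight):", w1, " w2 (publicity weight):", w2)

# Predicted outcome for a new item with tweet rate 150 and 2,800 theaters.
print("prediction:", w1 * 150 + w2 * 2800)
```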
10.3
Summary
Individuals exhibit different behaviors in social media, which can be categorized into individual and collective behavior. Individual behavior
is the behavior that an individual targets toward (1) another individual
(individual-individual behavior), (2) an entity (individual-entity behavior), or
(3) a community (individual-community behavior). We discussed how to analyze and predict individual behavior. To analyze individual behavior,
we outlined a four-step procedure as a guideline. First, the behavior under
study should be clearly observable on social media. Second, one needs to
design meaningful features that are correlated with the behavior taking place
in social media. The third step is to find correlations and relationships
between the features and the behavior. The final step is to verify the
relationships that are found. We discussed community joining as an example of
individual behavior. Modeling individual behavior can be performed via cascade
or threshold models. Behaviors commonly result in interactions in the form of
links; therefore, link prediction techniques are effective in predicting
behavior. We discussed neighborhood-based and path-based techniques for link
prediction.
Collective behavior occurs when a group of individuals, with or without
coordination, acts in an aligned manner. Collective behavior is analyzed
either by analyzing individuals and aggregating the results or by analyzing
the population collectively. When analyzed collectively, one commonly looks at
the general patterns of the population. We discussed user migrations in social
media as an example of collective behavior analysis. Modeling collective
behavior can be performed via network models, and prediction is possible
by using population properties to predict an outcome. Predicting movie
box-office revenues from population properties, such as the rate at which
individuals tweet about a movie, was given as an example that demonstrates the
effectiveness of this approach.
It is important to evaluate behavior analytics findings to ensure that
these findings are not due to externalities. We discussed causality testing, randomization tests, and supervised learning evaluation techniques
for evaluating behavior analytics findings. However, depending on the
context, researchers may need to devise other informative techniques to
ensure the validity of the outcomes.
10.4
Bibliographic Notes
polls. Their results are highly correlated with Gallup opinion polls for
presidential job approval. In [1], the authors analyze collective social
media data and show that by carefully selecting data from social media,
it is possible to use social media as a lens to analyze and even predict
real-world events.
10.5
Exercises
Individual Behavior
1.
2. Consider the behavior of commenting under a blog post in social media. Follow the four steps of behavior analysis to analyze this behavior.
3. We emphasized selecting meaningful features for analyzing a behavior. Discuss a methodology to verify if the selected features carry
enough information with respect to the behavior being analyzed.
4. Correlation does not imply causality. Discuss how this fact relates to
most of the datasets discussed in this chapter being temporal.
5. Using a neighborhood-based link prediction method, compute the top two most likely edges for the following figure.
6. Compute the most likely edge for the following figure for each path-based link prediction technique.
7. In a link prediction problem, show that for small $\beta$, the Katz similarity measure, $\sigma(u, v) = \sum_{\ell=1}^{\infty} \beta^{\ell}\,|\mathrm{path}^{\langle \ell \rangle}_{u,v}|$, reduces to the common-neighbors measure, $\sigma(u, v) = |N(u) \cap N(v)|$.
8. Provide the matrix format for rooted PageRank and SimRank techniques.
Collective Behavior
9. Recent research has shown that social media can help replicate survey
results for elections and ultimately predict presidential election outcomes. Discuss what possible features can help predict a presidential
election.
Bibliography
[1] Mohammad Ali Abbasi, Sun-Ki Chai, Huan Liu, and Kiran Sagoo, Real-world behavior analysis through a social media lens, Social Computing, Behavioral-Cultural Modeling and Prediction, Springer, 2012, pp. 18–26.
[2] J. Abello, M. Resende, and S. Sudarsky, Massive quasi-clique detection.,
LATIN 2002: Theoretical Informatics (2002), 598612.
[3] E. Abrahamson and L. Rosenkopf, Institutional and competitive bandwagons: Using mathematical modeling as a tool to explore innovation
diffusion., Academy of management review (1993), 487517.
[4] L.A. Adamic and E. Adar, Friends and neighbors on the web., Social
Networks 25 (2003), no. 3, 211230.
[5] Gediminas Adomavicius and Alexander. Tuzhilin, Toward the next
generation of recommender systems: a survey of the state-of-the-art and
possible extensions., IEEE Transactions on Knowledge and Data Engineering 17 (2005), no. 6, 734749.
[6] N. Agarwal, H. Liu, L. Tang, and P.S. Yu, Identifying the influential
bloggers in a community, Proceedings of the International Conference
on Web Search and Web Data Mining, ACM, 2008, pp. 207218.
[7] R.K. Ahuja, T.L. Magnanti, J.B. Orlin, and K. Weihe, Network flows:
theory, algorithms and applications., ZOR-Methods and Models of Operations Research 41 (1995), no. 3, 252254.
, Information cascades in the laboratory., The American Economic Review (1997), 847862.
[18] Sitaram Asur and Bernardo A. Huberman, Predicting the future with social media, IEEE International Conference on Web Intelligence and Intelligent Agent Technology, vol. 1, IEEE, 2010, pp. 492–499.
[19] L. Backstrom, D. Huttenlocher, J.M. Kleinberg, and X. Lan, Group
formation in large social networks: membership, growth, and evolution,
Proceedings of the 12th ACM SIGKDD international conference on
Knowledge Discovery and Data Mining, ACM, 2006, pp. 4454.
[20] L. Backstrom, E. Sun, and C. Marlow, Find me if you can: improving
geographical prediction with social and spatial proximity., Proceedings of
the 19th International conference on World Wide Web, ACM, 2010,
pp. 6170.
[21] N.T.J. Bailey, The mathematical theory of infectious diseases and its applications., Charles Griffin & Company Ltd, 1975.
[22] E. Bakshy, J.M. Hofman, W.A. Mason, and D.J. Watts, Everyones an
influencer: quantifying influence on twitter, Proceedings of the fourth
ACM international conference on Web Search and Data Mining,
ACM, 2011, pp. 6574.
[23] A.V. Banerjee, A simple model of herd behavior., The Quarterly Journal
of Economics 107 (1992), no. 3, 797817.
[24] A.L. Barabasi and R. Albert, Emergence of scaling in random networks.,
science 286 (1999), no. 5439, 509512.
[25] Geoffrey Barbier, Zhuo Feng, Pritam Gundecha, and Huan. Liu, Morgan & Claypool Publishers, 2013.
[26] Geoffrey Barbier, Reza Zafarani, Huiji Gao, Gabriel Fung, and Huan.
Liu, Maximizing benefits from crowdsourced data., Computational and
Mathematical Organization Theory 18 (2012), no. 3, 257279.
[27] S.J. Barnes and E. Scornavacca, Mobile marketing: the role of permission
and acceptance., International Journal of Mobile Communications 2
(2004), no. 2, 128139.
[28] Alain Barrat, Marc Barthelemy, and Alessandro. Vespignani, Dynamical processes on complex networks., vol. 1, Cambridge University
Press, 2008.
[40]
2006.
[43] J.A. Bondy and U.S.R. Murty, Graph theory with applications., vol. 290,
MacMillan London, 1976.
[44] Stephen Poythress Boyd and Lieven. Vandenberghe, Convex optimization., Cambridge University Press, 2004.
[45] Ulrik. Brandes, A faster algorithm for betweenness centrality., Journal of
Mathematical Sociology 25 (2001), no. 2, 163177.
[46] Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan,
Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet.
Wiener, Graph structure in the web., Computer networks 33 (2000),
no. 1, 309320.
[47] Alan. Bryman, Social research methods., Oxford University Press, 2012.
[48] Robin. Burke, Hybrid recommender systems: Survey and experiments.,
User Modeling and User-Adapted Interaction 12 (2002), no. 4, 331
370.
[49] K. Selcuk Candan and Maria Luisa. Sapino, Data management for
multimedia retrieval., Cambridge University Press, 2010.
[50] M. Cha, H. Haddadi, F. Benevenuto, and K.P. Gummadi, Measuring
user influence in twitter: The million follower fallacy, AAAI Conference
on Weblogs and Social Media, vol. 14, 2010, p. 8.
[51] Soumen. Chakrabarti, Mining the Web: discovering knowledge from
hypertext data., Morgan Kaufmann, 2003.
[52] Jilin Chen, Werner Geyer, Casey Dugan, Michael Muller, and Ido.
Guy, Make new friends, but keep the old: recommending people on social
networking sites., Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems, ACM, 2009, pp. 201210.
[53] S. Chinese and et al., Molecular evolution of the sars coronavirus during
the course of the sars epidemic in china., Science 303 (2004), no. 5664,
1666.
[54] N.A. Christakis and J.H. Fowler, The spread of obesity in a large social
network over 32 years., New England Journal of Medicine 357 (2007),
no. 4, 370379.
[55] Nicholas A. Christakis and James H. Fowler, Connected: The surprising power of our social networks and how they shape our lives., Norsk
epidemiologi= Norwegian Journal of Epidemiology 19 (2009), no. 1,
5.
[56] F.R.K. Chung, Spectral graph theory., no. 92, American Mathematical
Society, 1997.
[57] R. B. Cialdini and M. R. Trost., Social influence: Social norms, conformity
and compliance., (1998), 151.
[58] Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ. Newman, Powerlaw distributions in empirical data., SIAM Review 51 (2009), no. 4, 661
703.
[59] J.S. Coleman, E. Katz, and H. Menzel, Medical innovation: a diffusion
study., Bobbs-Merrill Company, 1966.
[60] R. Cont and J.P. Bouchaud, Herd behavior and aggregate fluctuations in
financial markets., Macroeconomic Dynamics 4 (2000), no. 02, 170196.
[61] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and
Clifford. Stein, Introduction to algorithms., 2009.
[62] S. Currarini, M.O. Jackson, and P. Pin, An economic model of friendship:
Homophily, minorities, and segregation., Econometrica 77 (2009), no. 4,
10031045.
[63] Abhinandan S. Das, Mayur Datar, Ashutosh Garg, and Shyam. Rajaram, Google news personalization: scalable online collaborative filtering.,
Proceedings of the 16th international conference on World Wide Web,
ACM, 2007, pp. 271280.
[64] Manoranjan Dash and Huan. Liu, Feature selection for classification.,
Intelligent Data Analysis 1 (1997), no. 3, 131156.
[65]
[66] James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He, Mike Lambert, Blake
Livingston, and et al., The youtube video recommendation system., Proceedings of the fourth ACM conference on Recommender Systems,
ACM, 2010, pp. 293296.
[67] D.L. Davies and D.W. Bouldin, A cluster separation measure., IEEE
Transactions on Pattern Analysis and Machine Intelligence (1979),
no. 2, 224227.
[68] D.C. Des Jarlais, S.R. Friedman, J.L. Sotheran, J. Wenston, M. Yancovitz Marmor, Frank S.R., Beatrice B., and Mildvan D. S., Continuity
and change within an hiv epidemic., JAMA: the journal of the American
Medical Association 271 (1994), no. 2, 121127.
[69] A. Devenow and I. Welch, Rational herding in financial economics.,
European Economic Review 40 (1996), no. 3, 603615.
[70] H. Dia, An object-oriented neural network approach to short-term traffic
forecasting., European Journal of Operational Research 131 (2001),
no. 2, 253261.
[71] R. Diestel, Graph theory. 2005., Graduate Texts in Math (2005).
[72] K. Dietz, Epidemics and rumours: a survey., Journal of the Royal Statistical Society. Series A (General) (1967), 505528.
[73] P.S. Dodds and D.J. Watts, Universal behavior in a generalized model of
contagion., Physical Review Letters 92 (2004), no. 21, 218701.
[74] M. Drehmann, J. Oechssler, and A. Roider, Herding and contrarian
behavior in financial markets an internet experiment., (2005).
[75] Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern classification., Wiley-interscience, 2012.
[76] J.C. Dunn, Well-separated clusters and optimal fuzzy partitions., Journal
of cybernetics 4 (1974), no. 1, 95104.
[77] C. Dye and N. Gay, Modeling the sars epidemic., Science 300 (2003),
no. 5627, 18841885.
[78] D. Easley and J.M. Kleinberg, Networks, crowds, and markets., Cambridge University Press, 2010.
[79] R.C. Eberhart, Y. Shi, and J. Kennedy, Swarm intelligence., Elsevier,
2001.
[80] J. Edmonds and R.M. Karp, Theoretical improvements in algorithmic
efficiency for network flow problems., Journal of the ACM (JACM) 19
(1972), no. 2, 248264.
[81] Nicole B. Ellison and et al., Social network sites: definition, history,
and scholarship., Journal of Computer-Mediated Communication 13
(2007), no. 1, 210230.
[82] A.P. Engelbrecht, Fundamentals of computational swarm intelligence.,
Recherche 67 (2005), 02.
[83] P. Erdos and A. Renyi, On random graphs., Publicationes Mathematicae Debrecen 6 (1959), 290297.
[84]
[85]
, On the strength of connectedness of a random graph., Acta Mathematica Hungarica 12 (1961), no. 1, 261267.
[97] Huiji Gao, Jiliang Tang, and Huan. Liu, gscorr: modeling geo-social
correlations for new check-ins on location-based social networks., Proceedings of the 21st ACM international conference on Information and
Knowledge Management, ACM, 2012, pp. 15821586.
[98] Huiji Gao, Xufei Wang, Geoffrey Barbier, and Huan. Liu, Promoting coordination for disaster relief from crowdsourcing to coordination.,
Social Computing, Behavioral-Cultural Modeling and Prediction,
Springer, 2011, pp. 197204.
[99] D. Gibson, R. Kumar, and A. Tomkins, Discovering large dense subgraphs in massive graphs, Proceedings of the International Conference on Very Large Data Bases, VLDB Endowment, 2005, pp. 721–732.
[100] E.N. Gilbert, Random graphs., The Annals of Mathematical Statistics
30 (1959), no. 4, 11411144.
[101] M. Girvan and M.E.J. Newman, Community structure in social and
biological networks., Proceedings of the National Academy of Sciences
99 (2002), no. 12, 7821.
[102] Jennifer Golbeck and James. Hendler, Filmtrust: movie recommendations using trust in web-based social networks., Proceedings of the IEEE
Consumer Communications and Networking Conference, vol. 96,
Citeseer, 2006.
[103] A.V. Goldberg and R.E. Tarjan, A new approach to the maximum-flow
problem., Journal of the ACM (JACM) 35 (1988), no. 4, 921940.
[104] B. Golub and M.O. Jackson, Naive learning in social networks and the
wisdom of crowds., American Economic Journal: Microeconomics 2
(2010), no. 1, 112149.
[105] M.F. Goodchild and J.A. Glennon, Crowdsourcing geographic information for disaster response: a research frontier., International Journal of
Digital Earth 3 (2010), no. 3, 231241.
[106] L. Goodman and W. Kruskal, Measures of associations for crossvalidations., Journal of the American Statistical Association 49, 732
764.
[107] A. Goyal, F. Bonchi, and L.V.S. Lakshmanan, Learning influence probabilities in social networks, Proceedings of the Third ACM international
conference on Web Search and Data Mining, ACM, 2010, pp. 241250.
[108] M. Granovetter, Threshold models of collective behavior., American Journal of Sociology (1978), 14201443.
[109] M.S. Granovetter, The strength of weak ties., American Journal of Sociology (1973), 13601380.
[110] V. Gray, Innovation in the states: a diffusion study., The American
Political Science Review 67 (1973), no. 4, 11741185.
[111] Z. Griliches, Hybrid corn: an exploration in the economics of technological
change., Econometrica, Journal of the Econometric Society (1957),
501522.
[112] D. Gruhl, R. Guha, D. Liben-Nowell, and A. Tomkins, Information
diffusion through blogspace, Proceedings of the 13th international conference on the World Wide Web, ACM, 2004, pp. 491501.
[113] Y. Guan, H. Chen, K.S. Li, S. Riley, G.M. Leung, R. Webster, J.S.M.
Peiris, and K.Y. Yuen, A model to control the epidemic of h5n1 influenza
at the source., BMC Infectious Diseases 7 (2007), no. 1, 132.
[114] Pritam Gundecha, Geoffrey Barbier, and Huan. Liu, Exploiting Vulnerability to Secure User Privacy on a Social Networking Site., Proceedings of the 17th ACM SIGKDD international conference on Knowledge Discovery and Data Mining, KDD, 2011, pp. 511519.
[115] Pritam Gundecha and Huan. Liu, Mining social media: a brief introduction., Tutorials in Operations Research 1 (2012), no. 4.
[116] Ido Guy, Naama Zwerdling, Inbal Ronen, David Carmel, and Erel.
Uziel, Social media recommendation based on people and tags., Proceedings of the 33rd international ACM SIGIR conference on Research
and Development in Information Retrieval, ACM, 2010, pp. 194201.
[117] Isabelle. Guyon, Feature extraction: foundations and applications., vol.
207, Springer, 2006.
[118] T. Hagerstrand et al., Innovation diffusion as a spatial process (1968).
[119] R.L. Hamblin, R.B. Jacobsen, and J.L.L. Miller, A mathematical theory
of social change., Wiley, 1973.
[120] Jiawei Han, Micheline Kamber, and Jian. Pei, Data mining: concepts
and techniques., Morgan Kaufmann, 2006.
[121] M.S. Handcock, A.E. Raftery, and J.M. Tantrum, Model-based clustering for social networks., Journal of the Royal Statistical Society: Series
A (Statistics in Society) 170 (2007), no. 2, 301354.
[122] P.E. Hart, N.J. Nilsson, and B. Raphael, A formal basis for the heuristic
determination of minimum cost paths., Systems Science and Cybernetics, IEEE Transactions on 4 (1968), no. 2, 100107.
[123] Simon. Haykin, Neural networks: a comprehensive foundation., Prentice
Hall, 1994.
[124] H.W. Hethcote, A thousand and one epidemic models., Lecture Notes in
Biomathematics (1994), 504504.
[125]
[126] H.W. Hethcote, H.W. Stech, and P. van den Driessche, Periodicity
and stability in epidemic models: a survey., Differential Equations and
Applications in Ecology, Epidemics and Population Problems (SN
Busenberg and KL Cooke, eds.) (1981), 6582.
[127] E.C. Hirschman, Innovativeness, novelty seeking, and consumer creativity., Journal of Consumer Research (1980), 283295.
[128] D. Hirshleifer, Informational cascades and social conventions., University
of Michigan Business School Working Paper No. 9705-10 (1997).
[129] P.D. Hoff, A.E. Raftery, and M.S. Handcock, Latent space approaches to
social network analysis., Journal of the American Statistical Association
97 (2002), no. 460, 10901098.
[130] J. Hopcroft and R. Tarjan, Algorithm 447: efficient algorithms for graph
manipulation., Communications of the ACM 16 (1973), no. 6, 372378.
[131] Xia Hu, Jiliang Tang, Huiji Gao, and Huan. Liu, Unsupervised sentiment analysis with emotional signals., Proceedings of the 22nd international conference on World Wide Web, WWW13, ACM, 2013.
[132] Xia Hu, Lei Tang, Jiliang Tang, and Huan. Liu, Exploiting social relations for sentiment analysis in microblogging., Proceedings of the sixth
ACM international conference on Web Search and Data Mining, 2013.
[133] P. Jaccard, Distribution de la Flore Alpine: dans le Bassin des dranses et
dans quelques regions voisines., Rouge, 1901.
[134] M.O. Jackson, Social and economic networks., Princeton University
Press, 2008.
[135] A.K. Jain and R.C. Dubes, Algorithms for clustering data., PrenticeHall, 1988.
[136] A.K. Jain, M.N. Murty, and P.J. Flynn, Data clustering: a review., ACM
Computing Surveys (CSUR) 31 (1999), no. 3, 264323.
[160] Shamanth Kumar, Fred Morstatter, Reza Zafarani, and Huan. Liu,
Whom should i follow? identifying relevant users during crises., Proceedings of the 24th ACM Conference on Hypertext and Social Media,
2013.
[161] T. La Fond and J. Neville, Randomization tests for distinguishing social
influence and homophily effects, Proceedings of the 19th international
conference on the World Wide Web, ACM, 2010, pp. 601610.
[162] A. Lancichinetti and S. Fortunato, Community detection algorithms: a
comparative analysis., Physical Review E 80 (2009), no. 5, 056117.
[163] P. Langley, Elements of machine learning., Morgan Kaufmann, 1996.
[164] Charles L. Lawson and Richard J. Hanson, Solving least squares problems, vol. 15, SIAM, 1995.
[165] H. Leibenstein, Bandwagon, snob, and veblen effects in the theory of
consumers demand., The Quarterly Journal of Economics 64 (1950),
no. 2, 183207.
[166] E.A. Leicht, P. Holme, and M.E.J. Newman, Vertex similarity in networks., Physical Review E 73 (2006), no. 2, 026120.
[167] J. Leskovec, J.M. Kleinberg, and C. Faloutsos, Graphs over time: densification laws, shrinking diameters and possible explanations, Proceedings
of the 11th ACM SIGKDD international conference on Knowledge
Discovery in Data Mining, ACM, 2005, pp. 177187.
[168] J. Leskovec, K.J. Lang, and M. Mahoney, Empirical comparison of algorithms for network community detection, Proceedings of the 19th international conference on the World Wide Web, ACM, 2010, pp. 631640.
[169] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst,
Cascading behavior in large blog graphs., Arxiv preprint arXiv:0704.2803
(2007).
[170] Jure Leskovec, Lars Backstrom, and Jon. Kleinberg, Meme-tracking
and the dynamics of the news cycle., Proceedings of the 15th ACM
SIGKDD international conference on Knowledge Discovery and
Data Mining, ACM, 2009, pp. 497506.
[171] T.G. Lewis, Network Science: theory and Applications., Wiley Publishing, 2009.
[172] D. Liben-Nowell and J.M. Kleinberg, The link-prediction problem for social networks., Journal of the American society for information science
and technology 58 (2007), no. 7, 10191031.
[173] Katri Lietsala and Esa. Sirkkunen, Social media. introduction to the tools
and processes of participatory economy, tampere university., (2008).
[174] B. Liu, Web data mining: exploring hyperlinks, contents, and usage data.,
Springer Verlag, 2007.
[175] Huan Liu and Hiroshi. Motoda, Feature extraction, construction and
selection: a data mining perspective., Springer, 1998.
[176] Huan Liu and Lei. Yu, Toward integrating feature selection algorithms
for classification and clustering., IEEE Transactions on Knowledge and
Data Engineering 17 (2005), no. 4, 491502.
[177] Jiahui Liu, Peter Dolan, and Elin Rnby. Pedersen, Personalized news
recommendation based on click behavior., Proceedings of the 15th international conference on Intelligent User Interfaces, ACM, 2010,
pp. 3140.
[178] F. Lorrain and H.C. White, Structural equivalence of individuals in social
networks., Journal of Mathematical Sociology 1 (1971), no. 1, 4980.
[179] Linyuan Lu and Tao. Zhou, Link prediction in complex networks: a
survey., Physica A: Statistical Mechanics and its Applications 390
(2011), no. 6, 11501170.
[180] Hao Ma, Michael R. Lyu, and Irwin. King, Learning to recommend
with trust and distrust relationships., Proceedings of the third ACM
conference on Recommender Systems, ACM, 2009, pp. 189196.
[181] Hao Ma, Haixuan Yang, Michael R. Lyu, and Irwin. King, Sorec:
social recommendation using probabilistic matrix factorization., Proceedings of the 17th ACM conference on Information and Knowledge
Management, ACM, 2008, pp. 931940.
[182] Hao Ma, Dengyong Zhou, Chao Liu, Michael R. Lyu, and Irwin.
King, Recommender systems with social regularization., Proceedings of
the fourth ACM international conference on Web Search and Data
Mining, ACM, 2011, pp. 287296.
[183] M.W. Macy, Chains of cooperation: threshold effects in collective action.,
American Sociological Review (1991), 730747.
[184] M.W. Macy and R. Willer, From factors to actors: computational sociology
and agent-based modeling., Annual Review of Sociology (2002), 143
166.
[185] V. Mahajan, Models for innovation diffusion., no. 48, Sage Publications,
1985.
[186] V. Mahajan and E. Muller, Innovative behavior and repeat purchase diffusion models, Proceedings of the American Marketing Educators Conference, vol. 456, 1982, p. 460.
[187] V. Mahajan and R.A. Peterson, Innovation diffusion in a dynamic potential adopter population., Management Science (1978), 15891597.
[188] E. Mansfield, Technical change and the rate of imitation., Econometrica:
Journal of the Econometric Society (1961), 741766.
[189] J.P. Martino, Technological forecasting for decision making., McGrawHill, 1993.
[190] Paolo Massa and Paolo. Avesani, Trust-aware collaborative filtering for
recommender systems., On the Move to Meaningful Internet Systems
2004: CoopIS, DOA, and ODBASE, Springer, 2004, pp. 492508.
[191] B.D. McKay, Practical graph isomorphism, vol. 1, Utilitas Mathematica, 1981, p. 45.
[192] M. McPherson, L. Smith-Lovin, and J.M. Cook, Birds of a feather:
homophily in social networks., Annual Review of Sociology (2001),
415444.
[193] D.F. Midgley and G.R. Dowling, Innovativeness: the concept and its
measurement., Journal of Consumer Research (1978), 229242.
[206] M.I. Nelson and E.C. Holmes, The evolution of epidemic influenza.,
Nature reviews genetics 8 (2007), no. 3, 196205.
[207] George L. Nemhauser and Laurence A. Wolsey, Integer and combinatorial optimization., vol. 18, Wiley New York, 1988.
[208] John Neter, William Wasserman, Michael H. Kutner, and et al., Applied linear statistical models., vol. 4, Irwin Chicago, 1996.
[209] M.E.J. Newman, Mixing patterns in networks., Physical Review E 67
(2003), no. 2, 026126.
[210]
[211]
[212]
[213] M.E.J. Newman, A.L. Barabasi, and D.J. Watts, The structure and
dynamics of networks., Princeton University Press, 2006.
[214] M.E.J. Newman, S. Forrest, and J. Balthrop, Email networks and the
spread of computer viruses., Physical Review E 66 (2002), no. 3, 035101.
[215] M.E.J. Newman and M. Girvan, Mixing patterns and community structure in networks., Statistical Mechanics of Complex Networks (2003),
6687.
[216] M.E.J. Newman, S.H. Strogatz, and D.J. Watts, Random graphs with
arbitrary degree distributions and their applications., Physical Review E
64, 026118.
[217] M.E.J. Newman, D.J. Watts, and S.H. Strogatz, Random graph models
of social networks., Proceedings of the National Academy of Sciences
of the United States of America 99 (2002), no. Suppl 1, 2566.
[218] R.T. Ng and J. Han, Efficient and Effective Clustering Methods for Spatial
Data Mining., Proceedings of the 20th International Conference on
Very Large Data Bases (1994), 144155.
[219] Jorge Nocedal and S. Wright, Numerical optimization, series in operations research and financial engineering., Springer (2006).
[220] J. Nohl, C.H. Clarke, et al., The Black Death: A Chronicle of the Plague, Westholme, 1926.
[221] Brendan OConnor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith, From tweets to polls: linking text sentiment
to public opinion time series., Proceedings of the International AAAI
Conference on Weblogs and Social Media, 2010, pp. 122129.
[222] John ODonovan and Barry. Smyth, Trust in recommender systems.,
Proceedings of the 10th international conference on Intelligent User
Interfaces, ACM, 2005, pp. 167174.
[223] J.P. Onnela and F. Reed-Tsochas, Spontaneous emergence of social influence in online systems., Proceedings of the National Academy of
Sciences 107 (2010), no. 43, 1837518380.
[224] L. Page, S. Brin, R. Motwani, and T. Winograd, The pagerank citation
ranking: bringing order to the web, (1999).
[225] G. Palla, I. Derenyi, I. Farkas, and T. Vicsek, Uncovering the overlapping
community structure of complex networks in nature and society., Nature
435 (2005), no. 7043, 814818.
[226] Gergely Palla, Albert-Laszlo Barabasi, and Tamas. Vicsek, Quantifying social group evolution., Nature 446 (2007), no. 7136, 664667.
[227] Bo Pang and Lillian. Lee, Opinion mining and sentiment analysis., Foundations and Trends in Information Retrieval 2 (2008), no. 1-2, 1135.
[228] Christos H. Papadimitriou and Kenneth. Steiglitz, Combinatorial optimization: algorithms and complexity., Courier Dover Publications,
1998.
[229] R. Pastor-Satorras and A. Vespignani, Epidemic spreading in scale-free
networks., Physical review letters 86 (2001), no. 14, 32003203.
[230] K.B. Patterson and T. Runge, Smallpox and the native american., The
American Journal of the Medical Sciences 323 (2002), no. 4, 216.
[235]
1993.
[236] W.M. Rand, Objective criteria for the evaluation of clustering methods.,
Journal of the American Statistical Association (1971), 846850.
[237] Paul Resnick and Hal R. Varian, Recommender systems., Communications of the ACM 40 (1997), no. 3, 5658.
[238] T.S. Robertson, The process of innovation and the diffusion of innovation.,
The Journal of Marketing (1967), 1419.
[239] E.M. Rogers, Diffusion of innovations., Free Press, 1995.
[240] J.H. Rohlfs and H.R. Varian, Bandwagon effects in high-technology
industries., The MIT Press, 2003.
[241] P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and
validation of cluster analysis., Journal of computational and applied
mathematics 20 (1987), 5365.
[242] B. Ryan and N.C. Gross, The diffusion of hybrid seed corn in two iowa
communities., Rural sociology 8 (1943), no. 1, 1524.
[243] G. Salton, A. Wong, and C.S. Yang, A vector space model for automatic
indexing., Communications of the ACM 18 (1975), no. 11, 613620.
[244] Gerard Salton and Michael J. McGill, Introduction to modern information retrieval, mcgraw-hill., (1986).
[245] J. Sander, M. Ester, H.P. Kriegel, and X. Xu, Density-based clustering
in spatial databases: the algorithm GDBSCAN and its applications., Data
Mining and Knowledge Discovery 2 (1998), no. 2, 169194.
[246] Badrul Sarwar, George Karypis, Joseph Konstan, and John. Riedl,
Item-based collaborative filtering recommendation algorithms., Proceedings of the 10th international conference on World Wide Web, ACM,
2001, pp. 285295.
[247] S. Scellato, M. Musolesi, C. Mascolo, V. Latora, and A. Campbell,
Nextplace: a spatio-temporal prediction framework for pervasive systems.,
Pervasive Computing (2011), 152169.
[248] J Ben Schafer, Dan Frankowski, Jon Herlocker, and Shilad. Sen, Collaborative filtering recommender systems., The Adaptive Web, Springer,
2007, pp. 291324.
[249] J Ben Schafer, Joseph Konstan, and John Riedl, Recommender systems
in e-commerce., Proceedings of the first ACM conference on Electronic
Commerce, ACM, 1999, pp. 158166.
[250] D.S. Scharfstein and J.C. Stein, Herd behavior and investment., The
American Economic Review (1990), 465479.
[251] T.C. Schelling, Dynamic models of segregation., Journal of Mathematical
Sociology 1 (1971), no. 2, 143186.
[252] T.C. Schelling, Micromotives and macrobehavior, WW Norton & Company, 2006.
[253] John. Scott, Social network analysis., Sociology 22 (1988), no. 1, 109
127.
[254] Shilad Sen, Jesse Vig, and John. Riedl, Tagommenders: connecting users
to items through tags., Proceedings of the 18th international conference
on World Wide Web, ACM, 2009, pp. 671680.
[255] C.R. Shalizi and A.C. Thomas, Homophily and contagion are generically
confounded in observational social network studies., Sociological Methods & Research 40 (2011), no. 2, 211239.
[256] R.J. Shiller, Conversation, information, and herd behavior., The American
Economic Review 85 (1995), no. 2, 181185.
[257] Borkur Sigurbjornsson and Roelof Van Zwol, Flickr tag recommendation based on collective knowledge, Proceedings of the 17th international conference on World Wide Web, ACM, 2008, pp. 327–336.
[258] G. Simmel and E.C. Hughes, The sociology of sociability., American
Journal of Sociology (1949), 254261.
[259] H.A. Simon, Bandwagon and underdog effects and the possibility of election predictions., Public Opinion Quarterly 18 (1954), no. 3, 245253.
[260]
[276]
2012.
[277]
[278] L. Tang and H. Liu, Community detection and mining in social media., Synthesis Lectures on Data Mining and Knowledge Discovery
2 (2010), no. 1, 1137.
[279] Lei Tang and Huan. Liu, Relational learning via latent social dimensions.,
Proceedings of the 15th ACM SIGKDD international conference on
Knowledge Discovery and Data Mining, ACM, 2009, pp. 817826.
[280] Lei Tang, Xufei Wang, and Huan. Liu, Community detection via heterogeneous interaction analysis., Data Mining and Knowledge Discovery
(DMKD) 25 (2012), no. 1, 1 33.
[281] G. Tarde, Las leyes de la imitacion: Estudio sociologico., Daniel Jorro,
1907.
[282] N. Thanh and T.M. Phuong, A gaussian mixture model for mobile location prediction., 2007 IEEE International Conference on Research,
Innovation and Vision for the Future, IEEE, 2007, pp. 152157.
[283] W. Trotter, Instincts of the Herd in War and Peace., 1916.
[284] Johan Ugander, Brian Karrer, Lars Backstrom, and Cameron Marlow, The anatomy of the Facebook social graph, arXiv preprint arXiv:1111.4503 (2011).
[285] T.W. Valente, Network models of the diffusion of innovations, 1995.
[286]
[287]
[288] T. Veblen, The Theory of the Leisure Class., Houghton Mifflin Boston,
1965.
[289] F. Wang and Q.Y. Huang, The importance of spatial-temporal issues for
case-based reasoning in disaster management., 2010 18th International
Conference on Geoinformatics, IEEE, 2010, pp. 15.
[290] S.S. Wang, S.I. Moon, K.H. Kwon, C.A. Evans, and M.A. Stefanone,
Face off: implications of visual cues on initiating friendship on facebook.,
Computers in Human Behavior 26 (2010), no. 2, 226234.
[291] Xufei Wang, Shamanth Kumar, and Huan. Liu, A study of tagging
behavior across social media., In SIGIR Workshop on Social Web Search
and Mining (SWSM), 2011.
[292] Xufei Wang, Lei Tang, Huiji Gao, and Huan. Liu, Discovering overlapping groups in social media., the 10th IEEE International Conference
on Data Mining (ICDM2010) (Sydney, Australia), December 14 - 17
2010.
[293] S. Warshall, A theorem on boolean matrices., Journal of the ACM (JACM)
9 (1962), no. 1, 1112.
[294] S. Wasserman and K. Faust, Social network analysis: Methods and applications., Cambridge University Press (1994).
[295] D.J. Watts, Networks, dynamics, and the small-world phenomenon.,
American Journal of Sociology 105 (1999), no. 2, 493527.
[296]
, A simple model of global cascades on random networks., Proceedings of the National Academy of Sciences 99 (2002), no. 9, 57665771.
[297] D.J. Watts and P.S. Dodds, Influentials, networks, and public opinion
formation., Journal of Consumer Research 34 (2007), no. 4, 441458.
[298] D.J. Watts and S.H. Strogatz, Collective dynamics of small-world networks., nature 393 (1998), no. 6684, 440442.
[299] I. Welch, Sequential sales, learning, and cascades., Journal of Finance
(1992), 695732.
[300] J. Weng, E.P. Lim, J. Jiang, and Q. He, Twitterrank: finding topicsensitive influential twitterers, Proceedings of the third ACM international conference on Web Search and Data Mining, ACM, 2010,
pp. 261270.
[301] D.B. West, Introduction to graph theory., vol. 2, Prentice Hall Upper
Saddle River, NJ.:, 2001.
[302] D.R. White, Structural equivalences in social networks: concepts and
measurement of role structures, Research Methods in Social Network
Analysis Conference, 1980, pp. 193234.
[303]
[304] I.H. Witten, E. Frank, and M.A. Hall, Data Mining: practical machine
learning tools and techniques., Morgan Kaufmann, 2011.
[305] R. Xu and D. Wunsch, Survey of clustering algorithms., Neural Networks, IEEE Transactions on 16 (2005), no. 3, 645678.
[306] J. Yang and J. Leskovec, Modeling information diffusion in implicit networks, IEEE 10th International Conference on Data Mining, IEEE,
2010, pp. 599608.
[307] H.P. Young, Individual strategy and social structure: an evolutionary
theory of institutions., Princeton University Press, 2001.
[308] G.U. Yule, A mathematical theory of evolution, based on the conclusions
of dr. j.c. willis, frs., Philosophical Transactions of the Royal Society
of London. Series B, Containing Papers of a Biological Character 213
(1925), 2187.
[309] W.W. Zachary, An information flow model for conflict and fission in small
groups., Journal of Anthropological Research (1977), 452473.
[310] Reza Zafarani, William D. Cole, and Huan. Liu, Sentiment propagation in social networks: a case study in livejournal., Advances in Social
Computing, Springer, 2010, pp. 413420.
[311] Reza Zafarani and Huan. Liu, Connecting corresponding identities
across communities., ICWSM, 2009.
[312]
[313] Zheng Alan Zhao and Huan. Liu, Spectral feature selection for data
mining., Chapman & Hall/CRC, 2011.
Index
degree, 81
Katz centrality, 74
divergence in computation, 75
PageRank, 76
Christakis, Nicholas A., 245
citizen journalist, 11
class, see class attribute
class attribute, 133
clique identification, 178
clique percolation method, 180
closeness centrality, 80
cluster centroid, 156
clustering, 155, 156
k-means, 156
hierarchical, 191
agglomerative, 191
divisive, 191
partitional, 156
spectral, 187
clustering coefficient, 84, 102
global, 85
local, 86
cohesive subgroup, see community
cold start problem, 286
collaborative filtering, 289, 323
memory-based, 289
item-based, 292
user-based, 290
model-based, 289, 293
collective behavior, 328
collective behavior modeling, 333
common neighbors, 324
community, 171
detection, 175
emic, see explicit
etic, see implicit
evolution, 193
explicit, 173
implicit, 174
community detection, 175
group-based, 175, 184
member-based, 175, 177
node degree, 178
node reachability, 181
node similarity, 183
community evaluation, 200
community membership behavior, 317
commute time, 327
confounding, 255
contact network, 237
content data, 316
content-based recommendation,
288
cosine similarity, 92, 183, 291
covariance, 94, 261
data point, see instance
data preprocessing, 138
aggregation, 138
discretization, 138
feature extraction, 138
feature selection, 138
sampling, 138
random, 139
stratified, 139
with/without replacement, 139
data quality, 137
duplicate data, 137
missing values, 137
noise, 137
outliers, 137
data scientist, 12
undirected, 27
visit, 36
edge list, 32
edge-reversal test, 277
eigenvector centrality, 71
entropy, 143
epidemics, 214, 236
infected, 238
recovered, 238
removed, 238
SI model, 239
SIR model, 241
SIRS model, 244
SIS model, 242
susceptible, 238
Erdos, Paul, 107
Euclidean distance, 155
evaluation dilemma, 13
evolutionary clustering, 198
external-influence model, 232
graph
adjacency list, 32
adjacency matrix, 31
edge list, 32
shortest path, 39
Dijkstra's algorithm, 49
signed graph, 34
simple graph, 34
strongly connected, 38
subgraph, 30
minimum spanning tree, 41
spanning tree, 41
Steiner tree, 42
traversal, 45
breadth-first search, 47
depth-first search, 46
tree, 40
undirected graph, 33
weakly connected, 38
weighted graph, 34
graph Laplacian, 187
graph traversal, 45
breadth-first search, 47
depth-first search, 46
group, see community
group centrality, 81
betweenness, 82
closeness, 82
degree, 81
herd behavior, 214
hitting time, 327
Holt, Charles A., 218
homophily, 255, 274
measuring, 274
modeling, 274
ICM, see independent cascade model
in-degree, 28
independent cascade model, 222
individual behavior, 315
user-community behavior, 316
user-entity behavior, 316
user-user behavior, 316
individual behavior modeling,
322
influence, 255, 264
measuring, 264
observation-based, 264
prediction-based, 264
modeling, 269
influence flow, 266
influence modeling, 269
explicit networks, 269
implicit networks, 271
information cascade, 214, 221, 323
information pathways, 78
information provenance, 249
innovation characteristics, 228
innovators, 229
instance, 133
labeled, 133
unlabeled, 133
internal-influence model, 232
intervention, 214, 220, 227, 235, 245
Prim's algorithm, 52
mixed graph, 33
mixed-influence model, 232
modeling homophily, 274
modularity, 189, 259
multi-step flow model, 230
mutual information, 203
similarity, 91
cosine similarity, 92
Jaccard similarity, 92, 94
regular equivalence, 94
structural equivalence, 92
SimRank, 328
singleton, 194
singular value decomposition, 152, 293
SIR model, 241
SIRS model, 244
SIS model, 242
six degrees of separation, 105
small world model, 333
small-world, 105
small-world model, 115
average path length, 118
clustering coefficient, 118
degree distribution, 117
Social Atom, 12
social balance, 89
social correlation, 276
social media, 11
social media mining, 12
Social Molecule, 12
social network, 25
social similarity, 255
social status, 35, 90
social tie, see edge
sociomatrix, see adjacency matrix
Solomonoff, Ray, 106
sparse matrix, 32
Spearman's rank correlation coefficient, 268, 308
spectral clustering, 187
star, 194
Stevens, Stanley Smith, 133
Strogatz, Steven, 115
random walk, 36
randomization test, 278, 321
rank correlation, 268
Rapoport, Anatol, 106
ratio cut, 185
raw data, 131
recall, 201, 307
reciprocity, 87
recommendation to groups, 297
least misery, 297
maximizing average satisfaction, 297
most pleasure, 298
recommender systems, 285
regression, 322
regular equivalence, 94, 184
regular graph, 44
ring lattice, 115
regular ring lattice, 115
regularization term, 302
relationships, see edge
relaxing cliques, 180
k-plex, 180
RMSE, see root mean squared error
Rogers, Everett M., 228
root mean squared error, 305
rooted PageRank, 327
Renyi, Alfred, 107
scale-free network, 105
self-link, see loop, 32
sender-centric model, 222
sentiment analysis, 164
shortest path, 39
Dijkstra's algorithm, 49
shuffle test, 276
SI model, 239
structural equivalence, 92
submodular function, 225
supervised learning, 140
classification, 141
decision tree learning, 141
naive bayes classifier, 144
nearest neighbor classifier, 146
with network information, 147
evaluation
k-fold cross validation, 154
accuracy, 154
leave-one-out, 154
evaluations, 153
regression, 141, 150
linear, 151
logistic, 152
SVD, see singular value decomposition
TF-IDF,
undirected graph, 33
unsupervised learning, 155
evaluation, 158
cohesiveness, 159
separateness, 159
silhouette index, 160
user migration, 329
user-item matrix, 289