Topic Detection and Classification in Social Networks: The Twitter Case

Dimitrios Milioris
Massachusetts Institute of Technology
Cambridge, MA, USA
This book provides a novel method for topic detection and classification in social networks. The book addresses several research and technical challenges currently investigated by the research community, ranging from the analysis of relations and communications between members of a community, the quality, authority, relevance and timeliness of content, traffic prediction based on media consumption, and spam detection, to security, privacy and the protection of personal information. Furthermore, the book discusses state-of-the-art techniques to address those challenges and provides novel techniques based on information theory and combinatorics, which are applied to real data obtained from Twitter. More specifically, the book:
• Detects topics from large text documents and extracts the main opinion without
any human intervention
• Provides a language-agnostic method, which is not based on a specific grammar or semantics, as, e.g., machine learning techniques are
• Compares state-of-the-art techniques and provides a smooth transition from
theory to practice with multiple experiments and results
This book discusses dynamic networks, either social or delay-tolerant networks, and gives insight into specific methods for extracting prominent information, along with methodology useful to students. It goes from theory to practice with experiments on real data.
Acknowledgments
It is a pleasure to thank the many people who made this book possible.
I would like to express my gratitude to my PhD supervisor, Dr. Philippe
Jacquet, who was abundantly helpful and offered invaluable assistance, support, and
guidance.
Deepest gratitude is also due to Prof. Wojciech Szpankowski, without whose
knowledge and assistance this study would not have been successful.
I wish to offer my regards and blessings to all of my best friends at the undergraduate and graduate levels for helping me get through the difficult times in Athens, Crete,
Paris, New York, and Boston and for all the emotional support, camaraderie,
entertainment, and caring they provided: Stamatis Z., Marios P., Dimitris K.,
Dimitris A., George T., Christos T., Kostas C., Kostas P., Dani K., Jessica S., Alaa
A., Emanuele M., Ioanna C., Vagelis V., Gérard B., Alonso S. Deepest gratitude
goes to Stella, my star, who shows me the way.
I would also like to convey my thanks to the University of Crete, École Polytechnique, Inria, Bell Laboratories, Columbia University, and the Massachusetts
Institute of Technology for providing the financial means and laboratory facilities
throughout my career.
Lastly, and most importantly, I wish to thank my sister, Maria, and my parents,
Milioris Spyridonas and Anastasiou Magdalini. They bore me, raised me, supported
me, taught me, and loved me. To them I dedicate this book.
Contents

1 Introduction
  1.1 Dynamic Social Networks
    1.1.1 The Twitter Social Network
  1.2 Research and Technical Challenges
  1.3 Problem Statement and Objectives
  1.4 Scope and Plan of the Book
2 Background and Related Work
  2.1 Introduction
  2.2 Document-Pivot Methods
  2.3 Feature-Pivot Methods
  2.4 Related Work
    2.4.1 Problem Definition
    2.4.2 Data Preprocessing
    2.4.3 Latent Dirichlet Allocation
    2.4.4 Document-Pivot Topic Detection
    2.4.5 Graph-Based Feature-Pivot Topic Detection
    2.4.6 Frequent Pattern Mining
    2.4.7 Soft Frequent Pattern Mining
    2.4.8 BNgram
  2.5 Chapter Summary
3 Joint Sequence Complexity: Introduction and Theory
  3.1 Introduction
  3.2 Sequence Complexity
  3.3 Joint Complexity
  3.4 Contributions and Results
    3.4.1 Models and Notations
    3.4.2 Summary of Contributions and Results
  3.5 Proofs of Contributions and Results
    3.5.1 An Important Asymptotic Equivalence
    3.5.2 Functional Equations
A Suffix Trees
  A.1 Suffix Tree Construction
  A.2 Suffix Trees Superposition
References
Index
Social networks have undergone a dramatic growth in recent years. Such networks
provide an extremely suitable space to instantly share multimedia information
between individuals and their neighbours in the social graph. Social networks
provide a powerful reflection of the structure and dynamics of society and of the interaction of the Internet generation with both people and technology. Indeed, the
dramatic growth of social multimedia and user generated content is revolutionizing
all phases of the content value chain including production, processing, distribution
and consumption. It has also brought to the multimedia sector a previously underestimated and now critical aspect of science and technology: social interaction and networking. The importance of this new, rapidly evolving research
field is clearly evidenced by the many associated emerging technologies and
applications, including (a) online content sharing services and communities, (b)
multimedia communication over the Internet, (c) social multimedia search, (d)
interactive services and entertainment, (e) health care and (f) security applications.
It has generated a new research area called social multimedia computing, in which
well established computing and multimedia networking technologies are brought
together with emerging social media research.
Social networking services are changing the way we communicate with others, entertain ourselves and actually live. Social networking is one of the primary reasons why more people have become avid Internet users, people who until the emergence of social networks could not find anything of interest on the Web. This is a very robust indicator
of what is really happening online. Nowadays, users both produce and consume
significant quantities of multimedia content. Moreover, their behaviour through
online communities is forming a new Internet era where multimedia content sharing
through Social Networking Sites (SNSs) is an everyday practice. More than 200
SNSs of worldwide impact are known today and this number is growing quickly.
Many of the existing top web sites are either SNSs or offer some social networking
capabilities.
Apart from the major social networks with hundreds of millions of users that span the entire world, there are also many smaller SNSs which are equally as popular
as the major social networks within the more limited geographical scope of their
membership, e.g. within a city or a country. There are also many vertically oriented
communities that gather users around a specific topic and have many dedicated
members on all continents.
Facebook is ranked among the most visited sites in the world, with more than 1.78 billion subscribed users to date. Moreover, Friendster is popular in Asia, Orkut in Brazil and VKontakte in Russia. On top of that, there are dozens of other social networks with vibrant communities, such as Vznet, Xing, Badoo, Netlog, Tuenti, Barrabes, Hyves, Nasza Klasa, LunarStorm, Zoo, Sapo, DailyMotion and so on.
There are also many vertically oriented communities which gather users around a
specific topic, such as music, books, etc. LinkedIn, with over 450 million users, Viadeo, with 65 million users, and Xing, with 14 million users, are mostly oriented towards establishing professional connections between their users and initiating potential business collaborations.
The rapid growth in popularity of social networks has enabled large numbers of
users to communicate, create and share content, give and receive recommendations,
and, at the same time, it has opened new challenging problems. The unbounded growth of content and users pushes Internet technologies to their limits and demands new solutions. Such challenges are present in all SNSs to a greater or lesser extent. A considerable amount of effort has already been devoted worldwide to
extent. Considerable amount of effort has already been devoted worldwide for
problems such as content management in large scale collections, context awareness,
multimedia search and retrieval, social graph modelling analysis and mining, etc.
Twitter is an online social networking service that enables users to send and read
short messages of up to 140 characters called “tweets”. Registered users can read
and post tweets, but unregistered users can only read them. Users access Twitter
through the website interface, SMS, or through a mobile device application. Twitter
is one of the most popular social networks and micro-blogging services in the world, and according to its website it has more than 340 million active users connected by 24 billion links. On Twitter, "following" someone means that a user will have other people's tweets (Twitter updates) in his personal timeline. "Followers" are people who receive other people's Twitter updates. Approximately 99.89% of Twitter accounts have fewer than 3500 followers and followings. There are approximately 40 million accounts with fewer than 10 followers and followings, that is, between 6% and 7% of all Twitter accounts. It is a common social convention on Twitter to ask the accounts one follows to follow back. Twitter caps the number of followings at 2000; this limit only starts growing once the user has about 1800 followers, a restriction set by Twitter to prevent users from monitoring too many accounts while having no active role on Twitter themselves. Approximately 40% of accounts have no followers and 25% have no followings. Twitter is interesting to study because it allows information to spread between people, groups and advertisers, and since the relation between its users is unidirectional, information propagation within the network is similar to the way information propagates in real life.
This section lists the main research challenges in social networks, which are
currently being investigated by the research community.
• The analysis of relations and communications between members of a community
can reveal the most influential users from a social point of view.
• As social networks continue to evolve, the discovery of communities,
users’ interests [32], and the construction of specific social graphs from large
scale social networks will continue to be a dynamic research challenge [85].
Research in dynamics and trends in social networks may provide valuable
tools for information extraction that may be used for epidemic predictions or
recommender systems [52, 87, 97].
• The information extracted from social networks has proved to be a useful tool for security. One example of a security-related application is terrorism analysis, e.g. the analysis of the 9/11 terrorist network [106]. That study was carried out by gathering public information from major newspapers on the Internet and analyzing it by means of social network analysis [102]. Therefore, cyber surveillance for critical infrastructure protection is another major research challenge in social network analysis.
• Searching in blogs, tweets and other social media is still an open issue since posts
are very small in size but numerous, with little contextual information. Moreover,
different users have different needs when it comes to the consumption of social
media. Real-time search has to balance quality, authority, relevance and
timeliness of the content [105].
• Crowdsourcing systems have provided promising solutions to problems that had remained unsolved for years. The research community nowadays works on leveraging human intelligence to solve critical problems [57, 86], since social networks contain immense knowledge through their users. However, it is not trivial to extract that knowledge [50].
• Traffic prediction based on media consumption may be correlated between
groups of users. This information can be used to dimension media servers and
network resources to avoid congestion and improve the quality of experience and
service.
Content sharing and distribution needs will continue to increase. Mobile
phones, digital cameras and other pervasive devices produce huge amounts of
data which users want to distribute if possible in real time [15].
• As the user population and data production increase, spam and advertisements will continue to grow [58]. In addition, the ability of social networks to influence the opinions of their users has to be protected by mechanisms that promote trustworthy opinions that are relevant to businesses.
• As in every human community, online social communities also face critical social and ethical issues that need special care and delicate handling. Protection of personal information and many other problems need special attention [26].
In order to address these challenges, we need to extract the relevant information from online social media in real time.
Topic detection and trend sensing is the problem of automatically detecting topics
from large text documents and extracting the main opinion without any human
intervention. This problem is of great practical importance given the massive volume
of documents available online in news feeds, electronic mail, digital libraries and
social networks.
Text classification is the task of assigning predefined labels or categories to texts
or documents. It can provide conceptual views of document collections and has
important applications in real-world problems. Nowadays, the documents which can be found online are typically organized into categories according to their subject, i.e. by topic. Some widespread applications of topic detection and text classification are
community detection, traffic prediction, dimensioning media consumption, privacy
and spam filtering, as mentioned in Sect. 1.2.
By performing topic detection on social network communities, we can group users into teams and find the most influential ones, which can be used to build specific
and strategic plans. Public information in social networks can be extracted by topic detection and classification and used for cyber surveillance in an automatic way, in order to avoid information overload. Extracting an opinion from social networks is difficult, because users write without correct syntax or grammar and use many abbreviations. Therefore, mining opinions in social networks can benefit from automatic topic detection on very short and extremely numerous posts. By grouping users and adding labels to discussions or communities, we are able to find their interests and tag people that share information very often. This information can be used to dimension media servers and network resources to avoid congestion and improve the quality of experience and service. Finally, by performing topic classification we can find similarities between posts of users that spread irrelevant information into the network, and enable a spam filter to defend against them.
In this book we present a novel method to perform topic detection, classification
and trend sensing in short texts. The importance of the proposed method comes from
the fact that up to now, the main methods used for text classification have been based on keyword detection and machine learning techniques. Using keywords or bags-of-words on tweets will often fail because of wrong or distorted usage of the words (which also requires lists of keywords to be built for every language), or because of implicit references to previous texts or messages. In general, machine learning techniques are heavy and complex and therefore are not good candidates for real-time text classification, especially in the case of Twitter, where we have natural language and thousands of tweets per second to process. Furthermore, machine learning processes have to be manually initiated by tuning parameters, which is one of the main drawbacks for this kind of application. Some other methods use information extracted by visiting the URLs contained in the text, which makes them a heavy procedure, since one may have limited or no access to the information, e.g. because of access rights or data size. In this book we address the
discussed challenges and problems of other state-of-the-art methods and propose
a method which is not based on keywords, language, grammar or dictionaries, in
order to perform topic detection, classification and trend sensing.
Instead of relying on words, as most other existing methods based on bag-of-words or n-gram techniques do, we introduce Joint Complexity (JC), defined as the cardinality of the set of all distinct common factors (substrings of characters) of two given strings. Each short sequence of text is decomposed in linear time into
a memory efficient structure called Suffix Tree (ST) and by overlapping two trees,
in linear or sublinear average time, we obtain the JC defined as the cardinality of
factors that are common in both trees. The method has been extensively tested
for text generation by Markov sources of finite order for a finite alphabet. The
Markovian generation of text gives a good approximation for natural text generation
and is a good candidate for language discrimination. One key take-away from this
approach is that JC is language-agnostic, since we can detect similarities between two texts without relying on grammar or vocabulary. Therefore there is no
need to build any specific dictionary or stemming process. JC can also be used to
capture a change in topic within a conversation, as well as a change in the style of a
specific writer of a text.
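As an illustration of the underlying idea, the following minimal sketch computes JC by brute-force substring enumeration (quadratic time, rather than the linear-time suffix tree construction used in this book); the texts and function names are illustrative only.

```python
def factors(s: str) -> set:
    """Return the set of distinct non-empty factors (substrings) of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def joint_complexity(x: str, y: str) -> int:
    """Cardinality of the set of distinct factors common to x and y."""
    return len(factors(x) & factors(y))

# Two short messages on the same topic share many more factors than unrelated ones.
print(joint_complexity("earthquake hits the city centre",
                       "minor earthquake felt in the city"))
print(joint_complexity("earthquake hits the city centre",
                       "the recipe needs two cups of flour"))
```

No grammar, dictionary or stemming is involved: the comparison operates directly on character substrings, which is what makes the approach language-agnostic.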
On the other hand, the inherent sparsity of the data space naturally motivated the use of the recently introduced theory of Compressive Sensing (CS) [12, 18], driven by the problem of target localization [21]. More specifically, the problem
of estimating the unknown class of a message is reduced to a problem of recovering
a sparse position-indicator vector, with all of its components being zero except for
the component corresponding to the unknown class where the message is placed.
CS states that signals which are sparse or compressible in a suitable transform basis
can be recovered from a highly reduced number of incoherent random projections,
in contrast to the traditional methods dominated by the well-established Nyquist-
Shannon sampling theory. The method works in conjunction with a Kalman filter to
update the states of a dynamical system as a refinement step.
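The following toy sketch illustrates only the sparse recovery step (random projections of a 1-sparse class-indicator vector and a single matching-pursuit step); the dimensions are arbitrary, and the JC-based construction of the signal vectors and the Kalman filter refinement of Chap. 4 are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_projections = 20, 10       # far fewer projections than classes

# Sparse position-indicator vector: all zeros except at the unknown class.
x_true = np.zeros(n_classes)
x_true[13] = 1.0

Phi = rng.standard_normal((n_projections, n_classes))  # incoherent random projections
y = Phi @ x_true                                       # compressed measurements

# For a 1-sparse indicator, a single matching-pursuit step suffices:
# the column of Phi most correlated with y points to the unknown class.
estimated_class = int(np.argmax(np.abs(Phi.T @ y)))
print(estimated_class)   # recovers class 13 with high probability
```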
In this book we exploit datasets collected using the Twitter streaming API, containing tweets in various languages, and we obtain very promising results when compared to state-of-the-art methods.
In this book, a novel method for topic detection, classification and trend sensing
in Dynamic Social Networks is proposed and implemented. Such a system is able to
address the research and technical challenges mentioned in Sect. 1.2. The structure
of this book is organized as follows.
First, Chap. 2 overviews the state-of-the-art of topic detection, classification and
trend sensing techniques for online social networks. First, it describes the document-
pivot and feature-pivot methods, along with a brief overview of the pre-processing
stage of these techniques. Six state-of-the-art methods (LDA, Doc-p, GFeat-p, FPM, SFPM, BNgram) are described in detail, as they serve as the performance benchmarks for the proposed system.
In Chap. 3, we introduce the Joint Sequence Complexity method. This chapter
describes the mathematical concept of the complexity of a sequence, which is
defined as the number of distinct factors (substrings) of the given sequence. The decomposition of a sequence into subcomponents is done by suffix trees, which provide a simple, fast, and low-complexity way to store and recall substrings from memory. We
define and use Joint Complexity for evaluating the similarity between sequences
generated by different sources. Markov models describe the generation of natural text well, and their performance can be predicted via linear algebra, combinatorics
and asymptotic analysis. We exploit Markov sources trained on different natural
language datasets, for short and long sequences, and perform automated online
sequence analysis on information streams in Twitter.
Then, Chap. 4 introduces the Compressive Sensing based classification method.
Driven by the methodology of indoor localization, the algorithm converts the
classification problem into a signal recovery problem, so that CS theory can be
applied. First we employ Joint Complexity to perform topic detection and build
signal vectors. Then we apply the theory of CS to perform topic classification
1.4 Scope and Plan of the Book 7
Abstract Topic detection and tracking aims at extracting topics from a stream of
textual information sources, or documents, and to quantify their “trend” in real
time. These techniques apply on pieces of texts, i.e. posts, produced within social
media platforms. Topic detection can produce two complementary types of output: clusters of documents, or terms that are selected and then clustered. In the first method,
referred to as document-pivot, a topic is represented by a cluster of documents,
whereas in the latter, commonly referred to as feature-pivot, a cluster of terms
is produced instead. In the following, we review several popular approaches that
fall in either of the two categories. Six state-of-the-art methods: Latent Dirichlet
Allocation (LDA), Document-Pivot Topic Detection (Doc-p), Graph-Based Feature-
Pivot Topic Detection (GFeat-p), Frequent Pattern Mining (FPM), Soft Frequent
Pattern Mining (SFPM), BNgram are described in detail, as they serve as the
performance benchmarks to the proposed system.
2.1 Introduction
Topic detection and tracking aims at extracting topics from a stream of textual
information sources, or documents, and to quantify their “trend” in real time [3].
These techniques apply on pieces of texts, i.e. posts, produced within social
media platforms. Topic detection can produce two complementary types of output: clusters of documents, or terms that are selected and then clustered. In the first method,
referred to as document-pivot, a topic is represented by a cluster of documents,
whereas in the latter, commonly referred to as feature-pivot, a cluster of terms
is produced instead. In the following, we review several popular approaches that
fall in either of the two categories. Six state-of-the-art methods: Latent Dirichlet
Allocation (LDA), Document-Pivot Topic Detection (Doc-p), Graph-Based Feature-
Pivot Topic Detection (GFeat-p), Frequent Pattern Mining (FPM), Soft Frequent
Pattern Mining (SFPM), BNgram are described in detail, as they serve as the
performance benchmarks to the proposed system.
media mining. The idea behind those methods is that “breaking news”, unlike other
discussion topics, happen to reach a fast peak of attention from routine users as
soon as they are tweeted or posted [53, 108]. Accordingly, the common framework
which underlies most of the approaches in this category first identifies bursty terms
and then clusters them together to produce topic definitions.
The diffusion of services over social media and the detection of bursty events had first been studied in generic document sets. The method presented in [25], for instance, detects bursty terms by examining their frequency in a given time window. Once the bursty terms are found, they are clustered using a probabilistic model of cooccurrence. The need for such a global topic term distribution restricts this approach to a batch mode of computation. Similar methods were tested for topic detection in social media, such as Twitter, but with additional emphasis on
the enrichment of the obtained topics with non-bursty but relevant terms, URLs and
locations [59].
Graph-based approaches detect term clusters based on their pairwise similarities.
The algorithm proposed in [90] builds a term cooccurrence graph, whose nodes are
clustered using a community detection algorithm based on betweenness centrality,
which is an indicator of a node’s centrality in a network and is equal to the number of
shortest paths from all vertices to all others that pass through that node. Additionally,
the topic description is enriched with the documents which are most relevant to the
identified terms. Graphs of short phrases, rather than of single terms, connected
by edges representing similarity have also been used [54]. Graph-based approaches
have also been used in the context of collaborative tagging systems with the goal of
discovering groups of tags pertaining to topics of social interest [80].
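A toy sketch of the graph-based idea follows (it is not the exact algorithm of [90]; the networkx library, the example posts and the use of the Girvan–Newman edge-betweenness procedure are illustrative assumptions).

```python
import itertools
import networkx as nx
from networkx.algorithms.community import girvan_newman

posts = [
    ["olympics", "opening", "ceremony", "london"],
    ["olympics", "ceremony", "medal"],
    ["election", "vote", "debate", "london"],
    ["election", "debate", "candidate"],
]

# Term cooccurrence graph: one weighted edge per pair of terms sharing a post.
G = nx.Graph()
for terms in posts:
    for u, v in itertools.combinations(sorted(set(terms)), 2):
        if G.has_edge(u, v):
            G[u][v]["weight"] += 1
        else:
            G.add_edge(u, v, weight=1)

# Girvan-Newman repeatedly removes high-betweenness edges; the first split
# already separates the two underlying term communities in this small example.
first_split = next(girvan_newman(G))
print([sorted(c) for c in first_split])
```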
Signal processing approaches have also been explored in [103], which compute
df-idf (a variant of tf-idf ) for each term in each considered time slot, and then apply
wavelet analysis on consecutive blocks. The difference between the normalized
entropy of consecutive blocks is used to construct the final signal. Relevant terms
which are bursty are extracted by computing the autocorrelation of the signal and
heuristically learning and determining a threshold to detect new bursty terms. Also
in this case, a graph between selected terms is built based on their cross-correlation
and it is then clustered to obtain event definitions. The Discrete Fourier Transform
is used in [31], where the signal for each term is classified according to its power
and periodicity. Depending on the identified class, the distribution of appearance
of a term in time is modelled using Gaussian distributions. The Kullback–Leibler divergence (a relative entropy) between the distributions is then used to determine clusters, which increases the computational complexity of the method.
The knowledge of the community leads to even more sophisticated approaches.
In a recent work [13] a PageRank-like measure is used to identify important users
on the Twitter social network. This centrality score is combined with a measure of term frequency to obtain a score for each term. Then, clustering on a correlation
graph of bursty terms delineates topic boundaries.
These methods are based on the analysis of similarities between terms and often give wrong correlations between topics, with their main disadvantage being the use of
dictionaries and stemming processes.
We address the task of detecting topics in real-time from social media streams. To
keep our approach general, we consider that the stream is made of short pieces of
text generated by social media users, e.g. posts, messages or tweets in the specific
case of Twitter. Posts are formed by a sequence of words or terms, and each one is
marked with the timestamp of creation. A plethora of methods assume a user-centred scenario in which the user starts up the detection system by providing a set of seed terms that are used as an initial filter to narrow down the analysis only to the posts containing at least one of the seed terms. Additionally, it is assumed that a time frame of interest (which can be indefinitely long) and a desired update rate are provided (e.g. detect new trending topics every 15 min). The expected output
of the algorithm is a topic, defined as a headline and a list of terms, delivered at
the end of each time slot determined by the update rate. This setup fits well many
real-world scenarios in which an expert of some domain has to monitor specific
topics or events being discussed in social media [17, 87]. For instance, this is the
case for computational journalism in which the media inquirer is supposed to have
enough knowledge of the domain of interest to provide initial terms to perform an
initial filtering of the data stream. Even if it requires an initial human input, this
framework still remains generic and suitable to any type of topic or event.
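A minimal sketch of this setup is given below (field names, seed terms and the 15-minute update rate are illustrative assumptions, not part of any specific system).

```python
from datetime import datetime, timedelta
from collections import defaultdict

UPDATE_RATE = timedelta(minutes=15)
seed_terms = {"olympics", "ceremony"}        # provided by the domain expert
t0 = datetime(2012, 7, 27, 20, 0)            # start of the time frame of interest

posts = [
    {"text": "Opening ceremony starts now #olympics", "created_at": t0 + timedelta(minutes=3)},
    {"text": "Stuck in traffic again", "created_at": t0 + timedelta(minutes=7)},
    {"text": "Fireworks over the stadium #olympics", "created_at": t0 + timedelta(minutes=21)},
]

# Keep only posts containing at least one seed term, bucketed by update slot.
slots = defaultdict(list)
for post in posts:
    terms = set(post["text"].lower().replace("#", " ").split())
    if terms & seed_terms:
        slot = int((post["created_at"] - t0) / UPDATE_RATE)
        slots[slot].append(post["text"])

for slot, texts in sorted(slots.items()):
    print(slot, texts)   # a topic detection method is then run once per slot
```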
Section 2.4.6 described a frequent pattern mining approach for topic detection. It provides an elegant solution to the limitation of feature-pivot methods, which take into account only pairwise cooccurrences between terms, in the case of a corpus with densely interconnected topics. Section 2.4.5 examined only pairwise cooccurrences, whereas frequent pattern mining examines cooccurrences between any number of terms, typically larger than two. A question that naturally arises is whether it is
possible to formulate a method that lies between these two extremes. Such a method
would examine cooccurrence patterns between sets of terms with cardinality larger
than two, like frequent pattern mining does, but it would be less strict by not
requiring that all terms in these sets cooccur frequently. Instead, in order to ensure
topic cohesiveness, it would require that large subsets of the terms grouped together,
but not necessarily all, cooccur frequently, resulting in a “soft” version of frequent
pattern mining.
The proposed approach (SFPM) works by maintaining a set of terms S, to which new terms are added in a greedy manner, according to how often they cooccur with the terms already in S. In order to quantify the cooccurrence match between a set S and a candidate term t, a vector D_S for S and a vector D_t for the term t are maintained, both with dimension n, where n is the number of documents in the collection. The i-th element of D_S denotes how many of the terms in S cooccur in the i-th document, whereas the i-th element of D_t is a binary indicator that represents whether the term t occurs in the i-th document or not. Thus, the vector D_t for a term t that frequently cooccurs with the terms in the set S will have a high cosine similarity to the corresponding vector D_S. Note that some of the elements of D_S may have the value |S|, meaning that all items in S occur in the corresponding documents, whereas others may have a smaller value, indicating that only a subset of the terms in S cooccur in the corresponding documents. For a term that is examined for expansion of S, it is clear that there will be some contribution to the similarity score also from the documents in which not all terms cooccur, albeit somewhat smaller compared to those documents in which all terms cooccur. In this way, a "soft" matching between a term that is considered for expansion and the set S is achieved. Finding the best matching term can be done either using exhaustive search or some approximate nearest neighbour scheme such as LSH. As mentioned, a greedy approach that expands the set S with the best matching term is used, thus a criterion is needed to terminate the expansion process. The termination criterion clearly has to deal with the cohesiveness of the generated topics, meaning that if it is not properly set, the resulting topics may either end up having too few terms or really be a mixture of topics (many terms related to possibly irrelevant topics). To deal with this, a cosine similarity threshold θ(S) between S and the next best matching term is used. If the similarity is above the threshold, the term is added; otherwise the expansion process stops. This threshold is the only parameter of the proposed algorithm and is set to be a function of the cardinality of S. In particular, a
sigmoid function of the following form is used:
θ(S) = 1 − 1/(1 + exp((|S| − b)/c))
The parameters b and c can be used to control the size of the term clusters and how soft the cooccurrence constraints will be. In practice, the values of b and c are set so that the addition of terms when the cardinality of S is small is easier (the threshold is low), while the addition of terms when the cardinality is larger is harder. A low threshold for small values of |S| is required so that it is possible for terms that are associated with different topics, and therefore occur in documents other than those corresponding to the non-zero elements of D_S, to join the set S. The high threshold for larger values of |S| is required so that S does not
grow without limit. Since a set of topics is required, rather than a single topic, the greedy search procedure is applied as many times as the number of considered terms, each time initializing S with a candidate term. This will produce as many topics as the number of terms considered, many of which will be duplicates, thus a post-processing of the results is needed to remove these duplicates. To keep the search procedure within reasonable limits, the top n terms with the highest likelihood ratio are selected, following the methodology in Sect. 2.4.5.
When the “soft” frequent pattern matching algorithm runs for some time, the
vector DS may include too many non-zero entries filled with small values, especially
if some very frequently occurring term has been added to the set. This may have the
effect that a term may be deemed relevant to S because it cooccurs frequently only
with a very small number of terms in the set rather than with most of them. In order
to deal with this issue, after each expansion step, any entries of DS that have a value
smaller than jSj=2 are reset to zero. The most relevant documents for a topic can be
directly read from its vector DS : the ones with the highest document counts.
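A compact sketch of this greedy expansion is given below (numpy is assumed; the seed selection by likelihood ratio, the approximate nearest-neighbour search and the duplicate removal are omitted, and the threshold parameters b and c are arbitrary example values).

```python
import numpy as np

def theta(set_size, b=4.0, c=1.0):
    """Sigmoid threshold: low for small |S|, close to 1 for large |S|."""
    return 1.0 - 1.0 / (1.0 + np.exp((set_size - b) / c))

def expand_topic(seed, term_doc):
    """Greedy SFPM-style expansion of a topic starting from a seed term.

    term_doc maps each term t to its binary occurrence vector D_t over the documents.
    """
    S = [seed]
    D_S = term_doc[seed].astype(float).copy()
    while True:
        best_term, best_sim = None, -1.0
        for t, D_t in term_doc.items():
            if t in S:
                continue
            sim = D_S @ D_t / (np.linalg.norm(D_S) * np.linalg.norm(D_t) + 1e-12)
            if sim > best_sim:
                best_term, best_sim = t, sim
        if best_term is None or best_sim < theta(len(S)):
            break                              # cohesiveness threshold reached
        S.append(best_term)
        D_S += term_doc[best_term]
        D_S[D_S < len(S) / 2.0] = 0.0          # reset weakly supported entries
    return S

term_doc = {                                   # 5 documents, 4 candidate terms
    "olympics": np.array([1, 1, 1, 0, 0]),
    "ceremony": np.array([1, 1, 0, 0, 0]),
    "medal":    np.array([0, 1, 1, 0, 0]),
    "election": np.array([0, 0, 0, 1, 1]),
}
print(expand_topic("olympics", term_doc))      # ['olympics', 'ceremony', 'medal']
```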
2.4.8 BNgram
Both the frequent itemset mining and soft frequent itemset mining approaches attempt to take into account the simultaneous cooccurrences of more than two terms. However, it is also possible to achieve a similar result in a simpler way by using n-grams. This naturally groups together terms that cooccur, and it may be considered to offer a first level of term grouping. Using n-grams makes particular sense for Twitter, since a large number of the status updates in Twitter are just copies or retweets of previous messages, so important n-grams will tend to become frequent.
Additionally, a new feature selection method is introduced. The changing frequency of terms over time is taken into account as a useful source of information for detecting emerging topics. The main goal of this approach is to find emerging topics in post streams by comparing the term frequencies of the current time slot with those of the preceding time slots. The df-idf_t metric, which introduces time into the classic tf-idf score, is proposed. Historical data are used to penalize those topics which began in the past and are still popular in the present, and which therefore do not define new topics.
The term indices, implemented using Lucene, are organized into different time slots. In addition to single terms, the index also considers bigrams and trigrams. Once the index is created, the df-idf_t score is computed for each n-gram of the current time slot i, based on its document frequency in this time slot and penalized by the logarithm of the average of its document frequencies in the previous t time slots:
score_{df-idf_t} = (df_i + 1) / (log((Σ_{j=1}^{t} df_{i−j}) / t + 1) + 1)
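A simple implementation of this score is sketched below (the Lucene index is replaced by plain Python counters, the n-gram extraction is simplified, and the natural logarithm is assumed).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def doc_freq(slot_docs, max_n=3):
    """Document frequency of every unigram, bigram and trigram in one time slot."""
    df = Counter()
    for doc in slot_docs:
        tokens = doc.lower().split()
        df.update({g for n in range(1, max_n + 1) for g in ngrams(tokens, n)})
    return df

def df_idf_t(term, current_df, history_dfs):
    """(df_i + 1) / (log(average past document frequency + 1) + 1)."""
    t = len(history_dfs)
    past_avg = sum(df.get(term, 0) for df in history_dfs) / max(t, 1)
    return (current_df.get(term, 0) + 1) / (math.log(past_avg + 1) + 1)

history = [doc_freq(["nothing new today", "quiet afternoon"]) for _ in range(3)]
current = doc_freq(["breaking earthquake downtown",
                    "earthquake felt downtown",
                    "earthquake news"])
for g in ("earthquake", "earthquake felt downtown", "quiet afternoon"):
    print(g, round(df_idf_t(g, current, history), 2))   # bursty n-grams score highest
```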
This chapter overviewed the state of the art of topic detection, classification and
trend sensing techniques for online social networks. First, it described the document-
pivot and feature-pivot methods, along with a brief overview of the pre-processing
stage of these techniques. Six state-of-the-art methods: LDA, Doc-p, GFeat-p,
FPM, SFPM, BNgram were described in detail, as they serve as the performance
benchmarks to the proposed system.
Chapter 3
Joint Sequence Complexity: Introduction
and Theory
3.1 Introduction
In this chapter we study joint sequence complexity and we introduce its applications
for topic detection and text classification, in particular source discrimination. The
mathematical concept of the complexity of a sequence is defined as the number of
distinct factors of it. The Joint Complexity is thus the number of distinct common
factors of two sequences. Sequences containing many common parts have a higher
Joint Complexity. The extraction of the factors of a sequence is done by suffix trees,
which provide a simple and fast (low-complexity) way to store and retrieve them from memory. Joint Complexity is used for evaluating the similarity between
sequences generated by different sources and we will predict its performance over
Markov sources. Markov models describe well the generation of natural text, and
their performance can be predicted via linear algebra, combinatorics and asymptotic
analysis. This analysis follows in this chapter. We exploit datasets from different
natural languages, for both short and long sequences, with promising results on
complexity and accuracy. We performed automated online sequence analysis on
information streams in Twitter.
In recent decades, several attempts have been made to capture mathematically the concept of the "complexity" of a sequence. In [48], the sequence complexity was defined as the number of different factors contained in a sequence. If X is a sequence and I(X) its set of factors (distinct substrings), then the cardinality |I(X)| is the complexity of the sequence. For example, if X = aabaa, then I(X) = {ν, a, b, aa, ab, ba, aab, aba, baa, aaba, abaa, aabaa}, and |I(X)| = 12, where ν denotes the empty string. Sometimes the complexity of the sequence is called the I-Complexity (IC) [5]. The notion is connected with quite deep mathematical properties, including the rather elusive concept of randomness in a string [34, 55, 75].
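The definition can be checked directly with a few lines of code (a naive enumeration of substrings; the empty string is included, as in the example above).

```python
def I(x: str) -> set:
    """All distinct factors of x, including the empty string."""
    return {x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)}

X = "aabaa"
print(sorted(I(X), key=lambda f: (len(f), f)))
print(len(I(X)))   # 12, as stated in the example
```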
γ n^κ / √(α log n)

for some κ < 1 and γ, α > 0 which depend on the parameters of the two sources. When the sources are identical, i.e. when their parameters are identical but the texts are still independently generated, the JC growth is O(n), hence κ = 1. When the texts are identical (i.e. X = Y), the JC is identical to the I-Complexity and it grows as n²/2 [48]. Therefore the JC method can already be used to detect "copy–paste" parts between texts. Indeed, the presence of a common factor of length O(n) would inflate the JC by a term O(n²).
We should point out that experiments demonstrate that for memoryless sources the JC estimate

γ n^κ / √(α log n)

converges very slowly. Therefore, JC is not really meaningful even when n ≈ 10^9. In this work we derive second-order asymptotics for JC of the following form

γ n^κ / √(α log n + β)

for some β > 0. Indeed, it turns out that for texts where n < 100 and log n < β, this new estimate converges more quickly than the estimate

γ n^κ / √(α log n),

thus it can be used for short texts, like tweets. In fact, our analysis indicates that JC can be refined via a factor of the form

P(1/(α log n + β))

appearing in the JC, where P is a specific polynomial determined via saddle point expansion. This additional term further improves the convergence for small values of n; also, some periodic factors of small amplitude appear when the source parameters satisfy some specific and very infrequent conditions.
In this work we extend the JC estimate to Markov sources of any order over a finite alphabet. Although Markov models are no more realistic than memoryless sources for, say, a DNA sequence, they seem to be fairly realistic for text generation [43]. An example of Markov-simulated text of different orders, trained on "The Picture of Dorian Gray", is shown in Table 3.1.
In view of these facts, we can use the JC to discriminate between two identical/non-identical Markov sources [109]. We introduce the discriminant function as follows:

d(X, Y) = 1 − (1/log n) log J(X, Y)
Table 3.1 Markov simulated text from the book “The Picture of Dorian Gray”
Markov order Text simulated by the given Markov order
3 “Oh, I do yourse trought lips whose-red from to his, now far taked. If Dorian,
Had kept has it, realize of him. Ther chariten suddenial tering us. I don’t belige
had keption the want you are ters. I am in the when mights horry for own that
words is Eton of sould the Oh, of him to oblige mere an was not goods”
4 “Oh, I want your lives. It is that is words the right it his find it man at they
see merely fresh impulses. But when you have you, Mr. Gray, a fresh impulse
of mine. His sad stifling round of a regret. She is quite devoted forgot an
arrowed”
5 “Oh, I am so sorry. When I am in Lady Agatha’s black book that the air of
God, which had never open you must go. I am painting, and poisons us. We
are such a fresh impulse of joy that he has done with a funny looked at himself
unspotted”
when the lengths of X and Y are both equal to n. In this work we
concentrate mainly on the analysis of the JC method, however, we also present some
experimental evidence of how useful our discriminant is for real texts.
In Fig. 3.1, we compare the JC of a real English text with simulated texts of the same length written in French, Greek, Polish and Finnish (all languages are transcribed into the Latin alphabet and simulated from a Markov source of order 3). It is easy to see that even for texts smaller in length than a thousand words, one can discriminate between these languages. By discriminating, we mean that the JC between texts of different languages drops significantly in comparison to the JC of texts of the same language. The figure shows that Polish, Greek and Finnish are further from English than French is. On the other hand, in Fig. 3.2, we plot the similarity between real and simulated texts in French, Greek, Polish, English and Finnish.
In Polish, the second part of the text shifts to a different topic, and we can see that the method can capture this difference. Clearly, the JC of such texts grows like O(n), as predicted by the theory. In fact, computations show that with Markov models of order 3 for English versus French we have κ = 0.44; versus Greek: κ = 0.26; versus Finnish: κ = 0.04; and versus Polish: κ = 0.01, which is consistent with the results in Fig. 3.1, except for the low values of κ, where the convergence to the asymptotic regime is slower. In fact, they agree with the actual resolution of Eq. (3.7), which contains the transition to the asymptotic regime. A comparison between different topics or subjects is presented in Fig. 3.3. We test four texts from books on constitutional and copyright law as well as texts extracted from two cookbooks. As we can see, the method can distinguish the differences well, and shows increased similarity for the same topic.
Fig. 3.1 Joint Complexity (y-axis) versus text length n (x-axis): real English text compared with simulated texts in French, Greek, Polish and Finnish
Fig. 3.2 Joint Complexity (y-axis) versus text length n (x-axis): real and simulated texts (3rd Markov order) in the English, French, Greek, Polish and Finnish languages
Fig. 3.3 Joint Complexity (y-axis) versus text length n (x-axis) of real texts from a variety of books, spanning constitutional and copyright law to healthy cooking and recipes from the cuisine of Italy
The complexity of a single string has been studied extensively. The literature is reviewed in [48], where a precise analysis of string complexity is discussed for strings generated by unbiased memoryless sources. Another analysis of the same situation was proposed in [36], which was the first work to present the joint string complexity for memoryless sources. It is evident from [36] that a precise analysis of JC is quite challenging due to the intricate singularity analysis and an infinite number of saddle points. Here, we deal with the joint string complexity applied to Markov sources, which to the best of our knowledge had never been tackled before. The analysis requires two-dimensional asymptotic analysis with two-variable Poisson generating functions. In the following sections, we begin with a discussion of models and notations, and we present a summary of the contributions and main results. Next, we present an extended overview of the theoretical analysis, and apply the results in the context of the Twitter social network. We present a study on expansion asymptotics and periodic terms.
We detail our main results below. We introduce some general notations and then
present a summary.
Let ω and w be two strings over a finite alphabet A (e.g. A = {a, b}). We denote by |ω|_w the number of times w occurs as a factor of ω (e.g. |abbba|_bb = 2). By convention |ω|_ε = |ω| + 1, where ε is the empty string, because the empty string is a prefix and a suffix of ω and also appears between the characters, and |ω| is the length of the string.
Throughout, we denote by X a string (text) whose complexity we plan to study. We also assume that its length |X| is equal to n. Then the string complexity is I(X) = {w : |X|_w ≥ 1}. Observe that

|I(X)| = Σ_{w ∈ A*} 1_{|X|_w ≥ 1},

where 1_B is the indicator function of a Boolean B. Notice that |I(X)| is equal to the number of nodes in the associated suffix tree of X [40, 48, 92]. We will come back to the suffix tree in Sect. 3.8.
Let X and Y be two sequences (not necessarily of the same length). We have defined the Joint Complexity as the cardinality of the set J(X, Y) = I(X) ∩ I(Y). We have

|J(X, Y)| = Σ_{w ∈ A*} 1_{|X|_w ≥ 1} 1_{|Y|_w ≥ 1}.
We now assume that the strings X and Y are generated respectively by two independent Markov sources of order r, so-called source 1 and source 2. We will only deal here with Markov sources of order 1, but the extension to arbitrary order is straightforward. We assume that source i, for i ∈ {1, 2}, has the transition probabilities P_i(a|b) from term b to term a, where (a, b) ∈ A × A^r. We denote by P_1 (resp. P_2) the transition matrix of Markov source 1 (resp. source 2). The stationary distributions are respectively denoted by π_1(a) and π_2(a) for a ∈ A^r.
Let X_n and Y_m be two strings of respective lengths n and m, generated by Markov source 1 and Markov source 2, respectively. We write J_{n,m} = E(|J(X_n, Y_m)|) − 1 for the joint complexity, i.e. omitting the empty string.
κ = min_{(s_1, s_2) ∈ K} {−s_1 − s_2},

with κ < 1.
Lemma 3.1 We have either c_1 > 0, or c_2 > 0, or (c_1, c_2) ∈ [−1, 0]².
Theorem 3.3 Assume P(s_1, s_2) is not nilpotent and either c_1 > 0 or c_2 > 0. We only handle c_2 > 0, the case c_1 > 0 being obtained by symmetry.
(i) [Noncommensurable Case.] We assume that P_2 is not logarithmically balanced. Let c_0 < 0 be such that (c_0, 0) ∈ K. There exist γ_1 and ε > 0 such that
The case when both c_1 and c_2 are between −1 and 0 is the most intricate to handle.
Theorem 3.4 Assume that c_1 and c_2 are between −1 and 0.
(i) [Noncommensurable Case.] When P_1 and P_2 are not logarithmically commensurable matrices, then there exist α_2, β_2 and γ_2 such that

J_{n,n} = (γ_2 n^κ / √(α_2 log n + β_2)) (1 + O(1/log n)).   (3.3)
where P(w) is the probability that w is a prefix of X_n and A_w(z) is the "autocorrelation" polynomial of the word w [40]. For the Markov source, we omit the expression, which carries extra indices that track the Markov correlations of the starting symbols of the words. A complete description of the parameters can be found in [20, 40].
Although it is a closed formula, this expression is not easy to manipulate. To make the analysis tractable we notice that w ∈ I(X_n) is equivalent to the fact that w is a prefix of at least one of the n suffixes of X_n.
If the suffixes were n independent infinite strings, then P(w ∈ I(X_n)) would be equal to 1 − (1 − P_1(w))^n, whose generating function is

P_1(w) z / ((1 − z)(1 − z + P_1(w) z)),

while the exact generating function, which accounts for the dependence between suffixes, is

P_1(w) z / ((1 − z) D_w(z)).
The proof is not developed in [20, 40] and seems to be rather complicated. It can
be found in [44].
Let a ∈ A. We denote

C_{a,n,m} = Σ_{w ∈ aA*} P(w ∈ I_1(n)) P(w ∈ I_2(m))

π_2(a)^{m_a} (1 − π_2(a))^{m − m_a} C_{a,n_a,m_a},

since π_i(a) is the probability that a string from source i starts with symbol a.
We introduce the double Poisson transform of C_{a,n,m} as

C_a(z_1, z_2) = Σ_{n,m ≥ 0} C_{a,n,m} (z_1^n z_2^m / (n! m!)) e^{−z_1−z_2},   (3.6)

which translates the recurrence (in the formula above, n_a tracks the number of suffixes starting with symbol a) into the following functional equation:
satisfies
The asymptotics of the coefficients C_{n,m} are extracted from the asymptotics of the function C(z_1, z_2) when ℜ(z_1), ℜ(z_2) → ∞. This is an extension of the dePoissonization theorems of [41, 43, 92], which are used to prove the following lemma.
Lemma 3.3 (DePoissonization) When n and m tend to infinity:
We first present a general result when the Markov sources are identical: P_1 = P_2 = P. In this case (3.7) can be rewritten with c_a(z) = C_a(z, z):

c_b(z) = (1 − e^{−z})² + Σ_{a ∈ A} c_a(P(a|b) z).   (3.10)

This equation is directly solvable by the Mellin transform c_a*(s) = ∫_0^∞ c_a(x) x^{s−1} dx, defined for −2 < ℜ(s) < −1. For all b ∈ A we find [92]

c_b*(s) = (2^{−s} − 2) Γ(s) + Σ_{a ∈ A} (P(a|b))^{−s} c_a*(s).   (3.11)

Introducing c*(s) = ∫_0^∞ C(z, z) z^{s−1} dz [24], and using the property of the Mellin transform ∫_0^∞ f(ax) x^{s−1} dx = a^{−s} ∫_0^∞ f(x) x^{s−1} dx, the definition domain of c*(s) is ℜ(s) ∈ (−2, −1). Thus

c*(s) = (2^{−s} − 2) Γ(s) (1 + ⟨1(I − P(s))^{−1} | π(s)⟩)   (3.12)
where 1 is the vector of dimension |A| made of all 1's, I is the identity matrix, P(s) = P(s, 0) = P(0, s), π(s) is the vector made of the coefficients π(a)^{−s}, and ⟨·|·⟩ denotes the inner product.
By applying the methodology of Flajolet [23, 92], the asymptotics of c(z) for |arg(z)| < ε are given by the residues of the function c*(s) z^{−s} occurring at s = −1 and s = 0. They are respectively equal to

(2 log 2 / h) z   and   1 − ⟨1(I − P(0, 0))^{−1} | π(0)⟩,

where h denotes the entropy rate of the source. The first residue comes from the singularity of (I − P(s))^{−1} at s = −1. This leads to the formula expressed in Theorem 3.1(i). When P is logarithmically rationally balanced, there are additional poles on a countable set of complex numbers s_k, regularly spaced on the same vertical line containing −1 and such that P(s_k) has eigenvalue 1. These poles contribute the periodic terms in Theorem 3.1(ii).
Computations on the trained transition matrix show that a Markov model of
order 3 for English text has entropy of 0.944221, while French text has an entropy
of 0.934681, Greek text has an entropy of 1.013384, Polish text has an entropy
of 0.665113 and Finnish text has an entropy of 0.955442. This is consistent with
Fig. 3.2.
In this section we identify the constants in Theorems 3.3 and 3.4 under the assumption P_1 ≠ P_2. We cannot obtain a functional equation for the C_a(z, z)'s, and we thus have to deal with the two variables z_1 and z_2. We define the double Mellin transform C_a*(s_1, s_2) = ∫_0^∞ ∫_0^∞ C_a(z_1, z_2) z_1^{s_1−1} z_2^{s_2−1} dz_1 dz_2, and similarly the double Mellin transform C*(s_1, s_2) of C(z_1, z_2). We should thus have the identity, which leads to

C*(s_1, s_2) = Γ(s_1) Γ(s_2) (1 + ⟨1(I − P(s_1, s_2))^{−1} | π(s_1, s_2)⟩).   (3.14)
In order to ensure convergence, one actually works with

(∂/∂z_1)(∂/∂z_2) C(z_1, z_2) − C(0, z_2) z_1 e^{−z_1} − C(z_1, 0) z_2 e^{−z_2},
which leads to exponentially decaying terms, but we omit this technical detail, which is fully described in [43]. The original value C(z_1, z_2) is obtained via the inverse Mellin transform

C(z_1, z_2) = (1/(2iπ)²) ∫∫ C*(s_1, s_2) z_1^{−s_1} z_2^{−s_2} ds_1 ds_2,   (3.15)

thus

C(z, z) = (1/(2iπ)²) ∫_{ℜ(s_1)=ρ_1} ∫_{ℜ(s_2)=ρ_2} C*(s_1, s_2) z^{−s_1−s_2} ds_1 ds_2.   (3.16)
+ O(z^{−M}).   (3.17)
where γ(s) is the residue of ⟨1(I − P(s, s_2))^{−1} | π(s, s_2)⟩ at the point (s, L(s)), that is,

γ(s_1) = − ⟨1 | ξ(s_1, s_2)⟩ ⟨u(s_1, s_2) | π(s_1, s_2)⟩ / (∂λ(s_1, s_2)/∂s_2) |_{s_2 = L(s_1)},

where λ(s_1, s_2) is the eigenvalue which has value 1 at (s, L(s)), and u(s_1, s_2) and ξ(s_1, s_2) are respectively the left and right eigenvectors, with the convention that ⟨ξ(s_1, s_2) | u(s_1, s_2)⟩ = 1.
The expression is implicitly a sum, since the function L(s) is meromorphic, but we retain only the branch where λ(s_1, s_2) is the main eigenvalue of P(s_1, s_2), which contributes the leading term in the expansion of C(z, z). For more details see [43], where the analysis is specific to the case where one of the sources, namely source 2, is memoryless and uniform, i.e. P_2 = (1/|A|) 1 ⊗ 1.
The next step consists in moving the integration line for s_1 from ρ_1 to c_1, which corresponds to the position where the function −s_1 − L(s_1) (actually equal to κ) attains its minimum value. We only consider the case when L(c_1) = c_2 < 0 (the other case is obtained by symmetry). The poles are due to the function Γ(·). The first pole encountered is s_1 = −1, but this pole cancels with the technical arrangement discussed earlier.
We do not work out the simple case, i.e. when c_1 > 0. We meet the second pole at s_1 = 0 and the residue is equal to γ(0) Γ(c_0) z^{−c_0}, since L(0) = c_0. This quantity turns out to be the leading term of C(z, z), since the integration on ℜ(s_1) = c_1 is O(z^κ). This proves Theorem 3.3. When P_2 is logarithmically balanced, there exists ω such that λ(s, L(s) + ikω) = 1 for k ∈ Z, and the terms z^{−c_0+ikω} lead to a periodic contribution.
The difficult case is when −1 < c_1 < 0. In this case, C(z, z) = O(z^κ), but to find precise estimates one must use the saddle point method [23] at s = c_1, since the integration is of the form ∫_{ℜ(s)=c_1} f(s) exp(−(s + L(s)) A) ds, where f(s) = γ(s) Γ(s) Γ(L(s)), and A = log z → ∞. We naturally get an expansion when ℜ(z) → ∞:

C(z, z) = (γ(c_1) e^{κ log z} / √(α_2 log z + β_2)) (1 + O(1/√(log z)))

with α_2 = L''(c_1) and β_2 = γ'(c_1)/γ(c_1). In fact, the saddle point expansion is extendible to any order of 1/√(log z). This proves Theorem 3.4 in the general case. However, in the case when P_1 and P_2 are logarithmically commensurable, the line ℜ(s_1) = c_1 contains an infinite number of saddle points that contribute a doubly periodic additional term.
Example: Assume that we have a binary alphabet $\mathcal{A} = \{a, b\}$ with memory 1, and transition matrices
$$P_1 = \begin{pmatrix} 0.5 & 0.5\\ 0.5 & 0.5\end{pmatrix} \quad\text{and}\quad P_2 = \begin{pmatrix} 0.2 & 0.8\\ 0.8 & 0.2\end{pmatrix}.$$
The numerical analysis gives
$$C_{n,n} = 13.06\, \frac{n^{0.92}}{\sqrt{1.95 \log n + 56.33}} \;\Longrightarrow\; C_{n,n} = 3.99\,(\log 2)\, n.$$
Inspired by the general results about the asymptotic distribution of digital tree and suffix tree parameters, we formulate the following conjecture [40, 45, 74].

Conjecture 3.1
(i) The variance $V_{n,n}$ of the joint complexity of two random texts of the same length $n$ generated by two Markov sources is of order $O(n^{\kappa})$ when $n \to \infty$.
(ii) The normalized distribution of the joint complexity $J_{n,n}$ of these two texts tends to the normal distribution when $n \to \infty$.

Remark: By “normalized distribution” we mean the distribution of $\frac{J_{n,n} - \bar{J}_{n,n}}{\sqrt{V_{n,n}}}$.
The estimate
$$J_{n,n} = \frac{n^{\kappa}}{\sqrt{\alpha \log n + \beta}}\left(1 + Q(\log n) + O\!\left(\frac{1}{\log n}\right)\right),$$
which appears in the case of two different Markov sources, comes from a saddle point analysis. The potential periodic terms $Q(\log n)$ occur in the case where the kernel $K$ shows an infinite set of saddle points. It turns out that the amplitude of the periodic terms is of the order of
$$\left|\Gamma\!\left(\frac{2i\pi}{\log |\mathcal{A}|}\right)\right|,$$
i.e. of the order of $10^{-6}$ for a binary alphabet, but it rises when $|\mathcal{A}|$ increases. For example, when $|\mathcal{A}| \approx 26$, as in the Latin alphabet used in English (including spaces, commas and other punctuation), we get an order within $10^{-1}$.
Figure 3.4 shows the number of common factors of two texts generated by two memoryless sources. One source is uniform over the 27 Latin symbols (such a source is the so-called monkey typing), while the second source follows the statistics of letter occurrences in English. The trajectories are obtained by incrementing each text symbol by symbol. Although not very pronounced, the logarithmic oscillations appear in the trajectories. We compare this with the expression
$$\frac{n^{\kappa}}{\sqrt{\alpha \log n + \beta}}$$
without the oscillation terms, which here is
$$13.06\, \frac{n^{0.92}}{\sqrt{1.95\log n + 73.81}}.$$
In fact it turns out that the saddle point expression has a poor convergence term, since the $O\!\left(\frac{1}{\log n}\right)$ is indeed in $\frac{1}{\alpha\log n + \beta}$, made poorer since the latter does not become smaller than $\frac{1}{\beta}$ for the text length range that we consider. But the saddle point approximation leads to the estimate factor $P_k\big((\alpha\log n + \beta)^{-1}\big)$ of
$$J_{n,n} = \frac{n^{\kappa}}{\sqrt{\alpha\log n + \beta}}\Big(1 + P_k\big((\alpha\log n + \beta)^{-1}\big)\Big)\left(1 + O\!\left(\frac{1}{(\log n)^{k+\frac{1}{2}}}\right) + Q(\log n)\right), \tag{3.18}$$
where $P_k(x) = \sum_{j=1}^{k} A_j x^{j}$ is a specific polynomial of degree $k$. The error term is thus in $(\alpha\log n + \beta)^{-k-\frac{1}{2}}$, but it is not uniform in $k$. Indeed, the expansion
Fig. 3.4 Joint Complexity (y axis) of memoryless English text (x axis) versus monkey typing. The
first order theoretical average is shown in red (cont. line)
Fig. 3.5 Relative error in the saddle point expansion versus order for x = 1/β
CNN, and BBC Breaking. In Twitter the maximum length of the messages is 140 characters. We assume that the sources are Markov sources of finite order. Individual tweets are of arbitrary length. The alphabets of the different languages of the tweet sets are converted to ASCII.
We compute the JC value for pairs of tweet sets in Fig. 3.8. We used tweets from the 2012 Olympic Games and the 2012 United States elections. We took two sets from each of these tweet sets to run our experiments, but first we removed the tags similar to the topic, such as #elections, #USelections, #USelections2012, #Olympics, #Olympics2012, #OlympicGames and so on. As we can see in Fig. 3.8, the JC is significantly higher when we compare tweets on the same subject, for both real and simulated tweets (simulated tweets are generated from a Markov source of order 3 trained on the real tweets). We observe the opposite when we compare different subjects. In the US elections topic, we can see that the JC increases significantly when the number of characters is between 1700 and 1900. This is because users begin to write about and discuss the same subject. We can observe the same in the Olympic Games topic between 6100 and 6300 and between 9500 and 9900. This shows the ability of the method to distinguish information sources. In Fig. 3.9, we plot the JC between the simulated texts and compare it with the theoretical average curves expected by the proposed methodology.
3.8 Suffix Trees
Fig. 3.6 Joint Complexity (y axis) of memoryless English text (x axis) versus monkey typing. The
optimal fifth order theoretical average is shown in red (cont. line)
A Suffix Tree [93] is a compressed trie [77] containing all the suffixes of a given text as keys and their positions in the text as values. It is also referred to as a PAT tree or, in an earlier form, a position tree. The suffix tree allows particularly fast implementations of many important string operations.
The construction of such a tree for a string $S$ of length $n$ takes on average $O(n \log n)$ time and space linear in the length $n$. Once constructed, several operations can be performed quickly, for instance locating a substring in $S$, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern, and other useful operations.
Every node has outgoing edges labeled by symbols in the alphabet of $S$. Thus every node in the Suffix Tree can be identified by the word made of the sequence of labels from the root to the node. The Suffix Tree of $S$ is the set of the nodes which are identified by any factor of $S$.
Suffix Tree Compression: A unitary sequence of nodes is a chain where all nodes have degree 1. If a unitary chain ends at a leaf then it corresponds to a factor
Fig. 3.7 Average Joint Complexity (y axis) of memoryless English text versus monkey typing.
The optimal fifth order theoretical (in red, cont. line) plus periodic terms
which appears only once in $S$. The chain can be compressed into a single node. In this case the concatenation of all the labels of the chain corresponds to a suffix of $S$, and the compressed leaf will contain a pointer to this suffix. The other (internal) nodes of the Compressed Suffix Tree correspond to the factors which appear at least twice in $S$. This is the Compressed Suffix Tree version, whose size is $O(n)$ on average, whereas the uncompressed version is $O(n^2)$.
Similarly, any other unitary chain which does not end at a leaf can also be compressed into a single node; the label of the edge to this node is the factor obtained by concatenating all the labels. This is called the Patricia compression, and in general it gives a very small reduction in size.
The Suffix Tree implementation and the comparison process (ST superposition)
between two Suffix Trees in order to extract the common factors of the text
sequences can be found in the Appendix A.
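For illustration, the Joint Complexity of two short strings can also be obtained naively by intersecting their sets of factors. This quadratic sketch is only a didactic stand-in for the suffix tree superposition of Appendix A; it counts the empty string as a factor, consistently with the node ν shown in the figures below.

def factors(text):
    # All distinct factors (substrings) of `text`, including the empty string.
    subs = {text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)}
    subs.add("")
    return subs

def joint_complexity(s1, s2):
    # Number of distinct common factors of the two strings.
    return len(factors(s1) & factors(s2))

print(joint_complexity("apple", "maple"))     # 9, as for Figs. 3.10 and 3.11
print(joint_complexity("healthy", "sealed"))  # 7, as for Figs. 3.12 and 3.13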
Fig. 3.8 Joint Complexity (y axis) of four tweet sets from the 2012 United States elections and Olympic Games, as a function of the text length n (x axis). The text is an incremental aggregation of tweets from these sets
The proposed tree structure in Appendix A needs $O(n)$ time to be stored and sublinear time for the superposition (finding overlaps). Two main examples with a graph tree representation follow in Figs. 3.10 and 3.11 for the sequences “apple” and “maple”, respectively, which have nine common factors. Figures 3.12 and 3.13 show the Suffix Trees for the sequences “healthy” and “sealed”, which have seven common factors. The construction of the Suffix Trees for the sequences “apple” and “maple”, as well as the comparison between them (ST superposition), is shown in Fig. 3.14.
3.9 Snow Data Challenge

In February 2014 the Snow Data Challenge of the World Wide Web Conference (WWW'14) was announced. Every year the WWW community organizes a different challenge. In 2014 the challenge was about extracting topics in Twitter. The volume of information in Twitter is very high and it is often difficult to extract topics in real time.
Fig. 3.9 Joint Complexity (y axis) of tweet sets from the 2012 United States elections and Olympic Games, as a function of the text length n (x axis), in comparison with the theoretical average curves. The text is an incremental aggregation of tweets from these sets
The task of this challenge was to automatically mine social streams to provide journalists with a set of headlines and complementary information that summarize the most important topics for a number of timeslots (time intervals) of interest.
The Snow Challenge organization provided a common framework to mine the Twitter stream and asked participants to automatically extract topics corresponding to known
Fig. 3.12 Suffix Tree for the sequence “healthy”, where ν is the empty string
Fig. 3.13 Suffix Tree for the sequence “sealed”, where ν is the empty string
events (e.g. politics, sports, entertainment). The crawled data were divided into
timeslots and we had to produce a fixed number of topics for selected timeslots.
Each topic should be in the form of a short headline that summarizes a topic related
to a piece of news occurring during that timeslot, accompanied by a set of tweets,
URLs of pictures (extracted from the tweets), and a set of keywords. The expected
output format was the following: [headline, keywords, tweetIds, picture urls].
We received the third prize in the Challenge, and our method was also discussed for the first prize. The main advantage of the method was its language agnosticism,
Fig. 3.14 Suffix Tree superposition for the sequences $S_1 =$ apple and $S_2 =$ maple
and we were able to report topics in many different languages other than English, e.g. French, Spanish, Korean, etc. The Challenge organization restricted us to reporting topics only in English, since the evaluation was decided to be done in that language, but we decided to report the exact output of the method [81].
First, we collected tweets for 24 h, between Tuesday, Feb. 25, 18:00 and Wednesday, Feb. 26, 18:00 (GMT). The crawl collected more than 1,041,062 tweets between the Unix timestamps 1393351200000 and 1393437600000 and was conducted through the Twitter streaming API by following 556,295 users and also looking for four specific keywords: Syria, terror, Ukraine, bitcoin. The dataset was split into 96 timeslots, where each timeslot contains the tweets of a 15-min interval, starting at 18:00 on Tuesday, February 25, 2014. The challenge then consisted of providing a minimum of one and a maximum of ten different topics per timeslot, along with a headline, a set of keywords and a URL of a relevant image for each detected topic. The test dataset activity and the statistics of the dataset crawl are described more extensively in [81].
Until the present, the main methods used for text classification are based on keyword detection and machine learning techniques, as was extensively described in Chap. 2. Using keywords in tweets has several drawbacks because of wrong spelling or distorted usage of the words (it also requires lists of stop-words to be built for every language), or because of implicit references to previous texts or messages. The machine learning techniques are generally heavy and complex and therefore may not be good candidates for real-time text processing, especially in the case of Twitter, where we have natural language and thousands of tweets per second to process. Furthermore, machine learning processes have to be manually initiated by tuning parameters, and this is one of the main drawbacks for this kind of application,
Fig. 3.15 Fully connected weighted graph of the tweets $t_1^n, \ldots, t_{M_n}^n$ of a timeslot; the edge between $t_i^n$ and $t_j^n$ carries the weight $JC(t_i^n, t_j^n)$
where we want minimal, if any, human intervention. Some other methods use information extracted by visiting the specific URLs in the text, which makes them a heavy procedure, since one may have limited or no access to that information, e.g. because of access rights, or data size and throughput.
In our method [9] we use the Joint Complexity (computed via Suffix Trees) as a metric to quantify the similarity between tweets. This is a significant achievement because we used a general method, adapted to the Snow Data Challenge. According to the dataset described in Sect. 3.9 and in [81] we have $N = 96$ timeslots with $n = 1, \ldots, N$. For every tweet $t_i^n$, where $i = 1, \ldots, M_n$, with $M_n$ being the total number of tweets in the $n$-th timeslot, we build a Suffix Tree, $ST(t_i^n)$, as described in Sect. 3.3. Building a Suffix Tree is an operation that costs linear time and takes $O(m)$ space in memory, where $m$ is the length of the tweet.
Then we compute the Joint Complexity metric as mentioned earlier, $JC(t_i^n, t_j^n)$, of the tweet $t_i^n$ with every other tweet $t_j^n$ of the $n$-th timeslot, where $j = 1, \ldots, M_n$ and $j \neq i$ (by convention we choose $JC(t_i^n, t_i^n) = 0$). For the $N$ timeslots we store the results of the computation in the matrices $T_1, T_2, \ldots, T_N$ of $M_n \times M_n$ dimensions. We represent each matrix $T_n$ by a fully connected weighted graph. Each tweet is a node in the graph and the two-dimensional array $T_n$ holds the weight of each edge, as shown in Fig. 3.15. Then, we calculate the score of each node in our graph by summing the weights of all the edges connected to the node. The node with the highest score is the most representative and central tweet of the timeslot.
$$T_n = \begin{pmatrix}
0 & JC(t_1^n, t_2^n) & JC(t_1^n, t_3^n) & \cdots & JC(t_1^n, t_{M_n}^n)\\
JC(t_2^n, t_1^n) & 0 & JC(t_2^n, t_3^n) & \cdots & JC(t_2^n, t_{M_n}^n)\\
JC(t_3^n, t_1^n) & JC(t_3^n, t_2^n) & 0 & \cdots & JC(t_3^n, t_{M_n}^n)\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
JC(t_{M_n}^n, t_1^n) & JC(t_{M_n}^n, t_2^n) & JC(t_{M_n}^n, t_3^n) & \cdots & 0
\end{pmatrix}$$
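A minimal sketch of this scoring step is given below; the function names are illustrative, and any pairwise JC implementation (for instance the naive one sketched in Sect. 3.8) can be supplied as jc.

def most_central_tweet(tweets, jc):
    # Score of a tweet = sum of the JC weights of its edges in the graph of T_n;
    # the node with the highest score is the most central tweet of the timeslot.
    m = len(tweets)
    scores = [0.0] * m
    for i in range(m):
        for j in range(i + 1, m):
            w = jc(tweets[i], tweets[j])   # symmetric edge weight JC(t_i, t_j)
            scores[i] += w
            scores[j] += w
    return max(range(m), key=lambda k: scores[k])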
3.9.2 Headlines
While running through the list of related tweets we computed the bag-of-words used to construct the list of keywords, and we also checked the original .json data to find a URL pointing to a valid image related to the topic.
We chose to print the top eight topics for each timeslot, which are the heads of the first eight lists of related tweets.
In order to produce a list of keywords per topic, as requested by the Data Challenge, we first removed articles (stop-words), punctuation, special characters, etc., from the bag-of-words constructed from the list of related tweets of each topic. We obtained a list of words, which we then ordered by decreasing frequency of occurrence. Finally, we reported the $k$ most frequent words in a list of keywords $K = [K^1_{1\ldots k}, K^2_{1\ldots k}, \ldots, K^N_{1\ldots k}]$, for the $N$ total timeslots.
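A possible sketch of this keyword extraction step follows; the stop-word list and the tokenization rule are placeholders, since in practice one list per language is needed.

from collections import Counter
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "rt"}  # illustrative only

def top_keywords(related_tweets, k=10):
    # Bag-of-words of the related tweets of one topic, ordered by decreasing frequency.
    words = []
    for tweet in related_tweets:
        for w in re.findall(r"[a-z']+", tweet.lower()):   # drops punctuation and special characters
            if w not in STOP_WORDS:
                words.append(w)
    return [w for w, _ in Counter(words).most_common(k)]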
Apart from the specific implementation for the Snow Data Challenge, the main
benefits of our method are that we can both classify the messages and identify the
growing trends in real time, without having to manually set up lists of keywords for
every language. We can track the information and timelines within a social network
and find groups of users which agree on the same topics.
The official evaluation results of our method in the Snow Data Challenge are included in [81]. Although the dataset that was used for this challenge did not allow us to show this properly, one key advantage of using Joint Complexity is that it can deal with languages other than English [46, 66] without requiring any additional feature.
3.10 Tweet Classification

Joint Complexity was also used for classification in Twitter [65]. The innovation brought by the method is in the use of the information contained in the redirected URLs of tweets. We use this information to augment the similarity measure of
JC, which we call tweet augmentation. It must be noted that, unlike the prior-art solutions described above, this method does not need to access the content behind the redirected URLs.
The method proceeds in two phases: the Training phase (Sect. 3.10.2) and the Run phase (Sect. 3.10.3).
During the Training phase we construct the training databases (DBs) by using Twitter's streaming API with filters for specific keywords. For example, if we want to build a class about politics, then we ask the Twitter API for tweets that contain the word “politics”. Using these requests we build $M$ classes on different topics. Assume that each class contains $N$ tweets (e.g. $M = 5$ classes, politics, economics, sports, technology and lifestyle, of $N = 5000$ tweets each). To each class we allocate $K$ keywords (e.g. the keywords used to populate the class; their set is smaller than the bag-of-words). The tweets come in the .json format, which is the basic format delivered by the Twitter API.
Then we proceed to the URL extraction and tweet augmentation. The body of a tweet (in the .json file format) contains URL information if the original author of the tweet has inserted a link. In general, Twitter applies a hashing code in order to reduce the link size in the tweets delivered to users (this is called URL shortening). However, the original URL comes in the clear in the .json format provided by the Twitter API. While extracting the tweet itself, we get both the hashed URL and the original URL posted by the user. Then, we replace the short URL in the tweet's text by the original URL and we get the augmented tweet.
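A minimal sketch of the augmentation step, assuming the standard Twitter API v1.1 .json payload in which entities.urls carries both the shortened url and the author's expanded_url, could be:

def augment_tweet(tweet_json):
    # Replace every shortened t.co link in the tweet text by the original URL.
    text = tweet_json.get("text", "")
    for url in tweet_json.get("entities", {}).get("urls", []):
        short, original = url.get("url"), url.get("expanded_url")
        if short and original:
            text = text.replace(short, original)
    return text

# Illustrative payload (field values are placeholders)
tweet = {"text": "Results are out https://t.co/abc123",
         "entities": {"urls": [{"url": "https://t.co/abc123",
                                "expanded_url": "https://example.org/results"}]}}
print(augment_tweet(tweet))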
In the next step, we proceed with the Suffix Tree construction of the augmented
tweet. Building a suffix tree is an operation that costs O.n log n/ operations and
takes O.n/ space in memory, where n is the length of the augmented tweet. The
tweet itself does not exceed 140 characters, so the total length of the augmented
tweet is typically smaller than 200 characters.
During the Run phase (shown in Algorithm 2), we get tweets from the Twitter
Streaming Sample API. For every incoming tweet we proceed to its classification by
the following operations: At first, we augment the tweet as described in Sect. 3.10.2
Training phase. Then, we compute the matching metric of the augmented incoming
tweet with each class. The score metric is of the form:
$$MJC \cdot \alpha \;+\; PM \cdot \beta \tag{3.19}$$
Training phase:
constructClasses(M, N);
for i = 1 to M do
  for j = 1 to N do
    t_{i,j}^{URL} ← extractURL(t_{i,j});
    t_{i,j}^{aug} ← tweetAugmentation(t_{i,j}, t_{i,j}^{URL});
    t_{i,j}^{ST} ← suffixTreeConstruction(t_{i,j}^{aug});
  end for
end for
Run phase:
while (t_{json} ← TwitterAPI.getSample()) != null do
  t ← t_{json}.getText();
  t_{URL} ← extractURL(t_{json});
  t_{aug} ← tweetAugmentation(t, t_{URL});
  t_{ST} ← suffixTreeConstruction(t_{aug});
  for i = 1 to M do
    PM_i(t) ← patternMatching(t_{URL});
    JC_i^{avg} ← averageJC(t_{aug});
    JC_i^{max} ← maximum(JC(t_{aug}));
    β ← (JC_i^{max} − JC_i^{avg}) / JC_i^{max};
    α ← 1 − β;
  end for
end while
where MJC is the maximum Joint Complexity (JC) of the augmented incoming tweet over the tweets already present in the class, and PM is the pattern matching score of the incoming tweet over the class keywords. The quantities $\alpha$ and $\beta$ are weight parameters, which depend on the average Joint Complexity, $JC_i^{avg}$, of the $i$-th class, and on the maximum JC (best fit), $JC_i^{max}$. We construct them as follows:
$$\beta = \frac{JC_i^{max} - JC_i^{avg}}{JC_i^{max}}, \qquad \alpha = 1 - \beta.$$
When the average Joint Complexity $JC_i^{avg} = \frac{JC_i^{max}}{2}$, the weights are $\alpha = \beta = 0.5$; if the pattern matching on the URL returns zero, then $\beta = 0$ and $\alpha = 1$.
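The per-class score of (3.19) with these adaptive weights can be sketched as follows; jc_scores (the JC values of the augmented tweet against the tweets of the class) and pattern_matches (the number of class keywords found in the augmented tweet URL) are assumed to be precomputed.

def matching_score(jc_scores, pattern_matches):
    # alpha * MJC + beta * PM, with the adaptive weights defined above.
    mjc = max(jc_scores)
    avg = sum(jc_scores) / len(jc_scores)
    beta = (mjc - avg) / mjc if mjc > 0 else 0.0
    if pattern_matches == 0:      # rule stated above: no URL match => beta = 0, alpha = 1
        beta = 0.0
    alpha = 1.0 - beta
    return alpha * mjc + beta * pattern_matches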
The Joint Complexity between two tweets is the number of their common factors, as defined in language theory, and can be computed efficiently in $O(n)$ operations (sublinear on average) by Suffix Tree superposition. We also compute the Pattern Matching score with the keywords of the $i$-th class, i.e. the number of keywords actually present in the augmented tweet URL. The metric is a combination of MJC and PM.
We then assign an incoming tweet to the class that maximizes the matching metric defined in (3.19), and we also link it to the best fitted tweet in this class, i.e. the tweet that maximizes the Joint Complexity inside this class.
In the case described above, where newly classified tweets are added to the reference class (which is useful for trend sensing), in order to limit the size of each reference class we delete the oldest tweets or the least significant ones (e.g. the ones which got the lowest JC score). This ensures the low cost and efficiency of our method.
The main benefits of our method are that we can both classify the messages and identify the growing trends in real time, without having to manually identify lists of keywords for every language. We can track the information and timelines within a social network and find groups of users that agree or have the same interests, i.e., perform trend sensing.
Fig. 3.16 Share of the classified tweets per class: Politics (39%), Sports (28%), Technology (11%), Economics (5%)
The other and more important difference with DP (without changing the output of the method) is that instead of building Suffix Trees, this time the method constructs a tf-idf bag of words, and then classifies each tweet of the Run phase by selecting the category containing the closest tweet to our test tweet. The notion of closest is defined via Locality Sensitive Hashing based on the Cosine Similarity in a vector space where each possible word is a dimension and its tf-idf score is the coordinate in that dimension. In such a space, when the cosine between two vectors is close to 1, it means that the vectors are pointing in roughly the same direction; in other words, the two tweets represented by the vectors should share a lot of words and thus should probably speak about or refer to the same subject.
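For illustration, the nearest-neighbour step of this baseline can be sketched with an exact cosine computation in tf-idf space; the Locality Sensitive Hashing approximation used in [2] is omitted here for clarity, and the scikit-learn helpers are only one possible implementation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def closest_training_tweet(test_tweet, training_tweets):
    # Index of the training tweet whose tf-idf vector is closest (in cosine) to the test tweet.
    vec = TfidfVectorizer()
    X = vec.fit_transform(training_tweets + [test_tweet])
    sims = cosine_similarity(X[-1], X[:-1])   # 1 x len(training_tweets)
    return int(sims.argmax())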
$$\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}, \qquad \text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}},$$
where, for a class $\mathcal{C}$, true positives are tweets that were classified in $\mathcal{C}$ by both the algorithm and the Ground Truth, false positives are tweets that were classified in $\mathcal{C}$ by the algorithm but in some other class by the Ground Truth, and false negatives are tweets that were classified in $\mathcal{C}$ by the Ground Truth but in some other class by the algorithm.
We also computed the F-score in order to combine precision and recall into a single metric (for faster comparison at a glance):
$$\text{F-score} = 2\,\frac{\text{Recall}\cdot\text{Precision}}{\text{Recall} + \text{Precision}}.$$
A global overview of the results is presented in Table 3.2 where we can see
that, on average, JC outperforms DP, JCurl outperforms DPurl and JCurlPM clearly
outperforms them all.
Looking in more detail at each category, the global tendency is confirmed except for a couple of categories like Technology, where DP has a slightly better Precision but a worse Recall. In the Sports category, on the other hand, the situation is reversed, as DP seems to provide a slightly better Recall. In both cases the differences are too small to be really significant, and what can be noted is that JCurlPM always outperforms all other methods (Fig. 3.17). The mediocre precision obtained by DP and JC in Fig. 3.18 can be explained by the fact that the Economics category was under-represented in the Ground Truth dataset; given that Politics and Economics are often very close subjects, both methods classified a few Politics tweets into the Economics category, thus lowering the Precision. It can be noted that the Recall, on the other hand, is quite good for both methods (Figs. 3.19, 3.20 and 3.21).
3.11 Chapter Summary

In this chapter we studied the Joint Sequence Complexity and its applications, which range from finding similarities between sequences to source discrimination. Markov models describe well the generation of natural text, and we exploited datasets from different natural languages using both short and long sequences. We provided
Fig. 3.17 Precision (left), recall (middle) and F-score (right) for the classified tweets in the class Politics (Ground Truth vs. the classification methods DP, JC, DPurl, JCurl, JCurlPM)
Fig. 3.18 Precision (left), recall (middle) and F-score (right) for the classified tweets in the class Economics (Ground Truth vs. the classification methods DP, JC, DPurl, JCurl, JCurlPM)
Fig. 3.19 Precision (left), recall (middle) and F-score (right) for the classified tweets in the class Sports (Ground Truth vs. the classification methods DP, JC, DPurl, JCurl, JCurlPM)
Fig. 3.20 Precision (left), recall (middle) and F-score (right) for the classified tweets in the class Lifestyle (Ground Truth vs. the classification methods DP, JC, DPurl, JCurl, JCurlPM)
Fig. 3.21 Precision (left), recall (middle) and F-score (right) for the classified tweets in the class Technology (Ground Truth vs. the classification methods DP, JC, DPurl, JCurl, JCurlPM)
models and notation, and presented the theoretical analysis. A study of the expansion asymptotics and periodic terms was also presented. We applied our methodology to real messages from Twitter in the Snow Data Challenge of the World Wide Web Conference in 2014, where we evaluated our proposed methodology on topic detection, classification and trend sensing in Twitter in real time. Our proposed method based on Joint Complexity was praised by a committee of experts and we won the third prize.
Chapter 4
Text Classification via Compressive Sensing
4.1 Introduction
In this chapter we apply the theory of Compressive Sensing (CS) [18] to achieve low dimensional classification. According to Compressive Sensing theory, signals that are sparse or compressible in a suitable transform basis can be recovered from a highly reduced number of incoherent linear random projections, which goes beyond traditional signal processing methods. Traditional methods are dominated by the well-established Nyquist-Shannon sampling theorem, which requires the sampling rate to be at least twice the maximum bandwidth.
We introduce a hybrid classification and tracking method, which extends our recently introduced Joint Complexity method [46, 66], tailored to topic detection and trend sensing in users' tweets. More specifically, we propose a two-step detection, classification and tracking method:
First, we employ the Joint Complexity, already described in detail in the previous chapter as the cardinality of the set of all distinct factors of a given string represented by suffix trees, to perform topic detection. Second, based on the nature of the data, we apply the methodology of Compressive Sensing to perform topic classification.
Let us first describe the main theoretical concepts of CS [4, 12, 18] and how it is applied to the classification problem [61, 68]. Consider a discrete-time signal $x$ in $\mathbb{R}^N$. Such a signal can be represented as a linear combination of a set of basis vectors $\{\psi_i\}_{i=1}^{N}$. Constructing an $N\times N$ basis matrix $\Psi = [\psi_1, \psi_2, \ldots, \psi_N]$, the signal $x$ can be expressed as
$$x = \sum_{i=1}^{N} s_i \psi_i = \Psi s. \tag{4.1}$$
In the presence of noise, the model becomes
$$x = \Psi s + \epsilon, \tag{4.2}$$
with $\epsilon \in \mathbb{R}^N$ being the noise, where $E(\epsilon) = 0$ and $\mathrm{var}(\epsilon) = O(|\Psi s|)$. The efficiency of a CS method for signal approximation or reconstruction depends highly on the sparsity structure of the signal in a suitable transform domain associated with an appropriate sparsifying basis $\Psi \in \mathbb{R}^{N\times N}$. It has been demonstrated [12, 18] that if $x$ is $K$-sparse in $\Psi$ (meaning that the signal is exactly or approximately represented by $K$ elements of this basis), it can be reconstructed from $M = rK \ll N$ non-adaptive linear projections onto a second measurement basis, which is incoherent with the sparsity basis, where $r$ is a small overmeasuring factor ($r > 1$).
The measurement model in the original space-domain is expressed as $g = \Phi x$, where $g \in \mathbb{R}^M$ is the measurement vector and $\Phi \in \mathbb{R}^{M\times N}$ denotes the measurement matrix. By noting that $x$ can be expressed in terms of the basis $\Psi$ as in (4.2), the measurement model has the following equivalent transform-domain representation
$$g = \Phi\,\Psi s + \Phi\,\epsilon. \tag{4.3}$$
Two bases are incoherent when the elements of the first are not represented sparsely by the elements of the second, and vice versa. Since the original vectors of signals, $x$, are not sparse in general, in the following study we focus on the more general case of reconstructing their equivalent sparse representations, $s$, given a low-dimensional set of measurements $g$ and the measurement matrix $\Phi$.
By employing the $M$ compressive measurements and given the $K$-sparsity property in basis $\Psi$, the sparse vector $s$, and consequently the original signal $x$, can be recovered perfectly with high probability by taking a number of different approaches. In the case of noiseless CS measurements the sparse vector $s$ is estimated by solving a constrained $\ell_0$-norm optimization problem of the form
$$\hat{s} = \arg\min_{s}\; \|s\|_0 \quad \text{s.t.} \quad g = \Phi\Psi s,$$
where $\|s\|_0$ denotes the $\ell_0$ norm of the vector $s$, which is defined as the number of its non-zero components. However, it has been proven that this is an NP-complete problem, and the optimization problem can be solved in practice by means of a relaxation process that replaces the $\ell_0$ with the $\ell_1$ norm,
$$\hat{s} = \arg\min_{s}\; \|s\|_1 \quad \text{s.t.} \quad g = \Phi\Psi s, \tag{4.5}$$
which gives $s$ with a relative error of $O\!\big(\tfrac{1}{\sqrt{nN}}\big)$. In [12, 18] it was shown that these two problems are equivalent when certain conditions are satisfied by the two matrices $\Phi$, $\Psi$ (restricted isometry property (RIP)).
The objective function and the constraint in (4.5) can be combined into a single objective function, and several of the most commonly used CS reconstruction methods solve the following problem,
$$\hat{s} = \arg\min_{s}\; \|s\|_1 + \lambda\, \|g - \Phi\Psi s\|_2, \tag{4.6}$$
where $\lambda$ is a regularization factor that controls the trade-off between the achieved sparsity (first term in (4.6)) and the reconstruction error (second term). Commonly used algorithms are based on linear programming [14], convex relaxation [12, 96] and greedy strategies (e.g. Orthogonal Matching Pursuit (OMP) [19, 98]).
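A toy end-to-end example of such a reconstruction, with a Gaussian measurement matrix, an identity sparsifying basis and the OMP solver of scikit-learn, might look as follows; all dimensions are illustrative.

import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
N, K, M = 256, 8, 64                               # ambient dimension, sparsity, measurements

s = np.zeros(N)                                    # K-sparse vector (Psi = identity here)
s[rng.choice(N, size=K, replace=False)] = rng.normal(size=K)

Phi = rng.normal(size=(M, N)) / np.sqrt(M)         # Gaussian measurement matrix
g = Phi @ s                                        # compressed measurements

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=K, fit_intercept=False)
omp.fit(Phi, g)                                    # greedy sparse recovery
s_hat = omp.coef_
print("relative error:", np.linalg.norm(s_hat - s) / np.linalg.norm(s))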
During the training phase, we build our classes as described in Sect. 3.3 and for each class we extract the most central/representative tweet(s) (CTs) based on the Joint Complexity method. The vector $\psi_i^T$ consists of the highest JC scores of
Fig. 4.1 Flowchart of the Joint Complexity part: a Suffix Tree is constructed for each tweet, the Joint Complexity scores are generated, and the reference tweet of the current time interval is identified from the scores, before moving on to the next time interval
the $i$-th CT. The matrix $\Psi_T$ is used as the appropriate sparsifying dictionary for the training phase. Moreover, a measurement matrix $\Phi_i^T$ is associated with each transform matrix $\Psi_i^T$. In the proposed algorithm, a standard Gaussian measurement matrix is employed, with its columns normalized to unit $\ell_2$ norm (Fig. 4.1).
Figure 4.2 shows a flowchart of the preprocessing-phase classification based on Compressive Sensing, in conjunction with the Joint Complexity part shown in the flowchart of Fig. 4.1.
A similar process is followed during the runtime phase. More specifically, we denote by $x_{c,R}$ the Joint Complexity score of the incoming tweet with the CT$_i$ classified at the current class $c$, where $R$ denotes the runtime phase. The runtime CS measurement model is written as
$$g_c = \Phi_R\, x_{c,R}, \tag{4.7}$$
where $\Phi_R$ denotes the corresponding measurement matrix during the runtime phase.
Fig. 4.2 Flowchart of the preprocessing phase: the sparsifying matrix and the measurement matrix are generated, the measurement vectors are computed by applying the measurement matrix to the tweets, and the classes of the tweets are determined from the measurement vectors and reported
The measurement vector $g_c$ is formed for each CT$_i$ according to (4.7) and the reconstruction takes place via the solution of (4.6), with the training matrix $\Psi_T$ being used as the appropriate sparsifying dictionary.
Figure 4.3 shows the flowchart for the runtime phase of the classification based
on Compressive Sensing.
In this work, we rely on the assumption that the CS-based classification method involves the mobile device that collects the tweets from the Twitter API and performs the core CS algorithm. The performance analysis, described in Sect. 4.5, reveals an increased accuracy of the proposed CS-based classification algorithm when compared with the other methods described in Sects. 2.4 and 4.5.
Most of the tracking methods use past state estimates and motion dynamics to refine
the current state estimate determined by the above topic detection and classification
methods. In addition, the dynamic motion model can also be used in conjunction
with the current state estimate to predict the future possible states.
We rely on the assumption that the Compressive Sensing based classification method involves the mobile device that collects the tweets from the Twitter API and performs the core Joint Complexity and Compressive Sensing algorithm.
If we had a model of Joint Complexity to detect the change of topics, we could use the Kalman filter to track a user according to his/her tweets. In this work, we assume that the change of topics is uniform.
Kalman filtering is a well-established method for estimating and tracking mobile
targets. A typical Kalman filter [29] is applied recursively on a given dataset in
two phases: (1) Prediction and (2) Update. The main advantage of this algorithm is
that it can be executed in real time, since it is only based on the currently available
information and the previously estimated state.
Focusing on the problem of classification, the user tweets periodically, and we check that information against the CTs at a specific time interval $\Delta t$. Then, the classification system estimates the user's class at time $t$, which is denoted by $p(t) = [x(t)]^{T}$. Following a Kalman filtering approach, we assume that the process and observation noises are Gaussian, and also that the motion dynamics model is linear. The process and observation equations of a Kalman filter-based model are given by
$$x(t) = F\, x(t-1) + \omega(t), \qquad z(t) = H\, x(t) + v(t),$$
where $x(t) = [x(t), v_x(t)]^{T}$ is the state vector, with $x$ being the correct class in the space (user's tweets) and $v_x(t)$ the tweeting frequency, $z(t)$ is the observation vector, while the matrices $F$ and $H$ define the linear motion model. The process noise $\omega(t) \sim N(0, S)$ and the observation noise $v(t) \sim N(0, U)$ are assumed to be independent zero-mean Gaussian vectors with covariance matrices $S$ and $U$, respectively. The current class of the user is assumed to be the previous one plus the information provided by the JC metric, which is computed as the time interval $\Delta t$ multiplied by the current tweeting speed/frequency.
The steps to update the current estimate of the state vector $x(t)$, as well as its error covariance $P(t)$, during the prediction and update phases are given by the standard Kalman equations, where the superscript “$-$” denotes the prediction at time $t$, and $K(t)$ is the optimal Kalman gain at time $t$.
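For reference, a standard linear Kalman filter with motion model $F$, observation model $H$ and noise covariances $S$ and $U$ uses the following prediction and update steps (the textbook form, restated here in the notation above):
$$\begin{aligned}
&\text{Prediction:} && x^{-}(t) = F\,x(t-1), \qquad P^{-}(t) = F\,P(t-1)\,F^{T} + S,\\
&\text{Update:} && K(t) = P^{-}(t)\,H^{T}\big(H\,P^{-}(t)\,H^{T} + U\big)^{-1},\\
& && x(t) = x^{-}(t) + K(t)\big(z(t) - H\,x^{-}(t)\big),\\
& && P(t) = \big(I - K(t)\,H\big)\,P^{-}(t).
\end{aligned}$$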
The proposed Kalman system exploits not only the highly reduced set of compressed measurements, but also the user's previous class to restrict the classification set. The Kalman filter is applied on the CS-based classification [62], described briefly in Sect. 4.3, to improve the estimation accuracy of the mobile user's path. More specifically, let $s$ be the reconstructed position-indicator vector. Of course, in practice $s$ will not be truly sparse, thus the current estimated position $[x_{CS}]$, or equivalently, cell $c_{CS}$, corresponds to the highest-amplitude index of $s$. Then, this estimate is given as an input to the Kalman filter by assuming that it corresponds to the previous time $t-1$, that is, $x(t-1) = [x_{CS}, v_x(t-1)]^{T}$, and the current position is updated using (4.10). At this point, we would like to emphasize the computational efficiency of the proposed approach, since it is solely based on the use of the very low-dimensional set of compressed measurements given by (4.3), which are obtained via a simple matrix-vector multiplication with the original high-dimensional vector. Given the limited memory and bandwidth capabilities of a small mobile device, the proposed approach can be an effective candidate to achieve accurate information propagation while increasing the device's lifetime. Since $M \ll N$ we have a great complexity improvement given by Compressive Sensing, which reduces the overall complexity of the Kalman filter. Algorithm 3 shows the combination of the JC and CS methods in conjunction with the Kalman filter, and summarizes the proposed information propagation system. Finally, Fig. 4.4 presents the flowchart for generating a tracking model and predicting classes of tweets.
Fig. 4.4 Flowchart of the tracking model: each tweet is classified and the class of the subsequent tweet is predicted, as long as a tweet is available for the user
Fig. 4.5 Classification accuracy measured by F-Score for the DP, DPurl and JC+CS, JCurl+CS methods as a function of the number of measurements (%), by using the $\ell_1$-norm minimization
The classification performance is compared for: (a) Document Pivot (DP), (b) Joint
Complexity with Compressive Sensing (JC+CS), (c) Document Pivot with URL
(DPurl), (d) Joint Complexity and Compressive Sensing with URL (JCurl+CS),
where (c) and (d) include the information of the compressed URL of a tweet
concatenated with the original tweet’s text and extracted from the .json file.
An overview of the results is presented in Table 4.1 where we can see that, on
average, JC with CS outperforms DP, and JCurl with CS outperforms DPurl.
Figure 4.5 compares the classification accuracy of the DP, DPurl and JC+CS, JCurl+CS methods as a function of the number of measurements by using the $\ell_1$-norm minimization. Figure 4.6 compares the reconstruction performance between several widely used norm-based techniques and Bayesian CS algorithms. More specifically, the following methods are employed:¹ (1) $\ell_1$-norm minimization using the primal-dual interior point method (L1EQ-PD), (2) Orthogonal Matching Pursuit
¹For the implementation of methods (1)–(5) the MATLAB codes can be found at: http://sparselab.stanford.edu/, http://www.acm.caltech.edu/l1magic, http://people.ee.duke.edu/~lcarin/BCS.html
Fig. 4.6 Classification accuracy measured by F-Score as a function of the number of measurements (%), by using several reconstruction techniques (L1EQ-PD, OMP, StOMP, Lasso, BCS, BCS-GSM), for the JCurl+CS method
Fig. 4.7 Classification accuracy measured by F-Score as a function of the number of measurements (%), by using the Kalman filter, for the JCurl+CS method, with the BCS and BCS-GSM reconstruction techniques
(OMP), (3) Stagewise Orthogonal Matching Pursuit (StOMP), (4) LASSO, (5) BCS and (6) BCS-GSM [100]. Figure 4.6 shows that BCS and BCS-GSM outperform the other reconstruction techniques, while Fig. 4.7 shows that we achieve a performance improvement of about 10% when using the Kalman filter.
Chapter 5
Extension of Joint Complexity and Compressive Sensing

Abstract In this chapter, the theory of Joint Complexity and Compressive Sensing is extended to three research subjects: (a) classification encryption via compressed permuted measurement matrices, (b) dynamic classification completeness based on Matrix Completion, and (c) encryption based on the Eulerian circuits of original texts. In the first additional research subject we study the encryption property of Compressive Sensing in order to secure the classification process in Twitter without an extra cryptographic layer. The measurements obtained are considered to be weakly encrypted due to their acquisition process, which is verified by the experimental results. In the second additional research subject we study the application of Matrix Completion (MC) to topic detection and classification. Based on the spatial correlation of tweets and the spatial characteristics of the score matrices, we apply a novel framework which extends Matrix Completion to dynamically build complete matrices from a small number of randomly sampled Joint Complexity scores. In the third additional research subject, we present an encryption system based on Eulerian circuits, which destroys the semantics of a text while retaining an almost correct syntax. We study its performance on Markov models, and perform experiments on real text.
5.1 Introduction
In this chapter, the theory of Joint Complexity and Compressive Sensing has been
extended to three research subjects, (a) classification encryption via compressed
permuted measurement matrices, (b) dynamic classification completeness based on
Matrix Completion and (c) encryption based on the Eulerian circuits of original
texts.
In the first additional research subject we study the encryption property of
Compressive Sensing in order to secure the classification process in Twitter without
an extra cryptographic layer. The measurements obtained are considered to be
weakly encrypted due to their acquisition process, which was verified by the
experimental results.
In the second additional research subject we study the application of Matrix
Completion (MC) in topic detection and classification. Based on the spatial correla-
tion of tweets and the spatial characteristics of the score matrices, we apply a novel
$$g = \Phi\,\Psi s + \Phi\,\epsilon, \tag{5.1}$$
where $\lambda$ is a regularization factor that controls the trade-off between the achieved sparsity and the reconstruction error.
During the Preprocessing phase, we build our classes as described in Sect. 3.3, and for each class we extract the most representative tweet(s) (CTs) based on the Joint Complexity method. The vector $\psi_i^T$ consists of the highest JC scores of the $i$-th CT. The matrix $\Psi_T$ is used as the appropriate sparsifying dictionary for the training phase. Moreover, a measurement matrix $\Phi_i^T$ is associated with each transform matrix $\Psi_i^T$, while $T$ denotes the preprocessing phase.
The matrix $\Psi_i^T \in \mathbb{R}^{N_i \times C}$ is used as the appropriate sparsifying dictionary for the $i$-th CT, since in the ideal case the vector of tweets at a given class $j$ received from CT $i$ should be closer to the corresponding vectors of its neighboring classes, and thus it could be expressed as a linear combination of a small subset of the columns of $\Psi_i^T$. Moreover, a measurement matrix $\Phi_i^T \in \mathbb{R}^{M_i \times N_i}$ is associated with each transform matrix $\Psi_i^T$, where $M_i$ is the number of CS measurements. In the proposed algorithm, a standard Gaussian measurement matrix is employed, with its columns normalized to unit $\ell_2$ norm. A random matrix or a PCA matrix could also be used.
A similar process is followed during the runtime phase. More specifically, we denote by $x_{c,R}$ the Joint Complexity score of the incoming tweet with the CT$_i$ classified at the current class $c$, where $R$ denotes the runtime phase. The runtime CS measurement model is written as
$$g_c = \Phi_R\, x_{c,R}, \tag{5.3}$$
where $\Phi_i^R \in \mathbb{R}^{M_{c,i}\times N_{c,i}}$ denotes the corresponding measurement matrix during the runtime phase. In order to overcome the problem of the difference in dimensionality between the preprocessing and run phases, while maintaining the robustness of the reconstruction procedure, we select $\Phi_i^R$ to be a subset of $\Phi_i^T$ with an appropriate number of rows such as to maintain equal measurement ratios.
The measurement vector $g_{c,i}$ is formed for each CT $i$ according to (5.3) and transmitted to the server, where the reconstruction takes place via the solution of (5.2), with the training matrix $\Psi_i^T$ being used as the appropriate sparsifying dictionary. We emphasize at this point the significant conservation of the processing and bandwidth resources of the wireless device, which only computes low-dimensional matrix-vector products to form $g_{c,i}$ $(i = 1,\ldots,P)$ and then transmits a highly reduced amount of data $(M_{c,i} \ll N_{c,i})$. Then, the CS reconstruction can be performed remotely (e.g. at a server) for each CT independently.
Last, we would like to note the assumption that the CS-based classification method involves the mobile device that collects the tweets from the Twitter API and a server that performs the core CS algorithm.
The method consists of two parts: (5.2.3.1) the privacy system, and (5.2.3.2) the key description.
vectors and the correct one is used, as shown in Figs. 5.1 and 5.2 and described extensively in [67, 71].
The device sends the measurement vector $g$ to the server along with $N-1$ false vectors, where the reconstruction takes place. Then, the server uses the information
Fig. 5.1 $N-1$ false vectors plus the correct one. This key, i.e. the sequence of the measurement vectors, reaches the server
Fig. 5.2 Exchange between the wireless device and the server: the device generates $\Psi_i^T$, $\Phi_i^T$ and $\Psi_i^{R,c}$, extracts $\Phi_i^R$ from $\Phi_i^T$, generates the permuted matrix $\Phi_i^{R,p}$, forms $g_{c,i} = \Phi_i^{R,p}\,\Psi_i^{R,c}$ and sends it along with a concatenation of $N-1$ wrong $g$; the server then performs the CS reconstruction
of the topic detection and the representative tweets based on their JC scores, etc.,
and performs classification as described in Sect. 4.3.
¹For the implementation of methods (1)–(5) the MATLAB codes can be found at: http://sparselab.stanford.edu/, http://www.acm.caltech.edu/l1magic, http://people.ee.duke.edu/~lcarin/BCS.html
Fig. 5.3 Classification accuracy measured by F-Score for the DP, DPurl and JC+CS, JCurl+CS methods as a function of the number of measurements (%), by using the $\ell_1$-norm minimization
Fig. 5.4 Classification accuracy measured by F-Score as a function of the number of measurements (%), by using several reconstruction techniques (L1EQ-PD, OMP, StOMP, Lasso, BCS, BCS-GSM), for the JCurl+CS method
Fig. 5.5 Evaluation of the encryption property using BCS and BCS-GSM, for a varying number of permuted lines of $\Phi_i^R$ (x axis: percentage of permutations)
5.3 Dynamic Classification Completeness

In this study we want to address the major challenges of topic detection and classification by using less information, and thus reducing the complexity. First we perform topic detection based on Joint Complexity, and then we introduce the theory of dynamic Matrix Completion (DynMC) to reduce the computational complexity of the JC scores based on the simple Matrix Completion method.
While there are recent works that propose solutions to the exhaustive computations [76], they do not take into account the dynamics of the users and use synthetic data instead. We extend these methods by proposing a dynamic framework [51, 64, 70] that takes advantage of the spatial correlation of tweets, and thus reduces the computational complexity.
5.3.1 Motivation
$$|\Omega| = \frac{k\,(i\,j)}{h}, \tag{5.5}$$
while the sampling map $A_\Omega(M)$ has zero entries at the $j$-th position of the $i$-th timeslot if $s(i,j) \notin \Omega$.
During the runtime phase we need to recover the unobserved measurements of the matrix $s$, denoted by $\bar{s}$, by solving the following minimization problem
$$\min_{\tilde{M}} \|\tilde{M}\|_{*} \quad \text{s.t.} \quad \|A_\Omega(\tilde{M}) - A_\Omega(M)\|_F \le \epsilon, \tag{5.6}$$
where $\|\cdot\|_F$ denotes the Euclidean (Frobenius) norm and $\epsilon$ is the noise parameter. The convex optimization problem in (5.6) can be solved by an interior point solver, e.g. CVX [28], or via singular value thresholding, e.g. FPC and SVT [10], which applies a singular value decomposition and then a projection onto the already known measurements in each step.
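A minimal sketch of the singular value thresholding idea is the following; the shrinkage threshold, step size and iteration count are illustrative and untuned, and the toy low-rank matrix stands in for a JC score matrix.

import numpy as np

def svt_complete(M_obs, mask, tau=2.0, step=1.2, iters=200):
    # M_obs holds the sampled scores (zeros elsewhere); mask is 1 on the sampled entries Omega.
    # Each iteration shrinks the singular values and projects back onto the observed entries.
    Y = np.zeros_like(M_obs)
    X = Y
    for _ in range(iters):
        U, sig, Vt = np.linalg.svd(Y, full_matrices=False)
        X = U @ np.diag(np.maximum(sig - tau, 0.0)) @ Vt   # singular value shrinkage
        Y = Y + step * mask * (M_obs - X)                  # enforce the known entries
    return X

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 30))    # rank-2 toy "score matrix"
mask = (rng.random((30, 30)) < 0.5).astype(float)          # ~50% of the entries observed
A_hat = svt_complete(A * mask, mask)
print("relative error:", np.linalg.norm(A_hat - A) / np.linalg.norm(A))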
In this section, we describe our proposed framework, DynMC, led by the intuition of the spatio-temporal correlations between the JC scores of the several representative tweets. During the training phase we collect tweets and compute the JC scores at each time $t$.
Assume that $C \in \mathbb{R}^{i\times i}$ defines the temporal correlation of the tweets in specific classes, while $\omega$ indicates the noise. The relationship between the JC scores and the representative tweets over time can be expressed as
$$[A(M)]_t = C\,[A(M)]_{t-1} + \omega, \tag{5.7}$$
where $[A(M)]_t$ and $[A(M)]_{t-1} \in \mathbb{R}^{i\times 1}$ represent the JC scores at time $t$ and $t-1$, respectively, received at a specific class.
As mentioned earlier, tweets have a spatial correlation, since closer tweets or classes show similar measurement vectors. In this study we address this problem by introducing a dynamic Matrix Completion technique. The proposed technique is able to recover the unknown matrix at time $t$ by following a random sampling process, and reduces the exhaustive computation of JC scores.
As mentioned in Sect. 5.3.1, subsampling gives the matrix $M_t$ at each time $t$ of the sampling period and we receive a subset $\Omega_t \subset |i|\times|j|$ of the entries of $M_t$, where $|\Omega_t| = k \ll i\,j$. The sampling operator $A_t$ (as defined in Sect. 5.3.1) gives
$$[A_t(M)]_{j,i} = \begin{cases} P_{j,i}, & (j,i)\in\Omega_t,\\ 0, & \text{otherwise},\end{cases} \tag{5.8}$$
where $P_{j,i}$ is the JC score received at $(j,i)$ and $\Omega_t$ is a subset of the complete set of entries $|i|\times|j|$, with $\Omega_t \cup \Omega_t^{C} = |i|\times|j|$.
While the sampling operator $A_t^{C}(M_t)$ collects the unobserved measurements at time $t$, we also define the sampling operator $A_I = A_{t-1} \cap A_t^{C}$ as the intersection with the training measurements of the classes up to time $t-1$.
We need to recover the fingerprint map $M_t$ that will be used during the runtime phase by taking into account the JC scores received in previous time windows. The proposed technique reconstructs the matrix $M_t$ that has the minimum nuclear norm, subject to the values of $M_t$ on $\Omega_t$ and to the sampled values at time $t-1$, which are correlated with the measurements at time $t$ via $C$ according to the model in Eq. (5.7).
Matrix $C$ and the original matrix $M_t$ can be recovered by solving the following optimization problem
$$\min_{\tilde{M}_t,\, C}\; \|\tilde{M}_t\|_{*} \quad \text{s.t.} \quad A_{\Omega_t}(\tilde{M}_t) = A_{\Omega_t}(M_t), \qquad [A(\tilde{M})]_{t} = C\,[A(M)]_{t-1}. \tag{5.9}$$
Considering the topic detection and classification evaluation, the accuracy of the
tested methods was measured with the standard F-score metric, using a ground
truth over the database of more than 1.5M tweets. The Document-Pivot method
was selected to compare with our method, since it outperformed the other state-
of-the-art techniques in a Twitter context as shown in [2]. The tweets are collected
by using specific queries and hashtags and then a bag-of-words is defined, which
uses weights with term frequency-inverse document frequency (tf-idf ). Tweets are
ranked and merged by considering similarity context between existing classified and
incoming tweets. The similarity is computed by using Locality Sensitive Hashing
(LSH) [2], with its main disadvantage being the manual observation of training and
test tweets [6].
The classification performance is compared for: (a) Document Pivot (DP), (b)
Joint Complexity with Compressive Sensing (JC+CS) [12, 69], (c) Document Pivot
with URL (DPurl), (d) Joint Complexity and Compressive Sensing with URL
(JCurl+CS) [12, 69], where (c) and (d) include the information of the compressed
URL of a tweet concatenated with the original tweet’s text; extracted from the .json
file.
Figure 5.6 shows the recovery error of the score matrices $s$ based on the FPC algorithm. We can recover the $s_i$ matrices by using approximately 77% of the original symmetric part with the error of completeness $\to 0$, while addressing the problem of exhaustive computations.
Fig. 5.6 Reconstruction error of $s_i$ (relative recovery error in the Frobenius and spectral norms) by using the FPC algorithm, as a function of the number of entries of the matrix s (total 225M). Approximately 77% of the symmetric part is needed while the error of completeness → 0
²For the implementation of methods (1)–(5) the MATLAB codes can be found at: http://sparselab.stanford.edu/, http://www.acm.caltech.edu/l1magic, http://people.ee.duke.edu/~lcarin/BCS.html
Fig. 5.7 Topic detection accuracy measured by F-Score for the DP, DPurl and JC, JCurl methods as a function of the number of measurements (%) on the recovered matrix s, by using the $\ell_1$-norm minimization on 67% of the measurements
Fig. 5.8 Reconstruction error of $s_i$ by using the FPC algorithm, as a function of the number of entries of the matrix s (total 225M). DynMC has a faster convergence compared to MC as the error of completeness → 0
5.4 Stealth Encryption Based on Eulerian Circuits

In this work, we introduce an encryption system that destroys the semantics of a text while keeping its syntax almost correct. The encryption is almost undetectable, since the text cannot be identified as different from a regular text. This makes the
system resilient to any massive scanning attack. The system is based on the Eulerian circuits of the original text. We provide an asymptotic estimate of the capacity of the system when the original text is a Markovian string, and we aim to make the encrypted text hardly detectable by any automated scanning process.
The practice and study of techniques for secure communication between users has become essential in everyday life. Cryptology plays a major role in the disciplines of computer systems security and telecommunications. The main goal is to provide mechanisms such that two or more users or devices can exchange messages without any third party intervention. Nowadays, cryptography is used in wireless networks, information security in banks, military applications, biometric recognition, smart cards, VPNs, the WWW, satellite TV, databases, VoIP and a plethora of other systems.
In the very early stages, cryptographic mechanisms dealt with the language structure of a message, whereas nowadays cryptography deals with numbers, and is based on discrete mathematics, number theory, information theory, computational complexity, statistics and combinatorics.
Cryptography consists of four basic functions: (a) confidentiality, (b) integrity, (c)
non-repudiation and (d) certification. The encryption and decryption of a message is
based on a cryptographic algorithm and a cryptographic key. Usually the algorithm
is known, so the confidentiality of the encrypted transmitted message is based on the
confidentiality of the cryptographic key. The size of that key is counted in number
of bits. In general, the larger the cryptographic key, the harder the decryption of the
message.
There are two main categories of crypto-systems: (a) classic crypto-systems, which are divided into substitution ciphers and transposition ciphers, and (b) modern crypto-systems, which are divided into symmetric (sharing a common key) and asymmetric (using a public and a private key). Systems based on symmetric cryptography are divided into block ciphers and stream ciphers. In asymmetric cryptographic systems, users know the public key, but the private key is kept secret. Information encrypted with one of the keys can be decrypted only with the other key.
Up to now, the main methods used for text encryption are sophisticated algorithms that transform original data into an encrypted binary stream. The problem is that such streams are easily detectable under an automated massive attack, because they are in a very different format and aspect from non-encrypted data, e.g. texts in natural language. This way, any large-scale data interception system would very easily detect the encrypted texts. The result of this detection is twofold. First, detected encrypted texts can be submitted to massive decryption processes on large computing resources, and finally be deciphered. Second, even when the texts would not eventually be deciphered, the source or the destination of the texts is at least identified as hiding its communications and can therefore be subject to other intrusive investigations. In other words, encryption does not protect against a massive spying attack if encrypted texts are easily detectable.
In our method we use a permutation of the symbols, i.e. the n-grams, of the original text. Doing so leads to the apparent destruction of the semantic information of the text while keeping the text quasi-correct in its syntax, and therefore undetectable under an automated syntactic interception process. To retrieve the original information the n-grams are reordered in their original arrangement.
5.4.1 Background
According to the terminology given earlier, we define the syntax graph G of a text T. We assume a fixed integer r and we denote by $G_r(T) = (V, E)$ the directed syntax graph of the r-grams of text T. There is an edge between two r-grams a and b if b is obtained by shifting the r-gram a by one symbol in the text T. For example, the 3-gram b = “mpl” follows the 3-gram a = “amp” in the text T = “example”. The graph $G_r(T)$ is a multi-graph, since several edges can exist between two r-grams a and b: as many as there are occurrences in the text T where b follows an instance of a.
Figure 5.9 shows the syntax graph of the famous Shakespeare sentence T = “to be or not to be that is the question”, with $|V| = 13$ (for 1-grams) and $|E| = 39$.
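To make the construction concrete, the following sketch (in Python; the function name and the closing-edge convention are our own assumptions, not taken from the scheme itself) builds the directed multigraph of r-grams of a text. With the optional closing edge from the last r-gram back to the first, which is the convention under which Eulerian circuits become possible, the 1-gram graph of the sentence of Fig. 5.9 indeed has 13 vertices and 39 edges.

from collections import defaultdict

def syntax_graph(text, r, close_circuit=True):
    # Vertices are the distinct r-grams of `text`; one edge a -> b for every
    # position where b is obtained by shifting a one symbol to the right.
    # close_circuit (our own convention) adds an edge from the last r-gram
    # back to the first, so that the balanced multigraph admits Eulerian
    # circuits; with it the 1-gram graph below has 13 vertices and 39 edges.
    grams = [text[i:i + r] for i in range(len(text) - r + 1)]
    edges = defaultdict(list)
    for a, b in zip(grams, grams[1:]):
        edges[a].append(b)
    if close_circuit:
        edges[grams[-1]].append(grams[0])
    return edges

# The 3-gram "mpl" follows the 3-gram "amp" in the text "example".
assert "mpl" in syntax_graph("example", 3)["amp"]

sentence = "to be or not to be that is the question"
g = syntax_graph(sentence, 1)
print(len(set(sentence)), sum(len(v) for v in g.values()))  # 13 39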
An Eulerian path in a multi-graph is a path which visits every edge exactly once. An Eulerian circuit or cycle is an Eulerian path which starts and ends on the same vertex and visits every edge exactly once. The number of Eulerian paths in a given multi-graph is easy to compute (in polynomial time) as a sequence of binomials and a determinant, or the BEST theorem [1, 99] can be adapted to enumerate the number of Eulerian circuits [35, 42, 60], which will be explained in Sect. 5.4.3.
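As an illustration of how the BEST theorem can be used, the following sketch (Python; all names are ours, and this is a generic textbook computation rather than the enumeration method of Sect. 5.4.3) counts the Eulerian circuits of a small connected, balanced directed multigraph as ec(G) = tw(G, v) · ∏_w (outdeg(w) − 1)!, where tw(G, v), the number of arborescences rooted at v, is obtained from the matrix-tree theorem as a cofactor of the Laplacian.

from collections import defaultdict
from fractions import Fraction
from math import factorial

def count_eulerian_circuits(edges):
    # BEST theorem for a connected, balanced directed multigraph:
    #   ec(G) = tw(G, v) * prod_w (outdeg(w) - 1)!
    # where tw(G, v) is a cofactor of the out-degree Laplacian L = D - A.
    nodes = sorted({v for e in edges for v in e})
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    out = defaultdict(int)
    lap = [[Fraction(0)] * n for _ in range(n)]
    for a, b in edges:
        out[a] += 1
        lap[idx[a]][idx[a]] += 1
        lap[idx[a]][idx[b]] -= 1
    # Cofactor: delete the row and column of the first vertex, take the determinant.
    minor = [row[1:] for row in lap[1:]]
    det = Fraction(1)
    for col in range(n - 1):
        pivot = next((r for r in range(col, n - 1) if minor[r][col]), None)
        if pivot is None:
            return 0
        if pivot != col:
            minor[col], minor[pivot] = minor[pivot], minor[col]
            det = -det
        det *= minor[col][col]
        for r in range(col + 1, n - 1):
            f = minor[r][col] / minor[col][col]
            for c in range(col, n - 1):
                minor[r][c] -= f * minor[col][c]
    circuits = int(det)
    for v in nodes:
        circuits *= factorial(out[v] - 1)
    return circuits

# The directed triangle a -> b -> c -> a has exactly one Eulerian circuit.
print(count_eulerian_circuits([("a", "b"), ("b", "c"), ("c", "a")]))  # 1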
The originality of the system is that the appearance of the transmitted information, although encoded, is not changed, so that malicious software applications cannot determine whether a flow of information under study is encoded or not. Consequently, being undetected, the encoded information will not be subject to massive deciphering.
Fig. 5.9 Syntax graph (1-grams) of the famous Shakespeare sentence T = “to be or not to be that is the question”, where “␣” denotes the space character
Let the text “Up to now, the main . . . easily detectable”, written in the introduction above, be the text T. We design the syntax graph G(T) with the 4-grams of T, which gives $3.42 \times 10^{42}$ discrete Eulerian circuits. One possible circuit is the following:
T′ = “In In other words encryptions and finally be deciphered, therefore can be
submitted binary stream. This way any large communication does not protectable
under an automated massive investination is twofold. First detect the texts can be
decrypted texts are in very easily detection of this detectable. Up to now, they are
easily detected encryption processes on large scale data into encrypted to massive
deciphered. Second, even when the main methods used form original data intrusive
spying resources, and as hiding attack if encrypted data, e.g. text encrypted aspect
of non-encrypted texts are attack, because the source or texts would very different
for the encryption are easily detect against a massive at least identified algorithms
that transformat and their computing the texts in natural language. The problem is
that such streams are sophisticated texts. The result of the destigation system would
not eventually be subject to other interceptions. In”
In this example, it is clear that the encoded text looks like an English text, and that determining that it is not a normal text would require either manual intervention or a very complex automatic semantic analysis. Even the small text presented in Fig. 5.9 gives $1.19 \times 10^{10}$ discrete Eulerian circuits.
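A rough sketch of the scrambling direction of such a scheme is given below (Python; the helper names and the use of an Eulerian path rather than a circuit are our own simplifications). It builds the r-gram syntax graph of a text, follows one randomized Eulerian path through it with Hierholzer's algorithm, and re-emits a permuted text of the same length. In the actual scheme the traversal choices would be derived from the shared key so that the receiver can invert them; that part is not shown here.

import random
from collections import defaultdict

def rgram_edges(text, r):
    # Directed multigraph of the r-grams of `text` (see Sect. 5.4.1).
    grams = [text[i:i + r] for i in range(len(text) - r + 1)]
    adj = defaultdict(list)
    for a, b in zip(grams, grams[1:]):
        adj[a].append(b)
    return grams[0], adj

def random_eulerian_path(start, adj, rng):
    # Hierholzer's algorithm; shuffling the out-edges of every vertex selects
    # one of the (possibly astronomically many) Eulerian paths at random.
    adj = {v: rng.sample(succ, len(succ)) for v, succ in adj.items()}
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if adj.get(v):
            stack.append(adj[v].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def scramble(text, r, seed=0):
    # Emit the permuted text spelled out by one Eulerian path: the first
    # r-gram in full, then the last symbol of every following r-gram.
    start, adj = rgram_edges(text, r)
    path = random_eulerian_path(start, adj, random.Random(seed))
    return path[0] + "".join(g[-1] for g in path[1:])

original = "to be or not to be that is the question"
print(scramble(original, 2, seed=1))  # same symbols, local n-gram structure preserved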
In general, the block cipher method breaks the initial message to be encrypted into blocks and encrypts every block separately. Usually the block size is 64 or 128 bits. This way the encryption can be adapted to the length of the information to be coded. The encryption is based on a mathematical encryption or cryptographic function $f(x, k_0)$, where $x$ is the block and $k_0$ is a shared key. The result is a new cryptographic block $y$ that usually has the same length as the initial block.
The length of every block has to be large enough in order to avoid dictionary attacks. If the block length is small, then a malicious user can collect pairs of plaintext blocks and their corresponding encrypted blocks to build a dictionary that maps every plaintext block to an encrypted block. Based on that dictionary, every text encrypted with the particular key could be decrypted.
The decryption process is the inverted encryption process $g(y, k_0)$, such that $g(f(x, k_0), k_0) = x$. In that case, a decryption function is used instead of the encryption function. Some classic block cipher algorithms are the Data Encryption Standard [73], Triple DES, which uses three keys instead of one, the Advanced Encryption Standard [16], the Tiny Encryption Algorithm [104] and others [79].
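To make the pair $f(x, k_0)$ / $g(y, k_0)$ tangible, here is the Tiny Encryption Algorithm [104] in Python: it encrypts a 64-bit block under a 128-bit key given as four 32-bit words. This is the standard textbook TEA round function, shown only as an example of a classic block cipher; the scheme described next does not depend on this particular choice.

DELTA = 0x9E3779B9
MASK = 0xFFFFFFFF

def tea_encrypt(block, key, rounds=32):
    # f(x, k0): encrypt a 64-bit block (two 32-bit halves) under a 128-bit key.
    v0, v1 = (block >> 32) & MASK, block & MASK
    s = 0
    for _ in range(rounds):
        s = (s + DELTA) & MASK
        v0 = (v0 + ((((v1 << 4) & MASK) + key[0]) ^ (v1 + s) ^ ((v1 >> 5) + key[1]))) & MASK
        v1 = (v1 + ((((v0 << 4) & MASK) + key[2]) ^ (v0 + s) ^ ((v0 >> 5) + key[3]))) & MASK
    return (v0 << 32) | v1

def tea_decrypt(block, key, rounds=32):
    # g(y, k0): run the rounds in reverse so that g(f(x, k0), k0) = x.
    v0, v1 = (block >> 32) & MASK, block & MASK
    s = (DELTA * rounds) & MASK
    for _ in range(rounds):
        v1 = (v1 - ((((v0 << 4) & MASK) + key[2]) ^ (v0 + s) ^ ((v0 >> 5) + key[3]))) & MASK
        v0 = (v0 - ((((v1 << 4) & MASK) + key[0]) ^ (v1 + s) ^ ((v1 >> 5) + key[1]))) & MASK
        s = (s - DELTA) & MASK
    return (v0 << 32) | v1

k0 = (0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210)
x = 0x0123456789ABCDEF
assert tea_decrypt(tea_encrypt(x, k0), k0) == x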
In our algorithm we assume that there is an encryption function $f(x, N, k_0)$, where $k_0$ is a shared key, that produces an integer $y \in (0, N-1)$ when $x$ is an integer in $(0, N-1)$. Let $g(y, N, k_0)$ be an inverse function, such that $x = g(f(x, N, k_0), N, k_0)$. The function can be made of the concatenation of $m$ block encryptions of size $\ell$ (64 or 128 bits) with $m = \lceil \frac{\log_2 N}{\ell} \rceil$. In this case $f(x, N, k_0) = x_1 y_2 \ldots y_m$ if $x$ is made of the concatenation of blocks $x_1 \ldots x_m$, with $y_i = f(x_i, k_0)$ for all $1 < i \le m$. The first block is left unencrypted in order to have $y < N$ with very high probability.
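The following sketch illustrates the structure of this block-concatenation construction in Python. The block size, the helper names and the toy keyed XOR standing in for a real 64- or 128-bit block cipher are all our own illustrative assumptions; the point is only that $x$ is split into $m$ blocks of $\ell$ bits, the first block is kept in the clear, and every other block is encrypted independently under the shared key $k_0$.

from math import ceil, log2

L = 16  # toy block size in bits; 64 or 128 in a real implementation

def toy_block_encrypt(x, k0):
    # Stand-in for a real block cipher f(x, k0) on L-bit blocks (a keyed XOR,
    # chosen only to keep the sketch short and trivially invertible).
    return x ^ (k0 & ((1 << L) - 1))

def toy_block_decrypt(y, k0):
    return y ^ (k0 & ((1 << L) - 1))

def split_blocks(x, m):
    # x = x1 x2 ... xm, most significant block first.
    return [(x >> (L * (m - 1 - i))) & ((1 << L) - 1) for i in range(m)]

def join_blocks(blocks):
    out = 0
    for b in blocks:
        out = (out << L) | b
    return out

def f(x, N, k0):
    # First block kept in the clear (so that y stays below N with high
    # probability); every other block encrypted under the shared key k0.
    m = ceil(log2(N) / L)
    blocks = split_blocks(x, m)
    return join_blocks(blocks[:1] + [toy_block_encrypt(b, k0) for b in blocks[1:]])

def g(y, N, k0):
    m = ceil(log2(N) / L)
    blocks = split_blocks(y, m)
    return join_blocks(blocks[:1] + [toy_block_decrypt(b, k0) for b in blocks[1:]])

N = 10**12            # illustrative range size
x, k0 = 123456789012, 0xC0FFEE
assert g(f(x, N, k0), N, k0) == x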
Let $r_0$ be an integer; the proposed scheme is described in the following algorithm:
Corollary 5.1 The mutual information rate $\frac{1}{n} I(X_n; Y_n)$ tends to zero when $n \to \infty$.
Proof We already know that $H(Y_n) = H(X_n)$, since $X_n$ and $Y_n$ have the same probability in the Markov process; thus $I(X_n; Y_n) = H(X_n) - H(Y_n \mid X_n)$, where the last term is the conditional entropy of $Y_n$ with respect to $X_n$. Since $H(Y_n \mid X_n) = L_n$ the result holds.
Remark
The order of magnitude of $I(X_n; Y_n)$ that is obtained in the proof of Theorem 5.1 is $O(\log^3 n \sqrt{n})$, but the right order should be $O(\log n)$ with a more careful handling of the error terms. Anyhow, this proves that our scheme efficiently erases the information of $X_n$ while keeping the text in a formal shape.
We will also evaluate the average number $E_n$ of Eulerian circuits in $X_n$. To this end, for $s \ge 0$ we denote by $P(s)$ the matrix made of the coefficients $(p_{ab})^s$ (if $p_{ab} = 0$ we assume $(p_{ab})^s = 0$). We denote by $\lambda(s)$ the main eigenvalue of the matrix $P(s)$.
Theorem 5.2 The average number of Eulerian circuits of string $X_n$ of length $n$ is equivalent to $\alpha\, n^{-1} \lambda^{2n}(\frac{1}{2})$ for some $\alpha > 0$ that can be explicitly computed.
Remark
This average is purely theoretical, since it is impossible to simulate this result when $n$ is large, as in the example text, because the most important and decisive contributions come from strings with extremely low probabilities. In order to prove our theorem we need some notations and lemmas.
Let $k$ be a $V \times V$ integer matrix defined on $A \times A$ which is the adjacency matrix of the syntax graph $G(X_n)$, i.e. the coefficient $k_{ab}$ of $k$ is equal to the number of times symbol $b$ follows symbol $a$ in $X_n$; we say that $k$ is the type of string $X_n$, as defined in [42]. For $(a, b) \in A^2$ we also denote by $\delta_{ab}$ the type of the string $ab$.
For $c \in A$ we denote by $k_c = \sum_{d \in A} k_{cd}$ and $k^c = \sum_{d \in A} k_{dc}$ respectively the outdegree and the indegree of symbol $c$ in the syntax graph. Let $F_n$ be the set of balanced types, i.e. such that $\forall c \in A: k_c = k^c$, and such that $\sum_{(c,d) \in A^2} k_{cd} = n$.
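To make the notation concrete, the following sketch (Python; the helper names are ours) computes the type $k$ of a string and checks the balance condition. A linear string is balanced only once the closing edge $\delta_{ba}$ from its last symbol $b$ to its first symbol $a$ is added, which is why the sums below run over $k + \delta_{ba} \in F_n$.

from collections import Counter

def string_type(x):
    # Type k of string x: k[(c, d)] = number of times symbol d follows c.
    return Counter(zip(x, x[1:]))

def is_balanced(k):
    # Balance condition of F_n: outdegree equals indegree for every symbol.
    out, inc = Counter(), Counter()
    for (c, d), v in k.items():
        out[c] += v
        inc[d] += v
    return all(out[c] == inc[c] for c in set(out) | set(inc))

x = "to be or not to be that is the question"
k = string_type(x)
print(is_balanced(k))                   # False: the ends of a linear string differ

k[(x[-1], x[0])] += 1                   # add the closing edge delta_{ba}
print(is_balanced(k), sum(k.values()))  # True 39  (coefficients now sum to n)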
Lemma 5.1 The set $F_n$ is a lattice of dimension $V^2 - V$ [42]. Its size is $a(n) = O(n^{V^2 - V + 1})$ and we denote by $\omega$ the volume of its elementary cell.
Proof The set of matrices is embedded in the vector space of real matrices, which has dimension $V^2$. The $V$ balance equations are in fact $V - 1$, since any one of them can be deduced from the sum of the others. There is a last equation to specify that all coefficients sum to $n$.
We denote by $F(1)$ the set of balanced real matrices with positive coefficients that sum to 1. For $y$ a real non-negative matrix and $s \ge 0$, we denote
$L(y, s) = \sum_{(c,d) \in A^2} y_{cd} \log \frac{y_c\, p_{cd}^s}{y_{cd}}$ ,
so that
$\frac{\partial}{\partial y_{cd}} L(y, s) = \log \frac{y_c\, p_{cd}^s}{y_{cd}}$   (5.11)
The maximum on $F(1)$ or $F(1) - \frac{1}{n}\delta_{ba}$ must be a member of the vector space generated by the matrix $\mathbf{1}$ (made of all ones) and the matrices $A_j$, $j \in A$; the coefficients of $A_j$ are all zero except $1$ on the $j$th column and $-1$ on the $j$th row, and zero on the diagonal. These matrices are the orthogonal matrices that define $F(1)$ (or a translation of it). Membership of this vector space is equivalent to the fact that $\frac{\partial}{\partial y_{cd}} L(y, s)$ must be of the form $\alpha + z_c - z_d$ for some $\alpha$ and $(z_c)_{c \in A}$, which is equivalent to the fact that
$\frac{y_{cd}}{y_c} = \frac{x_d\, p_{cd}^s}{\lambda\, x_c}$   (5.12)
for some $\lambda$ and $(x_c)_{c \in A}$. From the fact that $\sum_{d \in A} \frac{y_{cd}}{y_c} = 1$ we get $\lambda = \lambda(s)$ and $(x_c)_{c \in A} = (u_c(s))_{c \in A}$, i.e. $\lambda(s)\, x_c = \sum_{d \in A} p_{cd}^s x_d$. Consequently
$L(y, s) = \sum_{(c,d) \in A^2} y_{cd} \log \left( \frac{\lambda\, x_c}{x_d} \right)$   (5.13)
$= \log(\lambda) \sum_{(c,d) \in A^2} y_{cd}$   (5.14)
$\quad + \sum_{c \in A} (y_c - y^c) \log(x_c)$   (5.15)
Thus $L(\tilde{y}(s), s) = \log \lambda(s)$ and $L(\tilde{y}_n(s), s) = (1 - \frac{1}{n}) \log \lambda(s) + \frac{1}{n} \log \frac{u_a(s)}{u_b(s)}$.
To simplify our proofs we will assume in the sequel that all strings start with a fixed initial symbol $a$. The strings starting with $a$ and having type $k$ have probability $\prod_{(c,d) \in A^2} p_{cd}^{k_{cd}}$, which we denote by $P^k$. We denote
$B_k = \prod_{c \in A} \binom{k_c}{(k_{cd})_{d \in A}}$   (5.16)
In passing, $N_k^b P^k$ is the probability that string $X_n$ has type $k$, and we have the identity
$\sum_{b \in A} \sum_{k + \delta_{ba} \in F_n} N_k^b P^k = 1$   (5.18)
$E_k^b = \frac{1}{k_{ba} + 1}\, B_k\, \det{}_{bb}\!\left( I - (k + \delta_{ba})^{*} \right)$   (5.19)
For convenience we will handle only natural logarithms. We have $L_n = \sum_{b \in A} L_n^b$ with
$L_n^b = \sum_{k + \delta_{ba} \in F_n} N_k^b P^k \log E_k^b$   (5.20)
Proof (Proof of Theorem 5.1) Using the Stirling approximation $k! = \sqrt{2 \pi k}\, k^k e^{-k} (1 + O(\frac{1}{k}))$ and defining $\ell(y) = \sum_{cd} y_{cd} \log \frac{y_c}{y_{cd}}$, we have
$L_n^b = \frac{n + O(1)}{n^{(V-1)V/2}} \sum_{k + \delta_{ba} \in F_n} r_b(y) \exp(n L(y, 1))\, \ell(y) + O(\log n)$   (5.21)
From (5.22) we have $L_n^b = L_n^b(B)(1 + O(n^{-1}))$. Since the function $L(y, 1)$ is infinitely differentiable and attains its maximum on $F(1)$, which is zero, at $\tilde{y}_n(1)$, we have for all $y \in F(1) - \frac{1}{n}\delta_{ba}$
Since $\frac{1}{n} F_n$ is a lattice of degree $(V-1)V$ with elementary volume $\omega n^{-V^2 + V}$, and $D_n(\cdot)$ converges to some non-negative quadratic form $D(\cdot)$ on $F(1)$:
$L_n^b(B) = n\, \ell(\tilde{y}(1))\, \frac{r_b(\tilde{y}(1))\, n^{V^2 - V}}{n^{(V-1)V/2}\, \omega} \left( 1 + O\!\left( \frac{1}{\sqrt{n}} \right) \right) \int_{F(1)} e^{-n D(y - \tilde{y}(1))}\, dy^{V(V-1)}$   (5.26)
where $\det(D)$ must be understood as the determinant of the quadratic operator $D$. The same analysis can be done by removing $\log E_k^b$, and since via (5.18) we shall get $\sum_{b \in A} \frac{r_b(\tilde{y}(1))}{\sqrt{\det(D)}} = 1$, we get
$L_n = n\, \ell(\tilde{y}(1)) \left( 1 + O\!\left( \frac{\log^3 n}{\sqrt{n}} \right) \right)$   (5.29)
We terminate the proof of Theorem 5.1 with the fact that $n\, \ell(\tilde{y}(1)) = H(X_n)$ (since $\forall c \in A: u_c(1) = 1$ and $(v_c(1))_{c \in A}$ is the Markov stationary distribution).
Proof (Proof of Theorem 5.2) The proof of Theorem 5.2 proceeds equivalently, except that $E_n = \sum_{b \in A} E_n^b$ with
$E_n^b = \sum_{k + \delta_{ba} \in F_n} N_k^b P^k E_k^b$   (5.30)
The main factor $P^k B_k^2$ leads to a factor $\exp\left( 2 n L\left(y, \frac{1}{2}\right) \right)$. Consideration of the order of the $n$ factors leads to the estimate in $A\, n^{-1}\, \lambda^{2n}(\frac{1}{2})$.
More extensive studies can be found in our later works [37–39, 47].
Figure 5.10 shows $I(X_n; Y_n)$, the discrepancy between $L_n$ and $H(X_n)$, more precisely the mean value of $\log N_k + \log P^k$, versus the string length $n$, when $X_n$ is generated by a Markov process of memory 1 based on the statistics of the syntax graph of the sentence “to be or not to be that is the question” (depicted in Fig. 5.9) [38]. As predicted by the conjecture, it is quite sub-linear and seems to grow as $\log n$. Each point has been simulated 100 times.
In this chapter we presented our work on three additional research subjects: (a) classification encryption via compressed permuted measurement matrices, (b) dynamic classification completeness based on Matrix Completion, and (c) encryption based on the Eulerian circuits of original texts.
In the first additional research subject we studied the encryption property of Compressive Sensing in order to secure the classification process in Twitter without an extra cryptographic layer. First we performed topic detection based on Joint Complexity, and then we employed the theory of Compressive Sensing to classify the tweets by taking advantage of the spatial nature of the problem. The measurements
Fig. 5.10 Mutual information $I(X_n; Y_n)$ versus $n$ for the Markov process of Fig. 5.9. The y axis is the mutual information (in bits); the x axis is the length of the string, up to 10,000
This book introduced and compared two novel topic detection and classification methods based on Joint Complexity and Compressive Sensing. In the first case, joint sequence complexity and its applications were studied, towards finding similarities between sequences up to the discrimination of sources. We exploited datasets from different natural languages, using both short and long sequences. We provided models and notations, presented the theoretical analysis, and applied our methodology to real messages from Twitter, where we evaluated it on topic detection, classification and trend sensing, and performed automated online sequence analysis.
In the second case, the classification problem was reduced to a sparse reconstruction problem in the framework of Compressive Sensing. The dimensionality of the original measurements was reduced significantly via random linear projections onto a suitable measurement basis, while maintaining high classification accuracy. By taking advantage of the weak encryption properties of these random projections, we were able to design a secure classification system without the need for any cryptographic layer.
The empirical experimental evaluation revealed that the methods outperform previous approaches based on bag-of-words and semantic analysis, while the Compressive Sensing based approach achieved superior performance compared to state-of-the-art techniques. We evaluated various Twitter datasets and competed in the Data Challenge of the World Wide Web Conference 2014, which verified the superiority of our method.
Motivated by the philosophy of posting tweets, a hybrid tracking system was also presented, which exploits the efficiency of a Kalman filter in conjunction with the power of Compressive Sensing to track a Twitter user based on his tweets. The experimental evaluation revealed an increased classification performance, while maintaining a low computational complexity.
in which case we simply add the edge at the correct sublevel. If at any point
during this process we reach a leaf node, we need to expand it and create the
corresponding internal nodes as long as the two substrings coincide and sprout
new leaves as soon as they differ.
In order to compute the Joint Complexity of two sequences, we simply need their
two respective Suffix Trees.
Comparing Suffix Trees can be viewed as a recursive process which starts at the root of the trees and walks along both trees simultaneously; a compact sketch of the quantity this walk computes is given after the list below. When comparing subparts of the trees we can face three situations:
• both parts of the trees we are comparing are leaves. A leaf is basically a string representing the suffix. Comparing the common factors of two strings can be done easily by incrementing a counter each time the characters of both strings are equal and stopping the count as soon as they differ. For example, comparing the two suffixes “nalytics” and “nanas” would give a result of 2, as they only share the first two characters, while comparing the suffixes “nalytics” and “ananas” would return 0, as they do not start with the same character.
• one subpart is a leaf while the other subpart is an internal node (i.e. a branch). Comparing a non-leaf node and a leaf is done by walking through the substring of the leaf and incrementing the score as long as there is an edge whose label corresponds to the current character of the leaf. Note that the keys (edge labels) are sorted, so we can stop looking for edges as soon as an edge is sorted after the current character (allowing sublinear computation on average). When we reach a leaf on the subtree side, we just need to compare the two leaves (starting from where we are in the substring and the leaf that we just reached).
• both parts of the trees we are comparing are internal nodes (i.e. we are trying to
compare two branches). Comparing two non-leaf nodes is done by initializing
a current counter to zero and walking through both subtrees while doing the
following at each step:
– first check whether we have reached the end (epsilon) of one of the subtrees,
in which case we can stop comparing and return the current counter.
– then check whether we have reached two leaves at the same time, in which
case we add the current counter to the output of the previously described
method for comparing two leaves.
– check whether we have reached a leaf on one side only, in which case we
add the current counter to the output of the previously described method for
comparing a leaf with an internal node.
– keep walking through subtrees as long as we find identical edges (and call
ourselves recursively), each time incrementing our internal counter.
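Below is a compact reference sketch (Python; names are ours) of the quantity this simultaneous tree walk computes: the number of distinct factors shared by two sequences, together with the leaf-versus-leaf comparison from the first bullet above. It deliberately trades efficiency for clarity: the naive substring-set intersection is quadratic in the text length, whereas the Suffix Tree walk described above obtains the same count far more efficiently.

def common_prefix_length(u, v):
    # Leaf-vs-leaf comparison: count matching characters until the first
    # mismatch, e.g. ("nalytics", "nanas") -> 2 and ("nalytics", "ananas") -> 0.
    count = 0
    for a, b in zip(u, v):
        if a != b:
            break
        count += 1
    return count

def factors(s):
    # All distinct non-empty substrings of s (the factors a Suffix Tree encodes).
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def joint_complexity(s, t):
    # Number of distinct common factors of s and t.
    return len(factors(s) & factors(t))

assert common_prefix_length("nalytics", "nanas") == 2
assert common_prefix_length("nalytics", "ananas") == 0
print(joint_complexity("analytics", "ananas"))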
References
1. Aardenne-Ehrenfest TV, de Bruijn NG (1951) Circuits and trees in oriented linear graphs.
Simon Stevin 28:203–217
2. Aiello LM, Petkos G, Martin C, Corney D, Papadopoulos S, Skraba R, Goker A, Kom-
patsiaris Y, Jaimes A (2013) Sensing trending topics in twitter. IEEE Trans Multimedia
15(6):1268–1282
3. Allan J (2002) Topic detection and tracking: event-based information organization. Kluwer
Academic Publishers, Norwell
4. Baraniuk R (2007) Compressive sensing. IEEE Signal Process Mag 24:118–121
5. Becher V, Heiber PA (2011) A better complexity of finite sequences. In: Abstracts of the 8th international conference on computability, complexity, and randomness, p 7
6. Becker H, Naaman M, Gravano L (2011) Beyond trending topics: real-world event identifica-
tion on twitter. In: 5th international AAAI conference on web and social media
7. Blei DM, Lafferty JD (2006) Dynamic topic models. In: 23rd ACM international conference
on machine learning, New York, pp 113–120
8. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
9. Burnside G, Milioris D, Jacquet P (2014) One day in twitter: topic detection via joint
complexity. In: Proceedings of SNOW 2014 data challenge (WWW’14), Seoul
10. Cai J, Candès E, Shen Z (2010) A singular value thresholding algorithm for matrix
completion. SIAM J Optim 20:1956–1982
11. Candès E, Recht B (2009) Exact matrix completion via convex optimization. Found Comput
Math 9:717–772
12. Candès E, Romberg J, Tao T (2006) Robust uncertainty principles: exact signal reconstruction
from highly incomplete frequency information. IEEE Trans Inform Theory 52:489–509
13. Cataldi M, Caro LD, Schifanella C (2010) Emerging topic detection on twitter based on
temporal and social terms evaluation. In: 10th international workshop on multimedia data
mining, New York, pp 1–10
14. Chen S, Donoho D, Saunders M (1999) Atomic decomposition by basis pursuit. SIAM J Sci
Comput 20(1):33–61
15. Conti M, Delmastro F, Passarella A (2009) Social-aware content sharing in opportunistic
networks. In: 6th annual IEEE communications society conference on sensor, mesh and ad
hoc communications and networks workshops, pp 1–3
16. Daemen J, Rijmen V (2002) The design of Rijndael: AES - the advanced encryption standard.
Springer, Berlin
42. Jacquet P, Szpankowski W (2004) Markov types and minimax redundancy for Markov
sources. IEEE Trans Inform Theory 50(7):1393–1402
43. Jacquet P, Szpankowski W (2012) Joint string complexity for Markov sources. In: 23rd
international meeting on probabilistic, combinatorial and asymptotic methods for the analysis
of algorithms, vol 12, pp 303–322
44. Jacquet P, Szpankowski W (2015) Analytic pattern matching: from DNA to twitter. Cam-
bridge University Press, Cambridge
45. Jacquet P, Szpankowski W, Tang J (2001) Average profile of the Lempel-Ziv parsing scheme
for a Markovian source. Algorithmica 31(3):318–360
46. Jacquet P, Milioris D, Szpankowski W (2013) Classification of Markov sources through joint
string complexity: theory and experiments. In: IEEE international symposium on information
theory (ISIT), Istanbul
47. Jacquet P, Milioris D, Burnside G (2014) Textual steganography - undetectable encryption
based on n-gram rotations. European Patent No. 14306482.2
48. Janson S, Lonardi S, Szpankowski W (2004) On average sequence complexity. Theor Comput
Sci 326:213–227
49. Ji S, Xue Y, Carin L (2008) Bayesian compressive sensing. IEEE Trans Signal Process
56(6):2346–2356
50. Jiao J, Yan J, Zhao H, Fan W (2009) Expertrank: an expert user ranking algorithm in online
communities. In: International conference on new trends in information and service science,
pp 674–679
51. Kondor D, Milioris D (2016) Unsupervised classification in twitter based on joint complexity.
In: International conference on computational social science (ICCSS’16), Chicago, IL
52. Lampos V, Cristianini N (2010) Tracking the flu pandemic by monitoring the social web. In:
International workshop on cognitive information processing (CIP), pp 411–416
53. Lehmann J, Goncalves B, Ramasco JJ, Cattuto C (2012) Dynamical classes of collective
attention in twitter. In: 21st ACM international conference on world wide web (WWW),
New York, pp 251–260
54. Leskovec J, Backstrom L, Kleinberg J (2009) Meme-tracking and the dynamics of the news
cycle. In: 15th ACM international conference on knowledge discovery and data mining
(KDD), New York, pp 497–506
55. Li M, Vitanyi P (1993) Introduction to Kolmogorov Complexity and its Applications.
Springer, Berlin
56. Li H, Wang Y, Zhang D, Zhang M, Chang EY (2008) Pfp: parallel fp-growth for query
recommendation. In: ACM conference on recommender systems, New York, pp 107–114
57. Li G, Li H, Ming Z, Hong R, Tang S, Chua T (2010) Question answering over community
contributed web video. Trans Multimedia 99, 17(4):46–57
58. Markines B, Cattuto C, Menczer F (2009) Social spam detection. In: Proceedings of the 5th
ACM international workshop on adversarial information retrieval on the web, New York,
pp 41–48
59. Mathioudakis M, Koudas N (2010) Twittermonitor: trend detection over the twitter stream. In: International conference on management of data (SIGMOD), New York, pp 1155–1158
60. McKay B, Robinson RW (1995) Asymptotic enumeration of Eulerian circuits in the complete
graph. Combinatorica 4(10):367–377
61. Milioris D (2014) Compressed sensing classification in online social networks. Technical
report, Columbia University, New York
62. Milioris D (2014) Text classification based on joint complexity and compressed sensing.
United States Patent No. 14/540770
63. Milioris D (2016) Classification encryption via compressed permuted measurement matrices.
In: IEEE international workshop on security and privacy in big data (BigSecurity’16),
INFOCOM’16, San Francisco, CA
64. Milioris D (2016) Towards dynamic classification completeness in twitter. In: IEEE European
signal processing conference (EUSIPCO’16), Budapest
65. Milioris D, Jacquet P (2013) Method and device for classifying a message. European Patent
No. 13306222.4
66. Milioris D, Jacquet P (2014) Joint sequence complexity analysis: application to social
networks information flow. Bell Labs Tech J 18(4):75–88
67. Milioris D, Jacquet P (2014) Secloc: encryption system based on compressive sensing
measurements for location estimation. In: IEEE international conference on computer
communications (INFOCOM’14), Toronto
68. Milioris D, Jacquet P (2015) Classification in twitter via compressive sensing. In: IEEE
international conference on computer communications (INFOCOM’15)
69. Milioris D, Jacquet P (2015) Topic detection and compressed classification in twitter. In: IEEE
European signal processing conference (EUSIPCO’15), Nice
70. Milioris D, Kondor D (2016) Topic detection completeness in twitter: Is it possible? In:
International conference on computational social science (ICCSS’16), Chicago
71. Milioris D, Tzagkarakis G, Papakonstantinou A, Papadopouli M, Tsakalides P (2014) Low-
dimensional signal-strength fingerprint-based positioning in wireless lans. Ad Hoc Netw J
Elsevier 12:100–114
72. Murtagh F (1983) A survey of recent advances in hierarchical clustering algorithms. Comput
J 26(4):354–359
73. National Bureau of Standards, U.S. Department of Commerce (1977) Data encryption standard. Washington, DC
74. Neininger R, Rüschendorf L (2004) A general limit theorem for recursive algorithms and
combinatorial structures. Ann Appl Probab 14(1):378–418
75. Niederreiter H (1999) Some computable complexity measures for binary sequences. In: Ding
C, Hellseth T, Niederreiter H (eds) Sequences and their applications. Springer, Berlin, pp 67–
78
76. Nikitaki S, Tsagkatakis G, Tsakalides P (2012) Efficient training for fingerprint based posi-
tioning using matrix completion. In: 20th European signal processing conference (EUSIPCO),
Bucharest, Romania, pp 27–310
77. Nilsson S, Tikkanen M (1998) Implementing a dynamic compressed Trie. In: Proceedings of
2nd workshop on algorithm engineering
78. O’Connor B, Krieger M, Ahn D (2010) Tweetmotif: exploratory search and topic summarization for twitter. In: Cohen WW, Gosling S (eds) 4th international AAAI conference on web and social media. The AAAI Press, Menlo Park
79. Paar C, Pelzl J (2009) Understanding cryptography, a textbook for students and practitioners.
Springer, Berlin
80. Papadopoulos S, Kompatsiaris Y, Vakali A (2010) A graph-based clustering scheme for iden-
tifying related tags in folksonomies. In: 12th international conference on data warehousing
and knowledge discovery, pp 65–76
81. Papadopoulos S, Corney D, Aiello L (2014) Snow 2014 data challenge: assessing the
performance of news topic detection methods in social media. In: Proceedings of the SNOW
2014 data challenge
82. Petrovic S, Osborne M, Lavrenko V (2010) Streaming first story detection with application
to twitter. In: Annual conference of the North American chapter of the association for
computational linguistics, pp 181–189
83. Phuvipadawat S, Murata T (2010) Breaking news detection and tracking in twitter. In:
IEEE/WIC/ACM international conference on web intelligence and intelligent agent technol-
ogy, pp 120–123
84. Porter MF (1997) An algorithm for suffix stripping. In: Readings in information retrieval.
Morgan Kaufmann Publishers Inc., San Francisco, pp 313–316
85. Prakash BA, Seshadri M, Sridharan A, Machiraju S, Faloutsos C (2009) Eigenspokes:
surprising patterns and scalable community chipping in large graphs. In: IEEE international
conference on data mining workshops (ICDMW), pp 290–295
86. Rodrigues EM, Milic-Frayling N, Fortuna B (2008) Social tagging behaviour in community-
driven question answering. In: IEEE/WIC/ACM international conference on web intelligence
and intelligent agent technology, pp 112–119
87. Sakaki T, Okazaki M, Matsuo Y (2010) Earthquake shakes twitter users: real-time event
detection by social sensors. In: Proceedings of the 19th ACM international conference on
world wide web (WWW’10), New York
88. Salton G, McGill MJ (1986) Introduction to modern information retrieval. McGraw-Hill,
New York
89. Sankaranarayanan J, Samet H, Teitler BE, Lieberman MD, Sperling J (2009) Twitterstand:
news in tweets. In: 17th ACM international conference on advances in geographic information
systems, New York, pp 42–51
90. Sayyadi H, Hurst M, Maykov A (2009) Event detection and tracking in social streams. In:
Adar E, Hurst M, Finin T, Glance NS, Nicolov N, Tseng BL (eds) 3rd international AAAI
conference on web and social media. The AAAI Press, Menlo Park
91. Shamma DA, Kennedy L, Churchill EF (2011) Peaks and persistence: modeling the shape
of microblog conversations. In: ACM conference on computer supported cooperative work,
New York, pp 355–358
92. Szpankowski W (2001) Analysis of algorithms on sequences. Wiley, New York
93. Tata S, Hankins R, Patel J (2004) Practical suffix tree construction. In: 30th VLDB
conference, vol 30
94. Teh YW, Jordan MI, Beal MJ, Blei DM (2006) Hierarchical Dirichlet processes. J Am Stat
Assoc 101(476):1566–1581
95. Teh YW, Newman D, Welling M (2007) A collapsed variational Bayesian inference algorithm
for latent Dirichlet allocation. Adv Neural Inf Process Syst 19:1353–1360
96. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B
(Methodol) 58(1):267–288
97. Toninelli A, Pathak A, Issarny V (2011) Yarta: a middleware for managing mobile social
ecosystems. In: International conference on grid and pervasive computing (GPC), Oulu,
Finland, pp 209–220
98. Tropp J, Gilbert A (2007) Signal recovery from random measurements via orthogonal
matching pursuit. IEEE Trans Inform Theory 53:4655–4666
99. Tutte WT, Smith CAB (1941) On unicursal paths in a network of degree 4. Am Math Mon
48:233–237
100. Tzagkarakis G, Tsakalides P (2010) Bayesian compressed sensing imaging using a gaussian
scale mixture. In: 35th international conference on acoustics, speech, and signal processing
(ICASSP’10), Dallas, TX
101. Tzagkarakis G, Milioris D, Tsakalides P (2010) Multiple-measurement Bayesian compres-
sive sensing using GSM priors for doa estimation. In: 35th IEEE international conference on
acoustics, speech and signal processing (ICASSP), Dallas, TX
102. Krebs V (2002) Uncloaking terrorist networks. First Monday 7(4)
103. Weng J, Lee B-S (2011) Event detection in twitter. In: 5th international conference on weblogs
and social media
104. Wheeler DJ, Needham RM (1994) Tea, a tiny encryption algorithm. In: International
workshop on fast software encryption. Lecture notes in computer science, Leuven, Belgium,
pp 363–366
105. White T, Chu W, Salehi-Abari A (2010) Media monitoring using social networks. In: IEEE
second international conference on social computing, pp 661–668
106. Wiil UK, Gniadek J, Memon N (2010) Measuring link importance in terrorist networks. In:
International conference on advances in social networks analysis and mining, pp 9–11
107. Xu X, Yuruk N, Feng Z, Schweiger TAJ (2007) Scan: a structural clustering algorithm for
networks. In: 13th ACM international conference on knowledge discovery and data mining
(KDD), New York, pp 824–833
108. Yang J, Leskovec J (2011) Patterns of temporal variation in online media. In: 4th ACM
international conference on web search and data mining, New York, pp 177–186
109. Ziv J (1988) On classification with empirically observed statistics and universal data
compression. IEEE Trans Inform Theory 34:278–286