Clustering of Facebook Fanpages Using Vector Representation Based on LDA Model
Viet Hoang Phan1, Duy Khanh Ninh1(✉), and Chi Khanh Ninh2
1 The University of Danang, University of Science and Technology, Danang, Vietnam
hoangvietit15@gmail.com, nkduy@dut.udn.vn
2 The University of Danang, Vietnam-Korea University of Information and Communication
Technology, Danang, Vietnam
nkchi@vku.udn.vn
Abstract. Social networks have become an important part of human life. Several
recent studies have used Latent Dirichlet Allocation (LDA) to analyze text corpora
extracted from social platforms and discover underlying patterns in user data.
However, to discover the major contents of a social network (e.g., Facebook) on a
large scale, the available approaches need to collect and process the published data
of every person on the network. This violates privacy rights and consumes
considerable time and resources. This paper tackles the problem by focusing on fan
pages, a class of special accounts on Facebook that have much more impact than those
of regular individuals. We propose a vector representation for Facebook fan pages
that combines LDA-based topic distributions with the interaction indices of their
posts. The interaction index of each post is computed from its numbers of reactions
and comments, and serves as the weight of that post in forming the topic distribution
of a fan page. The proposed representation proves effective in fan page topic mining
and clustering tasks in experiments on a collection of Vietnamese Facebook fan pages.
Including the interaction indices of the posts increases the Silhouette score of fan
page clustering by 9.0% at the optimal number of clusters with the K-means clustering
algorithm. These results will help us build a system that can track trending contents
on Facebook without acquiring individual users’ data.
1 Introduction
Nowadays, social networks have become an essential part of human life. According to
recent research by Statista, more than 3.5 billion people had at least one account
on a social platform in 2019 [1]. The rapid growth in users brings a huge amount of
data. This creates many opportunities for those who can discover patterns in the
user data and put them to meaningful use.
In Vietnam, Facebook is the social network with the largest number of users
[1]. Posts on Facebook can come from individuals (particularly famous figures) or from
organizations, in the form of what is called a fan page. Because posts are readily
shared with the fans (i.e., the people who follow a page), fan pages play an important
role in spreading information, news, and facts on Facebook. If we can model the topics
of posts on popular pages, we have a good chance of finding the trending contents on
Facebook.
In recent years, many studies have used Latent Dirichlet Allocation (LDA) to cluster
scientific documents [2, 3] and news articles [4, 5]. For the analysis of social
network documents, some studies have modeled topics on Twitter [6, 7] or favorite
topics in Facebook posts [8]. This research focuses on modeling Facebook fan pages
by applying topic modeling to their documents (i.e., the fan page’s posts).
In this paper, we propose to combine LDA-based topic modeling of documents with the
interaction index of Facebook posts in order to obtain an effective vector
representation of Facebook fan pages. We then apply this representation to analyze
the topic distribution of each fan page and to find groups of similar fan pages.
The proposed solution clusters the fan pages into subsets more effectively than
modeling with LDA alone. The fan page representation also helps reveal the
similarities between fan pages and gives us an idea of what is happening on Facebook
in a particular period of time.
This paper is organized as follows. Section 2 reviews past studies leading to the
motivation of our work. Section 3 describes our proposed solution. Experiments and
results are given in Sect. 4. Section 5 presents the conclusion and future work of the
current research.
Topic modeling using LDA is not a new technique in Natural Language Processing.
Because LDA is an unsupervised learning model, it is a good technique for document
classification, especially on unlabeled datasets such as the textual data of social
networks. Several studies have taken advantage of this property of LDA to model and
analyze Twitter conversations [6] or the favorite topics of young Thai Facebook users
[8]. The main focus of these studies is modeling and mining topics from the text
corpus of social network users. Their methods help to obtain the topics that a
particular group of users is interested in, for example educational workers and
students at the National University of Colombia [6] or students at Assumption
University in Thailand [8]. However, when we wish to discover the major contents of a
social network on a large scale, such as finding trending topics among the users of a
nation, the available approaches show their limitations: it is almost impossible to
collect the published data (i.e., the posts) of every person on the social network,
because doing so goes against privacy rights and takes a lot of time and resources.
This paper tackles this problem by focusing on a class of special users that have much
more impact on social networks than other individuals, namely key opinion leaders
(KOLs) and popular organizations. A post by a KOL or a well-known organization,
676 V. H. Phan et al.
usually on their fan pages, may shape the opinions, represent the thoughts, and attract
the interest of the many people who follow them on social networks. Therefore, instead
of collecting data from every regular account on a social network, we only need to
collect and analyze data from a number of influential accounts of KOLs and
organizations, thereby capturing the trends of the social network on a large scale
with comparable effectiveness. In this paper, we select the most reputable Facebook
fan pages in Vietnam for topic mining and other data analyses.
3.1 Observations
To know what a Facebook fan page is talking about, we have to analyze the contents of
its posts. In this research, we are only interested in the textual part (called the
document hereinafter) and the interactional part (i.e., reactions and comments) of a
post. If we can extract the topics of every document in the text corpus of a fan page,
we can likely identify the most popular topics of that fan page.
We assume that each document in a fan page’s corpus has its own topic probability
distribution or, in other words, that each document can be represented by a
fixed-dimensional vector determined by its content. For example, if the topic
proportions of a document are 30% sport, 50% technology, and 20% politics, the topic
distribution vector of this document is [0.3; 0.5; 0.2]. In practice, the topic
proportions are not directly observable as in this example but are hidden in the
textual data and must be inferred. We also need a way to combine the topic
distribution vectors of the documents in a fan page’s corpus into a single topic
distribution vector representing the fan page.
After studying the properties of Facebook data, we realized that different posts (and
thus their corresponding documents) have different degrees of importance to the topic
proportions of a fan page. The posts that receive more interactions from users are
likely to contribute more to the composition of the topics of a fan page and to the
distinction among fan pages.
Figure 1 presents the proposed process flow. First, the raw data of fan pages are
collected by crawlers, and the textual data and interactional data are extracted from
them. Next, the textual data is pre-processed by removing page signatures, special
characters, icons, and stop words. The pre-processed documents are then passed through
an LDA-based topic modeling process, which returns the topic distribution vectors of
all documents in each fan page’s corpus. Meanwhile, the interactional data is used to
calculate the interaction indices of all documents of each fan page. Finally, the
vector representation of every fan page is obtained by combining the topic
distribution vectors and interaction indices of all documents of the page. The
combination is described in detail in Sect. 3.4.
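The cleaning step of this pipeline can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors’ implementation: the function name `preprocess`, the example signature string, and the one-word stop list are hypothetical, and splitting on whitespace stands in for proper Vietnamese word segmentation (for which a dedicated tool such as the pyvi toolkit [12] would be a more realistic choice).

```python
import re
import unicodedata


def preprocess(text, signature="", stop_words=frozenset()):
    """Return the cleaned, lower-cased tokens of one post's text."""
    # Compose characters so that Vietnamese diacritics count as word characters.
    text = unicodedata.normalize("NFC", text)
    # Remove the page signature (assumed here to be a fixed string in the post).
    if signature:
        text = text.replace(unicodedata.normalize("NFC", signature), " ")
    # Replace punctuation, special characters, and icons (emoji) with spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    # Simplification: split on whitespace instead of real word segmentation.
    tokens = text.lower().split()
    # Drop stop words.
    return [tok for tok in tokens if tok not in stop_words]


# Hypothetical post, signature, and stop-word list for illustration only.
post = "Hot NEWS: xe va giao thong!!! :) #1 -- Demo Page"
print(preprocess(post, signature="-- Demo Page", stop_words={"va"}))
# ['hot', 'news', 'xe', 'giao', 'thong', '1']
```

The order of operations matters in this sketch: the signature is removed before punctuation is stripped, because stripping first would alter the signature string and make it unmatchable.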
An Effective Vector Representation of Facebook Fan Pages 677
LDA, proposed by Blei et al. in 2003 [9], is a method widely employed for modeling the
topics of documents in a corpus. It assumes that each document in the corpus is
characterized by a probability distribution over topics and that each topic is a
probability distribution over the words in the vocabulary of the corpus. Given a
corpus D, LDA assumes that the corpus can be generated by the following process [10]
(Fig. 2):
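Step 1. For each topic k in K topics, draw a distribution over words in the vocabulary:
    $\phi^{(k)} \sim \mathrm{Dirichlet}(\beta)$
Step 2. For each document d in the corpus, draw a distribution over topics:
    $\theta_d \sim \mathrm{Dirichlet}(\alpha)$
Step 3. For each word position i in document d, draw a topic assignment and then a word:
    $z_i \sim \mathrm{Discrete}(\theta_d)$, $w_i \sim \mathrm{Discrete}(\phi^{(z_i)})$
where K is the number of latent topics in the corpus and α, β are the parameters of
the corresponding Dirichlet distributions.
The above process results in the following joint distribution [10]:
    $p(w, z, \theta, \phi \mid \alpha, \beta) = p(\phi \mid \beta)\, p(\theta \mid \alpha)\, p(z \mid \theta)\, p(w \mid \phi_z)$,   (1)
where w denotes the words of the corpus and z is the topic assignment for each word in w.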
Figure 3 illustrates the process of obtaining the topic distribution vector of a
particular document in a fan page’s text corpus using LDA. The LDA model gives us two
outputs: the cluster of words for each topic and the topic assignment for each word in
the corpus. By counting, we therefore know exactly how many times each topic appears
in the document or, in other words, how many times it has been assigned to a word in
the document. We then obtain the topic distribution vector of the document by
calculating the probability of each topic being assigned to a word in that document.
These document vectors are the inputs from which the topic distribution vector of
each fan page is generated.
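The counting step can be illustrated as follows. This is a small sketch under stated assumptions: the helper name `topic_distribution` is hypothetical, topics are indexed from 0 to K − 1, the word-level assignments are made-up values, and returning a uniform vector for an empty document is a choice of this sketch rather than something stated in the text.

```python
from collections import Counter


def topic_distribution(topic_assignments, num_topics):
    """Turn per-word topic assignments into a topic distribution vector."""
    total = len(topic_assignments)
    if total == 0:
        # Assumption of this sketch: an empty document gets a uniform vector.
        return [1.0 / num_topics] * num_topics
    counts = Counter(topic_assignments)
    # Relative frequency of each topic among the words of the document.
    return [counts[k] / total for k in range(num_topics)]


# Made-up assignments for a 10-word document with K = 3 topics
# (0 = sport, 1 = technology, 2 = politics).
z = [0, 1, 1, 2, 1, 0, 1, 2, 1, 0]
print(topic_distribution(z, 3))
# [0.3, 0.5, 0.2]
```

With these assignments the result matches the [0.3; 0.5; 0.2] example of Sect. 3.1: three words are assigned to sport, five to technology, and two to politics.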
We propose a simple way to calculate the topic distribution vector of each fan page:
summing the topic distribution vectors of all documents in its corpus. However, each
document in the sum should be associated with a weight reflecting how much interaction
its corresponding post receives, as discussed in Sect. 3.1. We therefore additionally
propose to use the number of reactions (e.g., like, haha, angry, etc.) and the number
of comments on each post as the parameters for computing the weight of that post, and
thus of its document, in forming the topic distribution of a fan page.
Let $V = \{v_1, v_2, \ldots, v_n\}$ be the set of topic distribution vectors of the
documents of a fan page’s corpus, where n is the number of documents of the corpus,
and let $t_i$, $r_i$, $c_i$ be respectively the interaction index, the number of
reactions, and the number of comments of the i-th document (1 ≤ i ≤ n). The
interaction index of the i-th document can be calculated as
    $t_i = \eta\, r_i + \mu\, c_i$,   (2)
where η and μ represent the relative importance of reactions and comments,
respectively, in the interaction index. Since comments are considered more valuable
than reactions in terms of the degree of interaction, we experimentally set η = 1 and
μ = 3.
Let P be the topic distribution vector representing a fan page. P can be calculated as
the weighted sum of the topic distribution vectors of all documents of the page, i.e.,
    $P = w_1 v_1 + w_2 v_2 + \ldots + w_i v_i + \ldots + w_n v_n$,   (3)
where the weight of each document is its interaction index normalized over all
documents of the fan page, i.e.,
    $w_i = \dfrac{t_i}{\sum_{j=1}^{n} t_j}$.   (4)
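The weighting scheme can be sketched in Python as follows. The sketch assumes that the interaction index is the linear combination t_i = η r_i + μ c_i of the reaction and comment counts, with the stated values η = 1 and μ = 3, and then applies the normalization of Eq. (4) and the weighted sum of Eq. (3). The function names, the example counts, and the fallback to equal weights when a page has no interactions at all are assumptions of this illustration.

```python
def interaction_index(reactions, comments, eta=1.0, mu=3.0):
    """Interaction index of one post, assumed to be eta * r + mu * c."""
    return eta * reactions + mu * comments


def page_vector(doc_vectors, reactions, comments, eta=1.0, mu=3.0):
    """Weighted sum of document topic vectors, as in Eqs. (3) and (4)."""
    indices = [interaction_index(r, c, eta, mu)
               for r, c in zip(reactions, comments)]
    total = sum(indices)
    n = len(doc_vectors)
    if total == 0:
        # Assumption of this sketch: no interactions means equal weights.
        weights = [1.0 / n] * n
    else:
        # Eq. (4): normalize the interaction indices over the page.
        weights = [t / total for t in indices]
    dim = len(doc_vectors[0])
    # Eq. (3): weighted sum, computed one topic dimension at a time.
    return [sum(w * v[k] for w, v in zip(weights, doc_vectors))
            for k in range(dim)]


# Two made-up documents over K = 3 topics with made-up interaction counts.
docs = [[0.3, 0.5, 0.2], [0.8, 0.1, 0.1]]
P = page_vector(docs, reactions=[10, 30], comments=[10, 10])
print([round(x, 2) for x in P])
# [0.6, 0.26, 0.14]
```

Here the interaction indices are 40 and 60, so the weights are 0.4 and 0.6. Because the weights sum to 1 and every document vector sums to 1, the page vector is again a probability distribution over the topics.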
• Number of topics: K = 20
• Parameters of Dirichlet distributions: α = β = 1/K = 0.05
If the number of topics is too small, there will be little diversity among the topic
distributions of the corpus. Conversely, if the number of topics is too large, the
topics become less distinct and it is difficult to interpret what they are about.
Therefore, we set the number of topics K to 20 in the experiments.
the page based on the topic distribution vectors of its documents. The page’s topic
distribution vector is defined as the weighted sum of the document vectors, as
described in Sect. 3.4. It therefore has the same dimension as the document vectors,
namely 20 (since K = 20).
As an example, the resulting topic distribution vector of the fan page “Báo Đời
Sống Pháp Luật” (Law and Life Journal) is displayed in Fig. 4. As can be observed,
the topic probability distribution attains notable peaks at three topics: Topic 2, a
justice-related topic with keywords such as “cảnh sát” (police), “vụ” (case), and
“điều tra” (investigate); Topic 10, a family-related topic with keywords such as
“mẹ” (mother), “vợ” (wife), and “tiền” (money); and Topic 13, a transportation-related
topic with keywords such as “xe” (vehicle), “đường” (street), and “giao thông”
(traffic). This result is quite reasonable because justice, family, and transportation
are the main concerns of this journal.
Fig. 4. Topic distribution of fan page “Báo Đời Sống Pháp Luật” (Law and Life Journal).
With the resulting vector representations of fan pages, we can group them into
clusters so that the pages in each cluster have similar topic distributions and the
clusters are well separated from each other. We clustered the topic distribution
vectors of all fan pages in the dataset with the K-means clustering algorithm. With
the optimal number of clusters, 12 (see the results in Table 2), we obtained the
following example results.
Table 2. Comparison of Silhouette scores between the two methods of fan page modeling.
# of clusters   LDA with interaction indices   LDA only
2               0.1398                         0.1404
4               0.1606                         0.1821
6               0.1896                         0.1965
8               0.2328                         0.2345
10              0.2669                         0.2370
12              0.3008                         0.2759
14              0.2748                         0.2523
16              0.2665                         0.2623
18              0.2462                         0.2525
20              0.2503                         0.2403
22              0.2756                         0.2279
24              0.2580                         0.2273
26              0.2556                         0.2029
28              0.2324                         0.2322
Cluster 1 includes several fan pages such as “Giải trí TV” (Entertainment TV),
“HTV3 - DreamsTV”, “Kênh Nhạc Việt” (Vietnamese Music Channel), and “VTV Giải trí
VTV6” (VTV Entertainment VTV6). All of these are the pages of entertainment channels
(Fig. 5).
Cluster 2 includes fan pages of news and politics broadcasters such as “BBC
Tiếng Việt” (BBC Vietnamese), “Đài Châu Á Tự Do” (Radio Free Asia), “RFI Tiếng
Việt” (RFI Vietnamese), and “VOA Tiếng Việt” (VOA Vietnamese) (Fig. 6).
As can be seen in Figs. 5 and 6, fan pages with similar topic distributions were
grouped together quite well thanks to their vector representations.
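How K-means groups fan page vectors can be illustrated with a plain-Python sketch of the standard Lloyd iteration. This is only an illustration under stated assumptions: the text does not say which implementation or initialization was used (a library routine such as scikit-learn’s `KMeans` [13] is a plausible choice), the function names are hypothetical, the four page vectors are made up, and the first k points serve as initial centroids purely to keep the example deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100):
    """Cluster points with Lloyd's algorithm; returns (labels, centroids)."""
    # Simplistic deterministic initialization (assumption of this sketch).
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Four made-up fan page vectors over three topics.
pages = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.2, 0.7, 0.1]]
labels, centroids = kmeans(pages, k=2)
print(labels)
# [0, 1, 0, 1]
```

Pages dominated by the same topic end up in the same cluster, which is the behavior observed for the entertainment and news clusters above.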
To quantitatively evaluate the clustering performance, we used the Silhouette score
[14] to measure how well the clusters are separated from each other. The higher the
score, the better the clustering. We compared the clustering performance of two
vector representations of fan pages: our proposed method (LDA-based topic distributions
combined with interaction indices, i.e., each document has a different weight in
Eq. (3)) and the conventional one (LDA-based topic distributions only, i.e., all
documents have the same weight in Eq. (3)). The results in Table 2 show that when the
number of clusters is high enough (more than 8), our proposed method outperforms the
conventional one on Silhouette score in every case except 18 clusters (0.2462 vs.
0.2525). In particular, both fan page representation methods achieve their best
clustering performance when the number of clusters is set to 12. In that case, our
proposed method improves the Silhouette score by 9.0% relative to the conventional
one (0.3008 vs. 0.2759).
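For reference, the evaluation metric and the reported improvement can be reproduced with a short plain-Python sketch. It computes the mean Silhouette score in the sense of Rousseeuw [14], where each point scores (b − a) / max(a, b) with a the mean distance to the other members of its own cluster and b the smallest mean distance to any other cluster. The use of Euclidean distance, the function names, and the toy points are assumptions of this sketch; in practice a library routine such as scikit-learn’s `silhouette_score` would normally be used.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean Silhouette score over all points (Euclidean distance assumed)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the Silhouette score needs at least two clusters")
    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def relative_improvement(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Toy data: two tight, well-separated clusters give a score close to 1.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
print(round(silhouette_score(points, [0, 0, 1, 1]), 3))   # 0.802
# Scores at 12 clusters from Table 2.
print(round(relative_improvement(0.3008, 0.2759), 1))     # 9.0
```

The second call confirms the arithmetic behind the reported figure: (0.3008 − 0.2759) / 0.2759 ≈ 0.0903, i.e., a relative gain of about 9.0%.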
5 Conclusion
In this paper, we have proposed a method to represent a fan page by a vector, using
LDA-based topic modeling on all fan pages in the corpus combined with interaction
index analysis of their posts. Experimental results showed that this representation
can be used to cluster a set of fan pages effectively and achieves better clustering
performance than the conventional representation based on LDA alone. The proposed
vector representation of fan pages also proved effective in identifying hot topics as
well as regular issues posted on Facebook in a fixed period of time. The main benefit
of our approach to fan page modeling and mining is that it helps us follow trending
contents on this social platform on a large scale without collecting the data of
regular individual users. In the future, we will apply other models that focus more on
the segmentation of documents, such as lda2vec [15], to find out how positively or
negatively different fan pages talk about the same topic. We also want to extend the
proposed method to include the time factor, in order to reflect how the relationships
between fan pages change over time.
References
1. Datareportal: Social Media Users by Platform (2019). Available: https://datareportal.com/social-media-users. Accessed 19 Nov 2019
2. Yau, C.-K., Porter, A., Newman, N., Suominen, A.: Clustering scientific documents with topic modeling. Scientometrics 100, 767–786 (2014)
3. Kim, S.-W., Gil, J.-M.: Research paper classification systems based on TF-IDF and LDA schemes. Hum. centric Comput. Inf. Sci. 9(1), 1–21 (2019). https://doi.org/10.1186/s13673-019-0192-7
4. Pengtao, X., Eric, P.X.: Integrating document clustering and topic modeling. In: Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, AUAI Press, Virginia, United States, pp. 694–703 (2013)
5. Gialampoukidis, I., Vrochidis, S., Kompatsiaris, I.: A hybrid framework for news clustering based on the DBSCAN-martingale and LDA. MLDM 2016. LNCS (LNAI), vol. 9729, pp. 170–184. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-41920-6_13
6. Eliana, S., Camilo, M., Raimundo, A.: Topic modeling of Twitter conversations (2018)
7. Zhou, T., Haiy, Z.: A text mining research based on LDA topic modelling. In: International Conference on Computer Science, Engineering and Information Technology, pp. 201–210 (2016)
8. Jiamthapthaksin, R.: Thai text topic modeling system for discovering group interests of Facebook young adult users. In: 2016 2nd International Conference on Science in Information Technology (ICSITech), pp. 91–96. IEEE (2016)
9. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
10. Darling, W.M.: A theoretical and practical implementation tutorial on topic modeling and gibbs sampling. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 642–647 (2011)
11. Social Bakers Vietnamese Statistics. Available: https://www.socialbakers.com/statistics/facebook/pages/total/vietnam. Accessed 19 Nov 2019
12. pyvi toolkit. Available: https://pypi.org/project/pyvi/. Accessed 19 Nov 2019
13. Pedregosa, F., et al.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
14. Rousseeuw, P.J.: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
15. Moody, C.E.: Mixing dirichlet topic models and word embeddings to make lda2vec. arXiv preprint arXiv:1605.02019 (2016)