DOI: 10.1145/3534678.3539449
research-article

Multi-View Clustering for Open Knowledge Base Canonicalization

Published: 14 August 2022

Abstract

Open information extraction (OIE) methods extract large numbers of OIE triples <noun phrase, relation phrase, noun phrase> from unstructured text, and these triples make up large open knowledge bases (OKBs). Noun phrases and relation phrases in such OKBs are not canonicalized, which leads to scattered and redundant facts. OKB canonicalization addresses this by clustering synonymous noun phrases and relation phrases into the same group and assigning each group a unique identifier. It is found that two views of knowledge (i.e., a fact view based on the fact triple and a context view based on the fact triple's source context) provide complementary information that is vital to this task. However, existing works have so far leveraged these two views in isolation. In this paper, we propose CMVC, a novel unsupervised framework that leverages the two views jointly to canonicalize OKBs without the need for manually annotated labels. To achieve this goal, we propose a multi-view CH K-Means clustering algorithm that mutually reinforces the clustering of the view-specific embeddings learned from each view by considering their different clustering qualities. To further enhance canonicalization performance, we propose a training data optimization strategy, in terms of data quantity and data quality respectively in each particular view, that refines the learned view-specific embeddings iteratively. Additionally, we propose a Log-Jump algorithm that predicts the optimal number of clusters in a data-driven way without requiring any labels. We demonstrate the superiority of our framework through extensive experiments on multiple real-world OKB data sets against state-of-the-art methods.
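
As an illustration of what "considering their different clustering qualities" could mean in practice, the sketch below scores a clustering of each view's embeddings with the Calinski-Harabasz index (between-cluster dispersion divided by within-cluster dispersion, each normalized by its degrees of freedom). It assumes that "CH" in "CH K-Means" refers to this index; the abstract does not spell that out. The toy embeddings, the cluster labels, and the names `calinski_harabasz`, `fact_view`, and `context_view` are hypothetical, and the code is not the CMVC algorithm itself.

```python
import numpy as np


def calinski_harabasz(X, labels):
    """Calinski-Harabasz index of a hard clustering (higher is better).

    X:      (n, d) array-like of embeddings for one view.
    labels: length-n cluster assignment, e.g. produced by K-Means.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]
    clusters = np.unique(labels)
    k = clusters.size
    if k < 2 or k >= n:
        raise ValueError("the index needs 2 <= number of clusters < n")

    overall_mean = X.mean(axis=0)
    between = 0.0  # dispersion of cluster centroids around the overall mean
    within = 0.0   # dispersion of points around their own centroid
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        between += members.shape[0] * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((members - centroid) ** 2)

    if within == 0.0:
        return float("inf")  # every point coincides with its centroid
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical toy data: the same four phrases embedded in two views.
# Assumption: both views are clustered into the same two groups.
labels = np.array([0, 0, 1, 1])
fact_view = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
context_view = np.array([[0.0, 0.0], [4.0, 1.0], [6.0, 0.0], [10.0, 1.0]])

scores = {
    "fact": calinski_harabasz(fact_view, labels),
    "context": calinski_harabasz(context_view, labels),
}
# The view whose clusters are tighter and better separated scores higher.
best_view = max(scores, key=scores.get)
print(scores, best_view)
```

In this toy setting the fact view separates the two groups more cleanly, so it receives the higher score. How such scores feed back into the joint clustering is specific to the paper and is not reproduced here.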

Published In

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2022
5033 pages
ISBN:9781450393850
DOI:10.1145/3534678

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. multi-view clustering
  2. open knowledge base canonicalization
  3. training data optimization

Qualifiers

  • Research-article

Funding Sources

  • CAAI-Huawei MindSpore Open Fund
  • National Natural Science Foundation of China
  • CAST

Conference

KDD '22

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Article Metrics

  • Downloads (last 12 months): 101
  • Downloads (last 6 weeks): 11
Reflects downloads up to 24 Jan 2025.

Cited By

  • (2025) CSMDC: Exploring consistently context semantics for multi-view document clustering. Expert Systems with Applications 261 (125386). DOI: 10.1016/j.eswa.2024.125386. Online publication date: Feb-2025
  • (2024) Large Language Models Enable Few-Shot Clustering. Transactions of the Association for Computational Linguistics 12 (321-333). DOI: 10.1162/tacl_a_00648. Online publication date: 5-Apr-2024
  • (2024) Topology-Driven Multi-View Clustering via Tensorial Refined Sigmoid Rank Minimization. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (920-931). DOI: 10.1145/3637528.3672070. Online publication date: 25-Aug-2024
  • (2024) Jointly Canonicalizing and Linking Open Knowledge Base via Unified Embedding Learning. Proceedings of the ACM on Web Conference 2024 (2304-2314). DOI: 10.1145/3589334.3645700. Online publication date: 13-May-2024
  • (2024) Self-supervised multi-view clustering in computer vision. IET Computer Vision 18:6 (709-734). DOI: 10.1049/cvi2.12299. Online publication date: 2-Jul-2024
  • (2024) Open knowledge base canonicalization with multi-task learning. World Wide Web 27:5. DOI: 10.1007/s11280-024-01288-x. Online publication date: 18-Jul-2024
  • (2023) Structuring Information from Plant Morphological Descriptions using Open Information Extraction. Biodiversity Information Science and Standards 7. DOI: 10.3897/biss.7.113055. Online publication date: 21-Sep-2023
  • (2023) Enabling Dataspaces Using Foundation Models: Technical, Legal and Ethical Considerations and Future Trends. 2023 IEEE International Conference on Big Data (BigData) (4712-4721). DOI: 10.1109/BigData59044.2023.10386933. Online publication date: 15-Dec-2023
