research-article

Multi-View Clustering for Open Knowledge Base Canonicalization

Authors:

Yinan Liu

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Pages 1578 - 1588

https://doi.org/10.1145/3534678.3539449

Published: 14 August 2022

Abstract

Open information extraction (OIE) methods extract large numbers of OIE triples <noun phrase, relation phrase, noun phrase> from unstructured text, and these triples compose large open knowledge bases (OKBs). Noun phrases and relation phrases in such OKBs are not canonicalized, which leads to scattered and redundant facts. The task of OKB canonicalization clusters synonymous noun phrases and relation phrases into the same group and assigns them unique identifiers. Two views of knowledge (i.e., a fact view based on the fact triple and a context view based on the fact triple's source context) provide complementary information that is vital to this task. However, existing works have so far leveraged these two views in isolation. In this paper, we propose CMVC, a novel unsupervised framework that leverages the two views jointly to canonicalize OKBs without the need for manually annotated labels. To achieve this goal, we propose a multi-view CH K-Means clustering algorithm that mutually reinforces the clustering of the view-specific embeddings learned from each view by considering their different clustering qualities. To further enhance canonicalization performance, we propose a training data optimization strategy, addressing data quantity and data quality respectively in each view, that refines the learned view-specific embeddings iteratively. Additionally, we propose a Log-Jump algorithm that predicts the optimal number of clusters in a data-driven way without requiring any labels. Extensive experiments on multiple real-world OKB data sets demonstrate the superiority of our framework over state-of-the-art methods.
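
The canonicalization task itself (group synonymous phrases, then give each group one identifier) can be illustrated with a minimal Python sketch. This is not an implementation of CMVC: the cluster labels are written by hand as a stand-in for the output of whatever clustering step is used, and the toy triples, the `NP`/`RP` identifier prefixes, and the helper names `assign_identifiers` and `canonicalize` are all hypothetical.

```python
from collections import defaultdict

# Toy OIE triples (hypothetical data, not taken from the paper's data sets).
triples = [
    ("Barack Obama", "was born in", "Honolulu"),
    ("Obama", "took birth in", "Honolulu"),
    ("Barack Obama", "is the husband of", "Michelle Obama"),
]

# Assumed clustering output: phrase -> cluster label.
# In practice these labels would come from a clustering algorithm.
np_clusters = {"Barack Obama": 0, "Obama": 0, "Honolulu": 1, "Michelle Obama": 2}
rp_clusters = {"was born in": 0, "took birth in": 0, "is the husband of": 1}


def assign_identifiers(clusters, prefix):
    """Give every phrase the identifier of its cluster, e.g. 'NP0'."""
    return {phrase: f"{prefix}{label}" for phrase, label in clusters.items()}


def canonicalize(triples, np_ids, rp_ids):
    """Rewrite triples with identifiers and group the source triples
    that collapse onto the same canonical fact."""
    merged = defaultdict(list)
    for subj, rel, obj in triples:
        key = (np_ids[subj], rp_ids[rel], np_ids[obj])
        merged[key].append((subj, rel, obj))
    return dict(merged)


np_ids = assign_identifiers(np_clusters, "NP")
rp_ids = assign_identifiers(rp_clusters, "RP")
facts = canonicalize(triples, np_ids, rp_ids)

for fact, sources in facts.items():
    print(fact, "<-", len(sources), "source triple(s)")
```

In this toy input the first two triples collapse onto a single canonical fact, which is the kind of redundancy that canonicalization is meant to remove.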

Index Terms

1. Information systems
  1. Data management systems
    1. Information integration
  2. Information systems applications
    1. Data mining
      1. Data cleaning

Published In

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 2022

5033 pages

ISBN: 9781450393850

DOI: 10.1145/3534678

General Chairs:
Aidong Zhang (University of Virginia),
Huzefa Rangwala (Amazon/George Mason University)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

CAAI-Huawei MindSpore Open Fund
National Natural Science Foundation of China
CAST

Conference

KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 14 - 18, 2022

Washington DC, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
