Besides mainstream models for the different KG construction sub-tasks, there are many other innovative or practical attempts targeting different scenarios. In this part, we present further advances to enlighten readers in designing novel construction solutions.
D.1 Rule-based Methods for Knowledge Acquisition
Many early attempts focus on rules to achieve knowledge acquisition or its sub-tasks. Despite their limited accuracy in big-data environments, rule-based methods are practical solutions for quickly extracting massive amounts of raw knowledge. These methods also work in scenarios where high-performance computing is not available.
Rule-based approaches [291] are general solutions for NER. For semi-structured web data, wrapper induction methods generate rule wrappers that interpret semi-structures such as DOM tree nodes and tags to harvest entities from pages. Some rule-based solutions are unsupervised and require no human annotations, such as Omini [292]. For entities in table form, many approaches build on the property-attribute layouts of Wikipedia, such as the rule-based tools [40][254] behind DBpedia and YAGO. For unstructured data, classic NER systems [293] also rely on manually constructed rule sets for pattern matching. Semi-supervised approaches improve rule-based NER by iteratively generating refined new patterns from pattern seeds and scoring, such as bootstrapping-based NER [294].
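To make the pattern-matching idea concrete, the following is a minimal sketch of a rule-based NER system: hand-written regular expressions map surface forms to entity types. The rules, type names, and example sentence are invented for illustration and are not taken from any cited system.

```python
import re

# Hypothetical hand-crafted rules: each pattern is paired with the
# entity type it extracts. Real systems use far larger rule sets.
NER_RULES = [
    (re.compile(r"\b(?:Mr|Mrs|Dr)\.\s+[A-Z][a-z]+\b"), "PERSON"),
    (re.compile(r"\b[A-Z][a-z]+\s+(?:Inc|Corp|Ltd)\.?\b"), "ORG"),
    (re.compile(r"\b\d{1,2}\s+(?:January|February|March)\s+\d{4}\b"), "DATE"),
]

def rule_based_ner(text):
    """Return (span_text, type) pairs matched by any rule."""
    entities = []
    for pattern, etype in NER_RULES:
        for m in pattern.finditer(text):
            entities.append((m.group(0), etype))
    return entities

print(rule_based_ner("Dr. Smith joined Acme Corp on 3 March 2001."))
# → [('Dr. Smith', 'PERSON'), ('Acme Corp', 'ORG'), ('3 March 2001', 'DATE')]
```

The appeal is clear: no training data or heavy computation is needed, which is exactly why such rules remain practical in low-resource settings; the cost is that the rule set must be rewritten for every new domain.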
Rule-focused methods are the earliest attempts at RE tasks over different kinds of data structures, gathering strings that fit hand-crafted templates, e.g., "$PEOPLE is born in $LOCATION." yields ($PEOPLE, born-in, $LOCATION). However, these unsupervised strategies rely on complex linguistic knowledge to label data. Later, researchers concentrated on automatic pattern discovery for triple mining. Semi-supervision is an enlightening strategy to reduce hand-crafted features and data labeling, uncovering more reliable patterns from a small group of annotated samples: DIPRE [295] iteratively extracts patterns from seeds, while the bootstrapping-based KnowItAll [12] and Snowball [296] equip DIPRE with confidence evaluation. Some rule-based models consider more lexical objects for mining. OLLIE [264] incorporates lexical structure patterns with relational dependency paths in texts. MetaPAD [297] combines lexical segmentation and synonymous clustering into meta patterns that are sufficiently informative, frequent, and accurate for relational triples. For semi-structured tables, researchers design table-structure-based rules to acquire relationships arranged in rows, columns, and table headers, such as [298]. Furthermore, some semi-structured extraction systems use distant supervision that tolerates potential errors, directly querying external databases such as DBpedia and Wikipedia to acquire relationships for the entities found in tabular data, such as [70], [299], and [300]. Similarly, Muñoz et al. [300] look up Wikipedia tables to label relationships in tabular forms. Krause et al. [301] also expand rule sets for RE via distant supervision.
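The seed-driven bootstrapping loop described above (DIPRE/Snowball style) can be sketched in a few lines: seed entity pairs induce textual templates, and the templates in turn harvest new pairs. The corpus, seed pair, and slot names are invented; real systems add the confidence scoring this sketch omits.

```python
import re

# Toy corpus and a single seed (head, tail) pair -- both hypothetical.
corpus = [
    "Mozart was born in Salzburg.",
    "Einstein was born in Ulm.",
    "Curie was born in Warsaw.",
]
seeds = {("Mozart", "Salzburg")}

def induce_patterns(corpus, pairs):
    """Turn sentences containing a known pair into slotted templates."""
    patterns = set()
    for sent in corpus:
        for head, tail in pairs:
            if head in sent and tail in sent:
                patterns.add(sent.replace(head, "$HEAD").replace(tail, "$TAIL"))
    return patterns  # e.g. {"$HEAD was born in $TAIL."}

def apply_patterns(corpus, patterns):
    """Match templates against the corpus to extract new pairs."""
    pairs = set()
    for tpl in patterns:
        regex = re.escape(tpl).replace(r"\$HEAD", r"(\w+)").replace(r"\$TAIL", r"(\w+)")
        for sent in corpus:
            m = re.fullmatch(regex, sent)
            if m:
                pairs.add((m.group(1), m.group(2)))
    return pairs

print(apply_patterns(corpus, induce_patterns(corpus, seeds)))
```

One seed suffices to extract all three (person, birthplace) pairs here; in practice the newly found pairs are fed back as seeds for the next iteration, which is why confidence evaluation (as in Snowball) matters to stop semantic drift.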
Rule-based models that perform end-to-end knowledge acquisition are lightweight solutions for specific domains. However, these designs require extra maintenance work if the domain changes.
D.2 More Embedding-based Models
Embedding-based models lay the foundation for KG completion while providing semantic support for the sub-tasks of knowledge acquisition from semi-structured or unstructured data.
More variants of translation embedding (TransE) models for KG completion have been developed to search the entity-relation feature space via mapping matrices, such as TransR [302] and TransH [303]. Meanwhile, researchers also consider tensor-based empirical models for embedding over a completed large graph, such as RESCAL [304] and DistMult [305]. Some knowledge representation models leverage non-linear neural networks to exploit deep knowledge-embedding features for KG completion, such as ConvE [306], M-DCN [307], and TransGate [308]. Unstructured entity descriptions are also incorporated for feature enhancement, as in the DKRL [309] and ConMask [310] models. GCNs are also employed to encode a KG, such as R-GCN [311], W-GCN [312], and COMPGCN [313]; they can further capture neighborhood information through semantic diffusion mechanisms. ProjE [314] projects an entity and a relation into distinctive feature spaces through neural operations to rank candidates for a missing entity. However, when the relation element is missing, a latent vector space of relation candidates cannot be recovered. SENN [315] bridges this disparity-distribution-space semantic gap with a multi-task embedding-sharing strategy that unifies relation, head-entity, and tail-entity link prediction.
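The translation principle behind this family of models is compact: TransE scores a triple (h, r, t) by the distance ||h + r − t||, treating the relation as a translation vector, and link prediction ranks candidates by that score. The two-dimensional embeddings below are toy values chosen for illustration, not trained parameters.

```python
import numpy as np

# Toy, hand-picked embeddings (a trained model would learn these).
emb = {
    "Paris":      np.array([1.0, 0.0]),
    "France":     np.array([1.0, 1.0]),
    "Berlin":     np.array([0.0, 0.0]),
    "capital_of": np.array([0.0, 1.0]),
}

def transe_score(h, r, t):
    """TransE plausibility: lower distance = more plausible triple."""
    return float(np.linalg.norm(emb[h] + emb[r] - emb[t]))

def predict_tail(h, r, candidates):
    """Rank candidate tails for a (head, relation, ?) query."""
    return min(candidates, key=lambda t: transe_score(h, r, t))

print(predict_tail("Paris", "capital_of", ["France", "Berlin"]))  # → France
```

Variants such as TransR and TransH keep this scoring form but first project h and t through relation-specific matrices or hyperplanes, which is what lets them model one-to-many and many-to-many relations that plain TransE conflates.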
As for ET, novel embedding-based models combine global graph-structure features and background knowledge to predict potential entity types via representations. Researchers report that the classical TransE model performs poorly when directly applied to ET tasks. Moon et al. [154] propose the TransE-ET model, which adjusts TransE by optimizing the Euclidean distance between entity and type representations, but it is limited by insufficient entity-type and triple features. Newer solutions construct various graphs to share diversified features of entity-related objects for learning embeddings with entity-type features. PTE [17] reduces data noise via partial-label embedding, constructing a bipartite graph between entities and all their types while connecting entity nodes to their related extracted text features; it then utilizes the background KG by building a type-hierarchy tree with the derived correlation weights. JOIE [316] embeds entity nodes in ontology-view and instance-view graphs, gathering entity types by top-k ranking between entity and type candidates. Likewise, ConnectE [317] maps entities onto their types and learns knowledge triple embeddings. Practical models improving embeddings on heterogeneous graphs for ET tasks (in the Xlore project [42]) also include [318], [319], [320]. We present graph structures for embedding-model-based ET in Figure 22.
Embedding-based models are also critical solutions for entity linking via entity embeddings. LIEGE [321] derives distributional context representations to link entities for web pages. Early researchers [322] leverage bag-of-words (BoW) contextual embeddings of entity mentions and then perform clustering to gather linked entity pairs. Later, Lasek et al. [323] extend the BoW model with linguistic embeddings for EL tasks. Researchers also focus on deep representations for high-performance linking. DSRM [324] employs a deep neural network to exploit semantic relatedness, combining entity descriptions and relationships with type features to obtain deep entity features for linking. EDKate [325] jointly learns low-dimensional embeddings of entities and words in the KB and textual data, capturing intrinsic entity-mention features beyond the BoW model. Furthermore, Ganea and Hofmann [18] introduce an attention mechanism for joint embedding and pass semantic interactions for disambiguation. Le and Titov [19] model the latent relations between mentions in the context for embedding, utilizing mention-wise and relation-wise normalization to score pair-wise coherence.
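The BoW-based linking idea that the deep models above improve upon can be sketched directly: represent the mention's context and each candidate entity's description as word-count vectors and link to the candidate with the highest cosine similarity. The candidate identifiers and descriptions are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical KB candidates for the ambiguous mention "Washington".
candidates = {
    "Washington_(state)": "state pacific northwest united states seattle",
    "George_Washington":  "first president united states general army",
}

def bow(text):
    """Bag-of-words vector: lowercase token -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)  # Counter returns 0 for missing keys
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def link(mention_context):
    """Pick the candidate whose description best matches the context."""
    ctx = bow(mention_context)
    return max(candidates, key=lambda e: cosine(ctx, bow(candidates[e])))

print(link("Washington is a state in the pacific northwest"))
# → Washington_(state)
```

The sketch also shows the BoW model's weakness: only exact token overlap counts ("states" and "state" do not match), which is precisely the gap that embedding-based linkers such as EDKate close.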
Researchers also focus on embedding-based distributional models over multiple semantic structures to handle coreference resolution (CO). Durrett and Klein [326] utilize antecedent representations to enable coreference inference through distributional features. Martschat and Strube [327] explore distributional semantics over mention-pair and tree models to enhance coreference representations, directly picking robust features to optimize the CO task. Chakrabarti et al. [328] further employ the MapReduce framework to resolve anaphoric entity names through query-context similarity.
As for joint RE, novel distributional embedding-based models are proposed to model cross-task distributions and bridge the semantic gaps between NER and RC. Ren et al. [329] propose the knowledge-enhanced distributional CoType model for joint extraction tasks. In this model, entity pairs are first mapped onto their mentions in the KB, then tagged with entity types and all relation candidates provided by the KB. The model learns embeddings of relation mentions with contextualized lexical and syntax features while training embeddings of entity mentions with their types; a contextual relation mention is then derived from its head- and tail-entity embeddings via the TransE [330] model. CoType assumes interactive co-occurrence between entities and their relation labels, filling the distribution discrepancy with knowledge from the external domain and extra type features. Noticeably, this model also effectively mitigates noise in distant-supervised datasets. However, feature engineering and extra KBs are still needed.
D.4 Other Advances
Researchers explore more strategies for flexible NER tasks. Transfer learning shares knowledge between different domains or models. Pan et al. [334] propose Transfer Joint Embedding (TJE) to jointly embed output labels and input samples from different domains, blending intrinsic entity features. Lin et al. [335] apply a neural network with adaptation layers to transfer parameter features from a model pre-trained on a different domain.
Reinforcement learning (RL) lets NER models interact with the domain environment through an agent guided by a reward policy, such as the Markov decision process (MDP)-based model [336] and the Q-network-enhanced model [337]. Noticeably, researchers [338] have also leveraged RL for noise reduction in distant-supervised NER data. Adversarial learning generates counterexamples or perturbations to enforce the robustness of NER models, such as DATNet [339], which imposes perturbations on word representations, and counterexample generators ([340] and [341]). Moreover, active learning, which queries users to annotate selected samples, has also been applied to NER. Shen et al. [342] incrementally choose the most informative samples for NER labeling during training to mitigate the reliance on tagged samples.
Few-shot/zero-shot ET is an intricate and challenging problem. Ma et al. [343] model prototypes of entity-label embeddings for zero-shot fine-grained ET, named Proto-HLE, which combines prototypical features with hierarchical type labels to infer the essential features of a new type. Zhang et al. [344] further propose MZET, which exploits contextual features and word embeddings with a memory network to provide semantic side information for few-shot ET.
More probabilistic models are developed for EL tasks. Guo et al. [345] propose a probabilistic model for unstructured data that leverages the prior probability of an entity, its context, and its name when performing linking. Han et al. [346] employ a reference graph of entities, assuming that entities co-occurring in the same documents should be semantically related.
Joint models for NER and EL reduce the error propagation of pipeline-based entity recognition. NEREL [347] couples NER and EL by ranking extracted mention-entity pairs to exploit the interaction features between entity mentions and their links. Graphical models are also effective designs for combining Named Entity Normalization (NEN) labels, which convert entity mentions into unambiguous forms, e.g., Washington (Person) and Washington (State). Li et al. [348] incorporate EL with NEN using a factor-graph model, forming CRF chains over word entity types and their target nodes. Likewise, MINTREE [349] introduces a tree-based pair-linking model for the collective task.
Cluster-based solutions treat the CO task as pairwise binary classification (co-referring or not). Early cluster models aim at mention-pair features. Soon et al. [350] propose a single-link clustering strategy to detect anaphoric pairs. Recasens et al. [351] further develop a mention-pair-based clustering model that emits either a coreference chain or a singleton leaf. Later, researchers concentrate on entity-based features to exploit complex anaphoric phenomena. Rahman and Ng [352] propose a mention-ranking clustering model that delves into entity characteristics. Stoyanov and Eisner [353] develop agglomerative clustering to merge the best clusters based on entity features.
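The pairwise-classification-plus-single-link view can be made concrete with a small sketch: a pairwise decision function says whether two mentions corefer, and union-find merges linked pairs into entity clusters. The decision function here is a trivial substring heuristic standing in for a learned classifier; the mentions are invented.

```python
# Stand-in for a trained mention-pair classifier (illustration only).
def corefer(m1, m2):
    a, b = m1.lower(), m2.lower()
    return a in b or b in a  # e.g. "Obama" vs "Barack Obama"

def cluster(mentions):
    """Single-link clustering of coreferent mentions via union-find."""
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            if corefer(mentions[i], mentions[j]):
                parent[find(j)] = find(i)

    groups = {}
    for i, m in enumerate(mentions):
        groups.setdefault(find(i), []).append(m)
    return list(groups.values())

print(cluster(["Barack Obama", "Obama", "Michelle", "Michelle Robinson"]))
# → [['Barack Obama', 'Obama'], ['Michelle', 'Michelle Robinson']]
```

Single-link merging is greedy: one false-positive pair fuses two entire entities, which is the weakness that motivated the later entity-level and mention-ranking models cited above.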
Early researchers concentrate on statistical features for fast end-to-end joint RE, such as the Integer Linear Programming (ILP)-based algorithm [354], which resolves entities and relations via a conditional probabilistic model; the semi-Markov chain model [355], which jointly decodes global-level relation features; and MLNs [356], which model joint logic rules over entity labels and relationships. These early attempts deliver prototypes of entity-relationship interaction. However, statistical patterns are not expressive enough for intricate contexts.
Few-shot RC designs also employ feature-augmentation strategies to mitigate data deficiency with novel model designs and background knowledge. Similar to [95], Levy et al. [357] turn zero-shot RC into a reading-comprehension problem, handling unseen labels through a template converter. Soares et al. [358] compose a compound relation representation for each sentence from the BERT contextualized embeddings of the entity pairs and the corresponding sentence. GCNs also deliver extra graph-level features for few-shot learning. Satorras and Estrach [359] propose a novel GCN framework that determines the relation tag of a query sample by computing similarities between nodes. Moreover, Qu et al. [360] employ a posterior distribution over prototypical vectors. Some designs also leverage semi-supervised data augmentation based on metric learning. Neural Snowball [121] (based on RSN) labels the query set via a Siamese network while drawing similar sample candidates from external distant-supervised sample sets to enrich the support set.
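The prototype-based classification that several of these few-shot designs build on can be sketched briefly (in the spirit of prototypical networks, not any one cited system): each relation's prototype is the mean of its support-set embeddings, and a query takes the label of the nearest prototype. The embeddings are toy vectors standing in for encoder output.

```python
import numpy as np

# Hypothetical support set: a few encoded sentences per relation.
support = {
    "born_in":   [np.array([1.0, 0.1]), np.array([0.9, 0.0])],
    "works_for": [np.array([0.0, 1.0]), np.array([0.1, 0.9])],
}

def prototypes(support):
    """One prototype per relation: the mean of its support embeddings."""
    return {rel: np.mean(vecs, axis=0) for rel, vecs in support.items()}

def classify(query_vec, protos):
    """Label a query with the relation of the nearest prototype."""
    return min(protos, key=lambda r: np.linalg.norm(query_vec - protos[r]))

protos = prototypes(support)
print(classify(np.array([0.95, 0.05]), protos))  # → born_in
```

Metric-learning approaches such as Neural Snowball replace the plain Euclidean distance with a learned (Siamese) similarity, but the classify-by-nearest-representative structure is the same.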
Many early attempts develop random-walk models for relation-path reasoning that infer relational logic paths in a latent-variable graphical model. The Path-Ranking Algorithm (PRA) [361] generates a feature matrix to sample potential relation paths. However, feature sparsity in the graph impedes random-walk approaches. Semantic-enrichment strategies are proposed to mitigate this bottleneck, such as inducing vector-space similarity [362] and clustering associated relations [363].
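The core of PRA-style path features can be illustrated with a toy graph: enumerate the bounded-length relation paths connecting an entity pair; the multiset of path types is the feature vector a downstream classifier would consume (the random-walk probabilities that weight these features in real PRA are omitted). The graph and entities are invented.

```python
from collections import defaultdict

# Toy KG: node -> [(relation, neighbor)]. All triples are hypothetical.
graph = defaultdict(list)
for h, r, t in [
    ("Alice", "born_in", "Paris"),
    ("Paris", "located_in", "France"),
    ("Alice", "nationality", "France"),
]:
    graph[h].append((r, t))

def relation_paths(src, dst, max_len=2, path=()):
    """Yield every relation-label sequence from src to dst up to max_len hops."""
    if src == dst and path:
        yield path
    if len(path) >= max_len:
        return
    for rel, nxt in graph[src]:
        yield from relation_paths(nxt, dst, max_len, path + (rel,))

print(sorted(relation_paths("Alice", "France")))
# → [('born_in', 'located_in'), ('nationality',)]
```

That the path (born_in, located_in) co-occurs with the direct relation nationality is exactly the kind of regularity PRA learns; the sparsity problem arises because most entity pairs in a large KG share no short paths at all.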
Early attempts aim at the unique attributes of entities for entity matching. Many models apply distance-based approaches to distributional representations of entity descriptions or definitions. VCU [364] proposes first-order and second-order vector models that embed the description words of an entity pair to comprehensively measure their conceptual distance. TALN [365] leverages sense-based embeddings derived from BabelNet to combine the definitional descriptions of words: it first generates the embedding of each filtered definition word, combined with POS tags and syntax features via BabelNet, then averages them into a centroid sense to find the best matching candidates. String-similarity-based models for entity matching also include TF-IDF [366] and I-Sub [367].
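A minimal sketch of the TF-IDF similarity idea for entity matching: entity descriptions are compared by cosine similarity over TF-IDF weights, so shared rare tokens count for more than common ones. The entity IDs and descriptions are invented.

```python
import math
from collections import Counter

# Hypothetical entity descriptions (IDs and texts are invented).
docs = {
    "Q1": "apple technology company cupertino",
    "Q2": "apple fruit tree rosaceae",
    "Q3": "microsoft technology company redmond",
}

def tfidf_vectors(docs):
    """token -> tf * idf weight per document (smoothed idf)."""
    tokenized = {k: v.split() for k, v in docs.items()}
    n = len(docs)
    df = Counter(w for toks in tokenized.values() for w in set(toks))
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    return {k: {w: c * idf[w] for w, c in Counter(toks).items()}
            for k, toks in tokenized.items()}

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs = tfidf_vectors(docs)
best = max(("Q2", "Q3"), key=lambda k: cosine(vecs["Q1"], vecs[k]))
print(best)  # → Q3: two shared tokens outweigh the single shared "apple"
```

Such string-level scores are cheap but surface-bound: Q1 and Q2 share "apple" without being the same concept, which motivates the sense-based and graph-based matchers discussed next.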
Graph-based methods achieve feasible performance for entity matching on medium-scale KGs with hierarchical graph structures. ETF [368] learns concept representations through semantic and graph-based features, including Katz similarity, random-walk betweenness centrality, and an information-propagation score. ParGenFS [369] leverages a graph-based fuzzy clustering algorithm to conceptualize a new entity; it models the thematic distribution to acquire distinctive concept clusters and searches for the corresponding location of an entity update in a target KG.
Entity-alignment tasks can also be handled by text-similarity-based models that detect surface similarity between entities, trading off performance against computation cost. Rdf-ai [370] proposes a systematic model to match two entity-node graphs: it leverages string-matching and lexical-feature-similarity algorithms to align the available attributes, then computes entity similarity for alignment. Similarly, Lime [371] further leverages metric spaces to detect aligned entity pairs, first generating entity exemplars to filter alignable candidates before similarity computation for entity fusion. Unlike small-scale KGs, the shaped large KGs contain meaningful relational paths and an enriched concept taxonomy. HolisticEM [372] employs IDF scores to compute the surface similarity of entity names for seed generation and utilizes Personalized PageRank (PPR) to measure distances between entity graphs by traversing their neighbor nodes.
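Personalized PageRank, the graph-proximity measure HolisticEM relies on, can be sketched as standard power iteration with the teleport distribution concentrated on a seed node, so the resulting scores measure proximity to that seed. The four-node graph is a toy example; dangling nodes leak mass in this sketch, which is harmless for ranking.

```python
import numpy as np

# Toy graph (invented): A -> B -> C -> {A, D}; D is a dangling node.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]

def personalized_pagerank(seed, alpha=0.85, iters=50):
    n = len(nodes)
    idx = {v: i for i, v in enumerate(nodes)}
    M = np.zeros((n, n))
    for u, v in edges:
        M[idx[v], idx[u]] = 1.0
    out = M.sum(axis=0)
    M = M / np.where(out == 0, 1, out)  # column-stochastic (dangling cols stay zero)
    p = np.zeros(n)
    p[idx[seed]] = 1.0                  # teleport only to the seed node
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = alpha * (M @ r) + (1 - alpha) * p
    return dict(zip(nodes, r.round(3)))

print(personalized_pagerank("A"))
```

With seed A, scores decay along the walk A → B → C → D, so nodes closer to the seed rank higher; in entity alignment, two entities whose neighborhoods yield similar PPR vectors are taken as alignment candidates.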