License: arXiv.org perpetual non-exclusive license
arXiv:2110.09749v5 [cs.CL] 21 Dec 2023

Importance Estimation from Multiple Perspectives for
Keyphrase Extraction

Mingyang Song, Liping Jing  and Lin Xiao
Beijing Key Lab of Traffic Data Analysis and Mining
Beijing Jiaotong University, China
{mingyang.song, lpjing, 17112079}@bjtu.edu.cn
  Corresponding author.
Abstract

Keyphrase extraction is a fundamental task in Natural Language Processing and usually consists of two main parts: candidate keyphrase extraction and keyphrase importance estimation. When humans read a document, they typically judge the importance of a phrase by its syntactic accuracy, information saliency, and concept consistency simultaneously. However, most existing keyphrase extraction approaches focus on only some of these perspectives, which leads to biased results. In this paper, we propose a new approach that estimates the importance of keyphrases from multiple perspectives (called KIEMP) and further improves the performance of keyphrase extraction. Specifically, KIEMP estimates the importance of a phrase with three modules: a chunking module to measure its syntactic accuracy, a ranking module to check its information saliency, and a matching module to judge the concept (i.e., topic) consistency between the phrase and the whole document. These three modules are joined via an end-to-end multi-task learning model, which helps the three parts enhance each other and balances the effects of the three perspectives. Experimental results on six benchmark datasets show that KIEMP outperforms the existing state-of-the-art keyphrase extraction approaches in most cases.

1 Introduction

Keyphrase Extraction (KE) aims to select a set of reliable phrases (e.g., “harmonic balance method”, “grobner base”, “error bound”, “algebraic representation”, and “singular point” in Table 1) with salient information and central topics from a given document, and it is a fundamental task in natural language processing. Most classic keyphrase extraction methods typically include two main components: candidate keyphrase extraction and keyphrase importance estimation Medelyan et al. (2009); Liu et al. (2010); Hasan and Ng (2014).

Input Document: harmonic balance ( hb ) method is well known principle for analyzing periodic oscillations on nonlinear networks and systems. because the hb method has a truncation error, approximated solutions have been guaranteed by error bounds. however, its numerical computation is very time consuming compared with solving the hb equation. this paper proposes proposes an algebraic representation of the error bound using grobner base. the algebraic representation enables to decrease the computational cost of the error bound considerably. moreover, using singular points of the algebraic representation, we can obtain accurate break points of the error bound by collisions.

Output / Target Keyphrases: harmonic balance method; grobner base; error bound; algebraic representation; singular point; quadratic approximation

Table 1: Sample input document with output / target keyphrases in the KP20k testing set. Keyphrases can typically be categorized into two types: present keyphrases, which appear in the given document, and absent keyphrases, which do not appear in it.

As shown in Table 1, each keyphrase usually consists of more than one word Meng et al. (2017). To extract candidate keyphrases from a given document, which is typically characterized via word-level representations, researchers leverage heuristics Wan and Xiao (2008); Liu et al. (2009a, b); Nguyen and Phan (2009); Grineva et al. (2009); Medelyan et al. (2009) to identify the candidate keyphrases. For example, word embeddings are composed into n-grams by a Convolutional Neural Network (CNN) Xiong et al. (2019); Sun et al. (2020); Wang et al. (2020).

Usually, the candidate set contains many more phrases than the ground-truth keyphrase set. Therefore, it is critical to select the important keyphrases from the candidate set with a good strategy. In other words, keyphrase importance estimation is one of the essential components in many keyphrase extraction models. Since keyphrase extraction concerns “the automatic selection of important and topical phrases from the body of a document” Turney (2000), its goal is to estimate the importance of the candidate keyphrases to determine which ones should be extracted. Recent approaches Sun et al. (2020); Wang et al. (2020) recast keyphrase extraction as a classification problem, which extracts keyphrases with a binary classifier. However, a binary classifier classifies each candidate keyphrase independently, and consequently, it does not allow us to determine which candidates are better than the others Hulth (2004). Therefore, some methods Jiang et al. (2009); Xiong et al. (2019); Wang et al. (2020); Sun et al. (2020) propose a ranking model to extract keyphrases, where the goal is to learn a phrase ranker that compares the saliency of two candidate phrases. Furthermore, many previous studies Liu et al. (2010); Wang et al. (2019); Liu et al. (2009b) extract keyphrases according to the main topics discussed in the source document. For example, Liu et al. (2010) propose a topical PageRank approach to measure the importance of words concerning different topics.

However, most existing keyphrase extraction methods estimate the importance of keyphrases from at most two perspectives, leading to biased extraction. Therefore, to improve the performance of keyphrase extraction, the importance of the candidate keyphrases needs to be estimated sufficiently from multiple perspectives. Motivated by this observation, we propose a new approach, KIEMP, that estimates importance from multiple perspectives simultaneously for the keyphrase extraction task. Concretely, it estimates the importance from three perspectives (syntactic accuracy, information saliency, and concept consistency) with three modules. A chunking module, as a binary classification layer, measures the syntactic accuracy of each candidate keyphrase. A ranking module checks the semantic saliency of each candidate phrase with a pairwise ranking approach, which introduces competition between the candidate keyphrases to extract more salient keyphrases. A matching module judges the concept relevance of each candidate phrase to the document via a metric learning framework. Furthermore, our model is trained jointly on the above three modules, balancing the effects of the three perspectives. Experimental results on two benchmark data sets show that KIEMP outperforms the existing state-of-the-art keyphrase extraction approaches in most cases.

2 Related Work

A good keyphrase extraction system typically consists of two steps: (1) candidate keyphrase extraction, extracting a list of words / phrases that serve as the candidate keyphrases using some heuristics Wan and Xiao (2008); Nguyen and Phan (2009); Medelyan et al. (2009); Grineva et al. (2009); Liu et al. (2009a, b); and (2) keyphrase importance estimation, determining which of these candidate phrases are keyphrases using different importance estimation approaches.

In candidate keyphrase extraction, heuristic rules are usually designed to avoid spurious phrases and keep the number of candidates to a minimum Hasan and Ng (2014). Generally, the heuristics mainly include (1) leveraging a stop word list Liu et al. (2009b), (2) allowing only words with certain part-of-speech tags Mihalcea and Tarau (2004); Liu et al. (2009a), and (3) composing words into n-grams to be the candidate keyphrases Medelyan et al. (2009); Sun et al. (2020); Xiong et al. (2019); Wang et al. (2020). The above heuristics have proven effective, with high recall in extracting gold keyphrases from various sources. Motivated by the above methods, in this paper, we leverage CNNs to compose words into n-grams as the candidate keyphrases.

In keyphrase importance estimation, the existing methods can be mainly divided into two categories: unsupervised and supervised. The unsupervised methods are usually categorized into four groups, i.e., graph-based ranking Mihalcea and Tarau (2004), topic-based clustering Liu et al. (2009b), simultaneous learning Zha (2002), and language modeling Tomokiyo and Hurst (2003). Early supervised approaches to keyphrase extraction recast this task as a binary classification problem Witten et al. (1999); Turney (2002, 2000); Jiang et al. (2009). Later, to determine which candidates are better than the others, many ranking approaches were proposed to rank the saliency of two phrases Jiang et al. (2009); Sun et al. (2020). This pairwise ranking approach, therefore, introduces competition between candidate keyphrases and has achieved good performance. Both supervised and unsupervised methods construct features or models from different perspectives to measure the importance of candidate keyphrases and determine which keyphrases should be extracted. However, the approaches mentioned above consider at most two perspectives when measuring the importance of phrases, which leads to biased keyphrase extraction. Different from the existing methods, the proposed KIEMP estimates the importance of the candidate keyphrases from multiple perspectives simultaneously.

3 Methodology

We formally define the problem of keyphrase extraction as follows. In this paper, KIEMP takes a document $D=\{w_1,\dots,w_i,\dots,w_M\}$ and learns to extract a set of keyphrases $K$ (each keyphrase may be composed of one or several words) from their n-gram based representations under multiple perspectives.

This section describes the architecture of KIEMP, as shown in Figure 1. KIEMP mainly consists of two submodels: candidate keyphrase extraction and keyphrase importance estimation. The former first identifies and extracts the candidate keyphrases. Then the latter estimates the importance of keyphrases from three perspectives simultaneously with three modules to determine which one should be extracted.

Figure 1: The KIEMP model architecture.

3.1 Contextualized Word Representation

Recently, pre-trained language models Peters et al. (2018); Devlin et al. (2019); Liu et al. (2019) have emerged as a critical technology for achieving impressive gains in a wide variety of natural language tasks Liu and Lapata (2019). These models extend the idea of word embeddings by learning contextual representations from large-scale corpora using a language modeling objective. In this situation, Xiong et al. (2019) propose to represent each word by its ELMo Peters et al. (2018) embedding and Sun et al. (2020) leverage variants of BERT Devlin et al. (2019); Liu et al. (2019) to obtain contextualized word representations. Motivated by the above approaches, we represent each word by RoBERTa Liu et al. (2019), which encodes $D$ into a sequence of vectors $H=\{h_1,\dots,h_i,\dots,h_M\}$:

$$H=\text{RoBERTa}\{w_1,\dots,w_i,\dots,w_M\}, \tag{1}$$

where $h_i\in\mathbb{R}^d$ indicates the $i$-th contextualized word embedding of $w_i$ from the last transformer layer in RoBERTa. Specifically, the [CLS] token of RoBERTa is used as the document representation.

3.2 Candidate Keyphrase Extraction

In the keyphrase extraction task, a keyphrase usually contains more than one word, as shown in Table 1. Therefore, it is necessary to identify the candidate keyphrases via some strategies. Previous work Medelyan et al. (2009); Sun et al. (2020); Wang et al. (2020); Xiong et al. (2019) allows n-grams that appear in the document to be the candidate keyphrases. Motivated by the previous approaches, we consider the language properties Xiong et al. (2019) and compose the contextualized word representations into n-grams by CNNs (similar to Sun et al. (2020)). Specifically, the phrase representation of the $i$-th $n$-gram $c_i^n$ is computed as:

$$h_i^n=\text{CNN}^n(h_{i:i+n}), \tag{2}$$

where $h_i^n\in\mathbb{R}^d$ indicates the $i$-th $n$-gram representation. Concretely, $n\in[1,N]$ is the length of n-grams, and $N$ indicates the maximum length of allowed candidate n-grams. Specifically, each n-gram length has its own set of convolution filters $\text{CNN}^n$ with window size $n$ and stride $1$.
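As an illustration of Eq. (2), the following minimal sketch composes word vectors into n-gram vectors with one filter bank per n-gram length, window size $n$, and stride 1. It is a toy under stated assumptions, not the authors' implementation: the helper names `make_filter` and `compose_ngrams`, the random initialization, the absence of an activation function, and the toy sizes are all hypothetical choices made for readability.

```python
import random

def make_filter(n, d, rng):
    """Hypothetical filter bank for length-n windows: n weight matrices (d x d) and a bias."""
    weights = [[[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(d)]
               for _ in range(n)]
    bias = [0.0] * d
    return weights, bias

def compose_ngrams(H, n, filt):
    """Slide a window of size n with stride 1 over H; return one d-dim vector per n-gram."""
    weights, bias = filt
    d = len(bias)
    out = []
    for i in range(len(H) - n + 1):
        vec = list(bias)
        for j in range(n):          # position inside the window
            for r in range(d):      # output channel
                vec[r] += sum(weights[j][r][c] * H[i + j][c] for c in range(d))
        out.append(vec)
    return out

rng = random.Random(0)
M, d, N = 6, 4, 3  # toy sizes (assumption); Table 3 of the paper lists d = 768 and N = 5
H = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(M)]  # stand-in for RoBERTa outputs
filters = {n: make_filter(n, d, rng) for n in range(1, N + 1)}
ngram_reprs = {n: compose_ngrams(H, n, filters[n]) for n in range(1, N + 1)}
counts = {n: len(vectors) for n, vectors in ngram_reprs.items()}
```

With stride 1 and no padding, a document of $M$ words yields $M-n+1$ candidates of length $n$, so the candidate set grows roughly as $N \cdot M$.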

3.3 Keyphrase Importance Estimation

In the keyphrase extraction models, keyphrase importance estimation commonly is one of the essential components. To improve the accuracy of keyphrase extraction, we estimate the importance of keyphrases from three perspectives simultaneously with three modules: chunking for syntactic accuracy, ranking for information saliency, and matching for concept consistency.

3.3.1 Chunking for Syntactic Accuracy

Many studies Turney (2002); Witten et al. (1999); Turney (2000) regard keyphrase extraction as a classification task, in which a model is trained to determine whether a candidate phrase is a keyphrase from a syntactic perspective. For example, Xiong et al. (2019); Sun et al. (2020) directly predict whether an n-gram is a keyphrase based on its corresponding representation. Motivated by these methods, in this paper, the syntactic accuracy of phrase $c_i^n$ is estimated by a chunking module:

$$I_1(c_i^n)=\text{softmax}(\mathbf{W}_1 h_i^n+b_1), \tag{3}$$

where $\mathbf{W}_1$ and $b_1$ indicate a trainable matrix and a bias. The softmax is taken over all possible n-grams at each position $i$ and each length $n$. The whole model is trained using cross-entropy loss:

$$L_c=\text{CrossEntropy}(y_i^n, I_1(c_i^n)), \tag{4}$$

where $y_i^n$ is the label of whether the phrase $c_i^n$ is a keyphrase of the original document.
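A minimal sketch of Eqs. (3) and (4) follows. It assumes the chunking module is a two-class (keyphrase vs. non-keyphrase) linear layer applied to each n-gram vector, in line with the description of the module as a binary classification layer. The function names and toy weights are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def chunk_probs(h, W1, b1):
    """Return [P(non-keyphrase), P(keyphrase)] for one n-gram vector h."""
    logits = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(W1, b1)]
    return softmax(logits)

def cross_entropy(y, probs):
    """Negative log-likelihood of the gold class y in {0, 1}."""
    return -math.log(probs[y])

W1 = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.2]]  # toy weights (assumption)
b1 = [0.0, 0.0]
h = [1.0, 2.0, -1.0]                       # toy n-gram representation
probs = chunk_probs(h, W1, b1)
loss_if_keyphrase = cross_entropy(1, probs)
loss_if_not = cross_entropy(0, probs)
```

Here the keyphrase logit is the larger of the two, so labelling this n-gram as a keyphrase gives the smaller loss.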

3.3.2 Ranking for Information Saliency

The binary classifier-based keyphrase extraction model classifies each candidate keyphrase independently, and consequently, it does not allow us to determine which candidates are better than the others Hulth (2004). However, the goal of keyphrase extraction is to identify the most salient phrases for a document Hasan and Ng (2014). Therefore, a ranking model is required to rank the saliency of the candidate keyphrases. We leverage a pairwise learning approach to rank the candidate keyphrases globally to compare the information saliency between all candidates. First, we put the candidate keyphrases in the document that are labeled as keyphrases into the positive set $\mathbf{P}^+$, and the others into the negative set $\mathbf{P}^-$, to obtain the ranking labels. Then, the loss function is the standard hinge loss in the pairwise learning model:

$$L_r=\sum_{p^+,p^-\in K}\max(0,\delta_1-I_2(p^+)+I_2(p^-)), \tag{5}$$

where $I_2(\cdot)$ represents the estimation of information saliency and $\delta_1$ indicates the margin. It enforces KIEMP to rank the candidate keyphrases $p^+$ ahead of $p^-$ within the same document. Specifically, the information saliency of the $i$-th n-gram representation $c_i^n$ can be computed as follows:

$$I_2(c_i^n)=\mathbf{W}_2 h_i^n+b_2, \tag{6}$$

where $\mathbf{W}_2$ is a trainable matrix, and $b_2$ is a bias. Through the pairwise learning model, we can rank the information saliency of all candidates and extract the keyphrases with more salient information sufficiently.
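The pairwise hinge loss of Eq. (5) and the linear scorer of Eq. (6) can be sketched as below. This is an illustrative toy, assuming the scorer outputs one scalar per n-gram (so $\mathbf{W}_2$ is treated as a single weight vector) and summing over every positive/negative pair in one document. The names and numbers are hypothetical.

```python
def saliency(h, w2, b2):
    """Scalar saliency score I_2 for one n-gram vector h (Eq. 6)."""
    return sum(w * x for w, x in zip(w2, h)) + b2

def pairwise_hinge_loss(pos_scores, neg_scores, margin=1.0):
    """Sum of max(0, margin - s_pos + s_neg) over all positive/negative pairs (Eq. 5)."""
    return sum(max(0.0, margin - sp + sn)
               for sp in pos_scores for sn in neg_scores)

w2, b2 = [1.0, -1.0], 0.0              # toy parameters (assumption)
positives = [[3.0, 0.0], [1.0, 0.5]]   # labeled keyphrases, scores 3.0 and 0.5
negatives = [[0.0, 1.0], [1.0, 1.0]]   # other candidates, scores -1.0 and 0.0
pos_scores = [saliency(h, w2, b2) for h in positives]
neg_scores = [saliency(h, w2, b2) for h in negatives]
loss = pairwise_hinge_loss(pos_scores, neg_scores, margin=1.0)
```

Only the pair (0.5, 0.0) violates the margin of 1.0, contributing 0.5. Every other pair is already separated by at least the margin and contributes zero.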

3.3.3 Matching for Concept Consistency

As phrases are used to express various meanings corresponding to different concepts (i.e., topics), a phrase will play different important roles in different concepts of the document Liu et al. (2010). A matching module is proposed via a metric learning framework to estimate the concept consistency between the candidate keyphrases and their corresponding document. We first apply a variational autoencoder (VAE) Rezende et al. (2014) to the documents $\mathbf{D}$ and the candidate keyphrases $\mathbf{K}$ to obtain their concepts. Each document $D$ is encoded via a latent variable $z\in\mathbb{R}^c$, which is assumed to be sampled from a standard Gaussian prior, i.e., $z\sim p(z)=\mathcal{N}(0,I_d)$. Such a variable is able to determine the latent concepts hidden in the documents and is useful for extracting keyphrases Wang et al. (2019). During the encoding process, $z$ can be sampled via a re-parameterization trick for the Gaussian distribution, i.e., $z\sim q(z|D)=\mathcal{N}(\mu,\sigma)$. Specifically, we sample an auxiliary noise variable $\varepsilon\sim\mathcal{N}(0,I)$ and re-parameterize $z=\mu+\sigma\odot\varepsilon$, where $\odot$ denotes element-wise multiplication. The mean vector $\mu\in\mathbb{R}^c$ and variance vector $\sigma\in\mathbb{R}^c$ are inferred by a two-layer network with a ReLU activation function, i.e., $\mu=\mu_\phi(D)$ and $\sigma=\sigma_\phi(D)$, where $\phi$ is the parameter set. During the decoding process, the document can be reconstructed by a multi-layer network ($f_k$) with a Tanh activation function, i.e., $\tilde{D}=f_k(z)$. Furthermore, the candidate keyphrases are processed in the same way as the documents.
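The re-parameterization step $z=\mu+\sigma\odot\varepsilon$ can be sketched as follows. This toy assumes $\sigma$ acts as a per-dimension scale, as the formula implies, and uses fixed vectors for $\mu$ and $\sigma$ instead of the encoder network outputs. The function name `reparameterize` is hypothetical.

```python
import random

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps element-wise, with eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

rng = random.Random(0)
mu = [0.5, -1.0, 2.0]     # toy mean vector (assumption); c = 3 here, c = 64 in Table 3
sigma = [0.1, 0.2, 0.3]   # toy scale vector (assumption)
z = reparameterize(mu, sigma, rng)

# With sigma = 0 the sample collapses to the mean, which shows that all
# randomness enters through eps and not through the parameters themselves.
z_det = reparameterize(mu, [0.0, 0.0, 0.0], rng)
```

Because the noise is drawn separately from $\mu$ and $\sigma$, gradients can flow through both quantities during training, which is the usual motivation for this trick.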

Once we have the latent concept representations of the document, $z$, and of the phrase, $z_i^n$, the concept consistency can be estimated as follows:

$$I_3(c_i^n,D)=z_i^n\mathbf{W}_3 z. \tag{7}$$

Here, $\mathbf{W}_3$ is a learnable mapping matrix. The loss function is the triplet loss in the metric learning framework, calculated as follows:

$$L_m=\sum_{p^+,p^-\in K}\max(0,I_3(p^-,D)-I_3(p^+,D)+\delta_2), \tag{8}$$

where $\delta_2$ represents the margin. It enforces KIEMP to match and rank the concept consistency of keyphrases $p^+$ ahead of the non-keyphrases $p^-$ within their corresponding document $D$.
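A small sketch of the bilinear matching score in Eq. (7) and the triplet loss in Eq. (8) is given below. It is illustrative only: the latent vectors are fixed by hand, not produced by a VAE, and $\mathbf{W}_3$ is set to the identity so that the score reduces to a dot product. All names and values are hypothetical.

```python
def bilinear(z_phrase, W3, z_doc):
    """Concept consistency score z_phrase^T W3 z_doc (Eq. 7)."""
    Wz = [sum(w * x for w, x in zip(row, z_doc)) for row in W3]
    return sum(a * b for a, b in zip(z_phrase, Wz))

def triplet_loss(pos_scores, neg_scores, margin=1.0):
    """Sum of max(0, s_neg - s_pos + margin) over all positive/negative pairs (Eq. 8)."""
    return sum(max(0.0, sn - sp + margin)
               for sp in pos_scores for sn in neg_scores)

W3 = [[1.0, 0.0], [0.0, 1.0]]      # identity mapping (assumption)
z_doc = [1.0, 2.0]                 # toy document concept vector
z_pos = [[1.0, 1.0]]               # keyphrase, score 3.0
z_neg = [[0.5, 1.0], [-1.0, 0.0]]  # non-keyphrases, scores 2.5 and -1.0
pos_scores = [bilinear(z, W3, z_doc) for z in z_pos]
neg_scores = [bilinear(z, W3, z_doc) for z in z_neg]
loss = triplet_loss(pos_scores, neg_scores, margin=1.0)
```

The first non-keyphrase scores only 0.5 below the keyphrase, so it falls inside the margin and contributes 0.5. The second is far enough away to contribute nothing.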

Furthermore, to simultaneously minimize the reconstruction loss and penalize the discrepancy between the prior distribution and the posterior distribution of the latent variable $z$, the VAE process can be implemented by optimizing the following objective functions for the documents ($L_d$) and the candidate keyphrases ($L_k$):

$$L_d=-\mathbb{E}_{q(\mathbf{z}|\mathbf{D})}\big[p(\mathbf{D}|\mathbf{z})\big]+D_{KL}\big(p(\mathbf{z})\,\|\,q(\mathbf{z}|\mathbf{D})\big), \tag{9}$$

$$L_k=-\mathbb{E}_{q(\mathbf{z}|\mathbf{K})}\big[p(\mathbf{K}|\mathbf{z})\big]+D_{KL}\big(p(\mathbf{z})\,\|\,q(\mathbf{z}|\mathbf{K})\big), \tag{10}$$

where $D_{KL}$ indicates the Kullback-Leibler divergence between two distributions. The final loss of this module is calculated as follows:

$$L_t=L_m+\lambda L_d+(1-\lambda)L_k, \tag{11}$$

where $\lambda\in(0,1)$ indicates the balance factor. Through concept consistency matching, we expect to align keyphrases with high-level concepts (i.e., topics or structures) in the document to assist the model in extracting keyphrases with more important concepts.

3.4 Model Training and Inference

Multi-task learning has played an essential role in various fields Srna et al. (2018), and has recently been widely used in natural language processing tasks Sun et al. (2020); Mu et al. (2020). Therefore, our framework allows end-to-end learning of syntactic chunking, saliency ranking, and concept matching. We define the training objective of the entire framework as the linear combination of $L_c$, $L_r$, and $L_t$:

$$L=\epsilon_1 L_c+\epsilon_2 L_r+\epsilon_3 L_t, \tag{12}$$

where the hyper-parameters $\epsilon_1$, $\epsilon_2$, and $\epsilon_3$ balance the effects of the importance estimation from three perspectives. Specifically, $\epsilon_1+\epsilon_2+\epsilon_3=1$.
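To make the loss weighting concrete, the sketch below combines the matching-module loss of Eq. (11) with the overall objective of Eq. (12), using the values reported in Table 3 ($\lambda=0.5$ and $\epsilon_1=\epsilon_2=\epsilon_3=1/3$). The individual loss values are made-up numbers for illustration, and the function names are hypothetical.

```python
def matching_module_loss(L_m, L_d, L_k, lam=0.5):
    """L_t = L_m + lam * L_d + (1 - lam) * L_k (Eq. 11)."""
    return L_m + lam * L_d + (1.0 - lam) * L_k

def total_loss(L_c, L_r, L_t, eps=(1 / 3, 1 / 3, 1 / 3)):
    """L = eps1 * L_c + eps2 * L_r + eps3 * L_t, with the weights summing to 1 (Eq. 12)."""
    if abs(sum(eps) - 1.0) > 1e-9:
        raise ValueError("the three weights must sum to 1")
    e1, e2, e3 = eps
    return e1 * L_c + e2 * L_r + e3 * L_t

# Made-up per-module losses for one training batch (assumption).
L_t = matching_module_loss(L_m=0.5, L_d=2.0, L_k=1.0)  # 0.5 + 1.0 + 0.5
L = total_loss(L_c=0.7, L_r=0.3, L_t=L_t)              # (0.7 + 0.3 + 2.0) / 3
```

With equal weights the objective is simply the mean of the three module losses, so no single perspective dominates by construction. Unequal weights would shift that balance.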

In this paper, KIEMP aims to extract keyphrases according to their saliency. It contains three modules: syntactic accuracy chunking, information saliency ranking, and concept consistency matching. Chunking and matching are used to push the ranking module to rank the proper candidate keyphrases ahead. Therefore, only the ranking module is used in the inference process (test phase).
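Inference with the ranking module alone can be sketched as a top-$k$ selection over saliency scores. The paper does not spell out how repeated occurrences of the same phrase are merged, so keeping the maximum score per lower-cased surface form is an assumption of this sketch, as are the function name and the example scores.

```python
def extract_top_k(candidates, k):
    """Rank (phrase, saliency) pairs and return the k best distinct phrases."""
    best = {}
    for phrase, score in candidates:
        key = phrase.lower()
        # Assumption: a phrase occurring several times keeps its highest score.
        if key not in best or score > best[key]:
            best[key] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [phrase for phrase, _ in ranked[:k]]

# Made-up saliency scores for n-grams from the Table 1 example.
candidates = [
    ("error bound", 2.1),
    ("harmonic balance method", 3.4),
    ("error bound", 2.8),
    ("this paper", -0.7),
    ("grobner base", 1.9),
]
top3 = extract_top_k(candidates, 3)
```

Here `top3` lists the three highest-scoring distinct phrases in descending order of saliency.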

| Dataset | Avg. Document Len. | Avg. # Keyphrases | Avg. Keyphrase Len. |
|---|---|---|---|
| OpenKP | 900.4 | 1.8 | 2.0 |
| KP20k | 179.8 | 5.3 | 2.0 |
| Inspec | 128.7 | 9.8 | 2.5 |
| Krapivin | 182.6 | 5.8 | 2.2 |
| Nus | 219.1 | 11.7 | 2.2 |
| SemEval | 234.8 | 14.7 | 2.4 |

Table 2: Statistics of six benchmark datasets. Document Len. and Keyphrase Len. represent the number of words in the document and keyphrase respectively.

4 Experimental Settings

4.1 Datasets

Six benchmark datasets are mainly used in our experiments: OpenKP Xiong et al. (2019), KP20k Meng et al. (2017), Inspec Hulth (2003), Krapivin Krapivin and Marchese (2009), Nus Nguyen and Kan (2007), and SemEval Kim et al. (2010). Table 2 summarizes the statistics of each testing set.

OpenKP consists of around 150K documents sampled from the index of the Bing search engine. In OpenKP, we follow the official split of training (134K documents), development (6.6K documents), and testing (6.6K documents) sets. The keyphrases for each document in OpenKP were labeled by expert annotators, with each document assigned 1-3 keyphrases. As a requirement, all the keyphrases appeared in the original document Xiong et al. (2019).

KP20k contains a large amount of high-quality scientific metadata in the computer science domain from various online digital libraries Meng et al. (2017). We follow the official setting of this dataset and split it into training (528K documents), validation (20K documents), and testing (20K documents) sets. From the training set of KP20k, we remove all articles that are duplicated within the training set itself or that also appear in the KP20k validation or testing set. After the cleanup, the KP20k dataset contains 504K training samples, 20K validation samples, and 20K testing samples.

To verify the robustness of KIEMP, we also test the model trained with KP20k dataset on four widely-adopted keyphrase extraction data sets including Inspec, Krapivin, Nus, and SemEval.

In this paper, we focus on keyphrase extraction. Therefore, only the keyphrases that appear in the documents are used for training and evaluation.
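The restriction to present keyphrases can be illustrated with a simple filter. This sketch assumes lower-cased, whitespace-tokenized text and exact token-sequence matching. Real preprocessing pipelines often add stemming or other normalization, which is not shown here, and the function name is hypothetical.

```python
def present_keyphrases(document, keyphrases):
    """Keep only keyphrases whose token sequence occurs contiguously in the document."""
    doc_tokens = document.lower().split()
    present = []
    for kp in keyphrases:
        kp_tokens = kp.lower().split()
        n = len(kp_tokens)
        # Assumption: exact token match, no stemming.
        if any(doc_tokens[i:i + n] == kp_tokens
               for i in range(len(doc_tokens) - n + 1)):
            present.append(kp)
    return present

doc = ("this paper proposes an algebraic representation of the "
       "error bound using grobner base")
gold = ["grobner base", "error bound", "algebraic representation",
        "quadratic approximation"]
kept = present_keyphrases(doc, gold)
```

In this toy input, "quadratic approximation" never occurs in the text, so it is an absent keyphrase and is dropped from `kept`.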

| Hyper-parameter | Dimension or Value |
|---|---|
| $\lambda$ | 0.5 |
| $\epsilon_1,\epsilon_2,\epsilon_3$ | 1/3 |
| $\delta_1,\delta_2$ | 1.0 |
| Optimizer | AdamW |
| Learning Rate | $1\times10^{-5}$ |
| Batch Size | 32 |
| Warm-Up Proportion | 10% |
| RoBERTa Embedding ($\mathbb{R}^d$) | 768 |
| Concept Dimension ($\mathbb{R}^c$) | 64 |
| Max Sequence Length | 512 |
| Maximum Phrase Length ($N$) | 5 |

Table 3: Parameters used for training KIEMP.


| Model | OpenKP $R@1$ | OpenKP $R@3$ | OpenKP $R@5$ | OpenKP $F_1@1$ | OpenKP $F_1@3$ | OpenKP $F_1@5$ | KP20k $F_1@5$ | KP20k $F_1@10$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Unsupervised Methods | | | | | | | | |
| TFIDF Jones (2004) | 0.150 | 0.284 | 0.347 | 0.196* | 0.223* | 0.196* | 0.105 | 0.130 |
| TextRank Mihalcea and Tarau (2004) | 0.041 | 0.098 | 0.142 | 0.054* | 0.076* | 0.079* | 0.180 | 0.150 |
| Supervised Methods with Additional Features | | | | | | | | |
| BLING-KPE Xiong et al. (2019) | 0.220 | 0.390 | 0.481 | 0.285* | 0.303* | 0.270* | - | - |
| SMART-KPE+R2J Wang et al. (2020) | 0.307 | 0.532 | 0.625 | 0.381 | 0.405 | 0.347 | - | - |
| Supervised Methods without Additional Features | | | | | | | | |
| CopyRNN Meng et al. (2017) | 0.174 | 0.331 | 0.413 | 0.217* | 0.237* | 0.210* | 0.327 | 0.278 |
| DivGraphPointer Sun et al. (2019) | - | - | - | - | - | - | 0.368 | 0.292 |
| Div-DGCN Zhang et al. (2020) | - | - | - | - | - | - | 0.349 | 0.313 |
| SKE-Large-CLS Mu et al. (2020) | - | - | - | - | - | - | 0.392 | 0.330 |
| ChunkKPE Sun et al. (2020) | 0.283 | 0.486 | 0.581 | 0.355 | 0.373 | 0.324 | 0.408 | 0.337 |
| RankKPE Sun et al. (2020) | 0.290 | 0.509 | 0.604 | 0.361 | 0.390 | 0.337 | 0.417 | 0.343 |
| JointKPE Sun et al. (2020) | 0.291 | 0.511 | 0.605 | 0.364 | 0.391 | 0.338 | 0.419 | 0.344 |
| KIEMP | 0.298 | 0.517 | 0.615 | 0.369 | 0.392 | 0.340 | 0.421 | 0.345 |

Table 4: Performance of keyphrase extraction models on the OpenKP development set and the KP20k testing set. The best results of our model are highlighted in bold, and the best results of baselines are underlined. * indicates that these numbers are not reported in the original paper and are estimated from Precision and Recall. The results of the baselines are taken from their corresponding papers.
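For reference, the starred entries presumably follow from the standard definition of the F-measure as the harmonic mean of precision and recall at cutoff $k$. The notation $P@k$ is introduced here for illustration only, and a value estimated this way from averaged precision and recall can differ slightly from a per-document macro-averaged $F_1$:

```latex
F_1@k = \frac{2 \cdot P@k \cdot R@k}{P@k + R@k}
```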

4.2 Baselines

This paper focuses on the comparisons with the state-of-the-art baselines and chooses the following keyphrase extraction models as our baselines.

TextRank is an unsupervised algorithm based on weighted graphs, proposed by Mihalcea and Tarau (2004). Given a word graph built on co-occurrences, it calculates the importance of candidate words with PageRank. The importance of a candidate keyphrase is then estimated as the sum of the scores of its constituent words.
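The TextRank procedure described above can be sketched in a few lines. This is an illustrative sketch rather than the reference implementation: the window size, damping factor, iteration count, and function names are assumptions, and no stop-word or part-of-speech filtering is applied.

```python
from collections import defaultdict
from typing import Dict, List

Graph = Dict[str, Dict[str, float]]


def build_word_graph(tokens: List[str], window: int = 2) -> Graph:
    """Undirected graph whose edge weights count co-occurrences within `window` tokens."""
    graph: Graph = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + 1 + window]:
            if v != w:
                graph[w][v] += 1.0
                graph[v][w] += 1.0
    return graph


def pagerank(graph: Graph, damping: float = 0.85, iters: int = 50) -> Dict[str, float]:
    """Weighted PageRank computed by power iteration."""
    nodes = list(graph)
    out_weight = {n: sum(graph[n].values()) for n in nodes}
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        score = {
            n: (1 - damping) / len(nodes)
            + damping * sum(score[m] * w / out_weight[m] for m, w in graph[n].items())
            for n in nodes
        }
    return score


def phrase_score(phrase: str, word_scores: Dict[str, float]) -> float:
    # A candidate's importance is the sum of its constituent word scores.
    return sum(word_scores.get(w, 0.0) for w in phrase.split())


tokens = "word graph ranking scores each word in the word graph".split()
scores = pagerank(build_word_graph(tokens))
for cand in sorted(["word graph", "graph ranking"], key=lambda c: -phrase_score(c, scores)):
    print(cand, round(phrase_score(cand, scores), 3))
```

Because phrase scores are sums, longer candidates tend to score higher, which is one reason practical variants normalize by length or restrict candidates to noun phrases.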

TFIDF Jones (2004) scores each candidate based on its frequency in the given text and its inverse document frequency.
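A minimal sketch of TF-IDF candidate scoring is given below. The smoothed IDF variant and the function name `tfidf_scores` are assumptions chosen for illustration; several equivalent weighting schemes are in common use.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_scores(doc_candidates: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Score each candidate of one document by term frequency times inverse document frequency."""
    tf = Counter(doc_candidates)
    n_docs = len(corpus)
    scores = {}
    for cand, freq in tf.items():
        df = sum(1 for d in corpus if cand in d)
        # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[cand] = freq * idf
    return scores


corpus = [
    ["keyphrase extraction", "neural network"],
    ["neural network", "graph ranking"],
    ["neural network", "topic model"],
]
scores = tfidf_scores(corpus[0], corpus)
# "keyphrase extraction" occurs in fewer documents, so it outranks "neural network".
print(sorted(scores, key=scores.get, reverse=True))
```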

CopyRNN Meng et al. (2017) uses the attention mechanism as a copy mechanism to extract keyphrases from the given document.

BLING-KPE Xiong et al. (2019) first concatenates word embeddings from a pre-trained language model (ELMo Peters et al. (2018)) with visual and positional features, and then uses a CNN to obtain n-gram phrase embeddings for binary classification.

JointKPE Sun et al. (2020) jointly learns a chunking model (ChunkKPE) and a ranking model (RankKPE) for keyphrase extraction.

SMART-KPE+R2J Wang et al. (2020) presents a multi-modal method for the keyphrase extraction task, which leverages lexical and visual features to enable strategy induction as well as meta-level features to aid in strategy selection.

DivGraphPointer Sun et al. (2019) combines the advantages of traditional graph-based ranking methods and recent neural network-based approaches. Furthermore, they also propose a diversified point network to generate a set of diverse keyphrases out of the word graph in the decoding process.

Div-DGCN Zhang et al. (2020) adopts Dynamic Graph Convolutional Networks (DGCN) to acquire an informative latent document representation and to better model the compositionality of the target keyphrase set.

SKE-Large-CLS Mu et al. (2020) obtains span-based representation for each keyphrase and further learns to capture the similarity between keyphrases in the source document to get better keyphrase predictions.

In this paper, for ease of introduction, all the baselines are divided according to three perspectives: syntax, saliency, and the combination of syntax and saliency. Among them, BLING-KPE, CopyRNN, and ChunkKPE belong to the first; TFIDF, TextRank, and RankKPE belong to the second; and DivGraphPointer, Div-DGCN, SKE-Large-CLS, SMART-KPE+R2J, and JointKPE belong to the last.

4.3 Evaluation Metrics

For the keyphrase extraction task, the performance of a keyphrase model is typically evaluated by comparing the top $k$ predicted keyphrases with the target keyphrases (ground-truth labels). The evaluation cutoff $k$ can be a fixed number (e.g., $F_1@5$ compares the top-5 keyphrases predicted by the model with the ground truth to compute an $F_1$ score). Following previous work Meng et al. (2017); Sun et al. (2019), we adopt macro-averaged recall and F-measure ($F_1$) as evaluation metrics, with $k$ set to 1, 3, 5, and 10. In the evaluation, we apply the Porter Stemmer Porter (2006) to both target keyphrases and extracted keyphrases when determining whether two keyphrases, or two words, match.
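The metric can be sketched as follows, using the KIEMP predictions and target keyphrases of the two examples in Table 6. The function names are hypothetical, precision is divided by the number of returned predictions (one of two common conventions), and the default `normalize` only lowercases; a Porter stemmer applied word by word could be passed in its place to mirror the evaluation described above.

```python
from typing import Callable, List, Tuple


def prf_at_k(predicted: List[str], targets: List[str], k: int,
             normalize: Callable[[str], str] = str.lower) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the top-k predictions of one document."""
    top_k = [normalize(p) for p in predicted[:k]]
    gold = {normalize(t) for t in targets}
    correct = len(set(top_k) & gold)
    precision = correct / len(top_k) if top_k else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


def macro_f1_at_k(docs: List[Tuple[List[str], List[str]]], k: int) -> float:
    # Macro average: compute F1 per document, then take the mean over documents.
    return sum(prf_at_k(pred, gold, k)[2] for pred, gold in docs) / len(docs)


docs = [
    (["great plateau", "breath of the wild", "hyrule", "paraglider", "starting region"],
     ["great plateau", "breath of the wild", "hyrule"]),
    (["leadership styles", "managers", "charismatic leadership",
      "transformational leadership", "leadership"],
     ["managers", "leadership", "teams"]),
]
print(round(macro_f1_at_k(docs, 5), 3))  # 0.625
```

For the first document, three of the five predictions are correct (precision 0.6, recall 1.0, $F_1$ 0.75); for the second, two of five are correct (precision 0.4, recall 2/3, $F_1$ 0.5), giving a macro-averaged $F_1@5$ of 0.625 over these two documents.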

4.4 Implementation Details

Implementation details of our proposed model are as follows. The maximum document length is 512 due to BERT limitations Devlin et al. (2019), and documents are zero-padded or truncated to this length. Training used four GeForce RTX 2080 Ti GPUs and took about 31 hours and 77 hours for the OpenKP and KP20k datasets, respectively. Table 3 lists the parameters of our model. Furthermore, the model was implemented in PyTorch Paszke et al. (2019) using the Hugging Face re-implementation of RoBERTa Wolf et al. (2019).
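To make the learning-rate settings in Table 3 concrete, the sketch below computes a schedule with a peak rate of $1 \times 10^{-5}$ and a 10% warm-up proportion. The helper `learning_rate` is hypothetical, and the linear decay after warm-up is an assumption: the text specifies only the warm-up proportion, not the shape of the schedule.

```python
def learning_rate(step: int, total_steps: int, peak_lr: float = 1e-5,
                  warmup_proportion: float = 0.1) -> float:
    """Linear warm-up to peak_lr, then linear decay to zero (decay shape is an assumption)."""
    warmup_steps = max(1, int(total_steps * warmup_proportion))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)


total = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, learning_rate(step, total))
```

With 1000 total steps, the rate rises linearly over the first 100 steps, reaches its peak at step 100, and then decreases until the end of training.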

5 Results and Analysis

This section investigates the performance of the proposed KIEMP on six widely used benchmark datasets (OpenKP, KP20k, Inspec, Krapivin, Nus, and SemEval) from three facets. The first demonstrates its superiority by comparing it with ten baselines in terms of several metrics. The second verifies the sensitivity of the concept dimension. The last explicitly shows the keyphrase extraction results of KIEMP via two examples (two testing documents).

5.1 Overall Performance

Table 4 summarizes the overall performance of the different algorithms on two benchmarks (OpenKP and KP20k). The supervised methods outperform all the unsupervised algorithms (TFIDF and TextRank), which is not surprising since the supervised methods are trained end-to-end with supervised data. Among the supervised baselines, the methods using additional features are better than those without additional features. The reason is that models with additional features effectively encode keyphrases from multiple feature perspectives, which helps the model measure the importance of each keyphrase and thus improves keyphrase extraction performance. Intuitively, this is the same idea as our proposed method: KIEMP considers the importance of keyphrases from multiple perspectives and fairly measures the importance of each keyphrase. The difference is that we do not need additional features. In many practical applications of keyphrase extraction, no additional features (e.g., visual features) are available. Compared with recent baselines (ChunkKPE, RankKPE, and JointKPE), KIEMP performs consistently better on all metrics on both datasets. These results demonstrate the benefits of estimating the importance of keyphrases from multiple perspectives simultaneously and the effectiveness of our multi-task learning strategy.

Figure 2: Results of keyphrase extraction models on four testing sets (SemEval, Inspec, Krapivin, and Nus). The results of JointKPE are re-evaluated using the code provided with its corresponding paper.

Furthermore, to verify the robustness of KIEMP, we also test the KIEMP model trained on the KP20k dataset on four widely adopted keyphrase extraction datasets. Figure 2 shows that KIEMP is superior to the best baseline (JointKPE). We attribute this to two benefits. One is that the high-level concepts captured by a deep latent variable model may contain topic and structure features, which are essential information for evaluating the importance of phrases. The other is that the latent variable is characterized by a probability distribution over possible values rather than a fixed value, which can enforce the uncertainty of our model and further lead to robust representation learning.

| Concept Dimension ($\mathbb{R}^{c}$) | OpenKP $R@1$ | OpenKP $R@3$ | OpenKP $R@5$ |
| --- | --- | --- | --- |
| 64 | 0.298 | 0.517 | 0.615 |
| 256 | 0.297 | 0.513 | 0.610 |
| 512 | 0.296 | 0.509 | 0.609 |
| 768 | 0.293 | 0.508 | 0.606 |

Table 5: Effectiveness of different dimensions of latent concept representation. The best results are highlighted in bold.

(A) Part of the Input Document:

The Great Plateau is a large region of land that is secluded from other parts of Hyrule, as its steep slopes prevent anyone from traveling to and from it without special equipment, such as the Paraglider. The only active inhabitant is the Old Man, a mysterious … (URL: https://zelda.gamepedia.com/Great_Plateau)

Target Keyphrase: (1) great plateau ; (2) breath of the wild ; (3) hyrule

KIEMP without concept consistency matching: (1) great plateau ; (2) hyrule ; (3) breath of the wild ; (4) paraglider ; (5) zelda

KIEMP: (1) great plateau ; (2) breath of the wild ; (3) hyrule ; (4) paraglider ; (5) starting region

(B) Part of the Input Document:

Transformational leaders also depend on visionary leadership to win over followers, but they have an added focus on employee development. For example, a transformational leader might explain how her plan for the future serves her employees’ interests and how she will support them through the changes … (URL: https://yourbusiness.azcentral.com/managers-different-leadership-styles-motivate-teams-8481.html)

Target Keyphrase: (1) managers ; (2) leadership ; (3) teams

KIEMP without concept consistency matching: (1) motivating ; (2) motivate ; (3) charismatic leadership ; (4) transformational leadership ; (5) employee development

KIEMP: (1) leadership styles ; (2) managers ; (3) charismatic leadership ; (4) transformational leadership ; (5) leadership

Table 6: Example of keyphrase extraction results (selected from the OpenKP dataset). Phrases in red and bold are target keyphrases predicted by the different models (KIEMP without concept consistency matching and KIEMP).

5.2 Sensitivity of the Concept Dimension

Here, we examine the effect of different concept dimensions. Table 5 shows that increasing the dimension of the latent concept representation has little effect on the keyphrase extraction results; if anything, smaller dimensions perform slightly better. Furthermore, in Table 4, the improvement of our proposed KIEMP model on the $F_1@1$ evaluation metric is higher than on the $F_1@3$ and $F_1@5$ evaluation metrics on the OpenKP dataset. We consider the main reason to be that our concept representation may capture the high-level conceptual information of phrases or documents, such as topics and structure information. Therefore, KIEMP with the concept consistency matching module focuses more on extracting keyphrases closest to the main topic of the given document.
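The size of these improvements can be read directly from Table 4. The short computation below uses JointKPE as the point of comparison, which is our choice for illustration since it is the strongest baseline without additional features on OpenKP.

```python
# F1@k on the OpenKP development set, copied from Table 4.
jointkpe = {1: 0.364, 3: 0.391, 5: 0.338}
kiemp = {1: 0.369, 3: 0.392, 5: 0.340}

# Absolute gain of KIEMP over JointKPE at each cutoff k.
gains = {k: round(kiemp[k] - jointkpe[k], 3) for k in kiemp}
print(gains)  # {1: 0.005, 3: 0.001, 5: 0.002}
```

The gain at $k=1$ (0.005) is larger than the gains at $k=3$ (0.001) and $k=5$ (0.002).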

5.3 Case Study

To further illustrate the effectiveness of the proposed model, we present a case study on the keyphrases extracted by different algorithms. Table 6 presents the results of KIEMP without concept consistency matching and of the full KIEMP. The first example shows that our KIEMP model is more inclined to extract keyphrases closer to the central semantics of the input document, which benefits from our concept consistency matching module. The second example shows that the keyphrases extracted by KIEMP without concept consistency matching contain some redundant or meaningless phrases. The main reason may be that KIEMP without concept consistency matching does not measure the importance of phrases from multiple perspectives, which leads to biased extraction. In contrast, the keyphrases extracted by KIEMP all revolve around the main concept of the example document, i.e., “leadership”. This further demonstrates the effectiveness of our proposed model.

6 Conclusions and Future Work

We propose KIEMP, a new approach that estimates the importance of keyphrases from multiple perspectives. Benefiting from the designed syntactic accuracy chunking, information saliency ranking, and concept consistency matching modules, KIEMP can extract keyphrases fairly. A series of experiments demonstrates that KIEMP outperforms the existing state-of-the-art keyphrase extraction methods in most cases. In the future, it will be interesting to introduce an adaptive approach in KIEMP to filter out meaningless phrases.

7 Acknowledgments

This work was supported in part by the National Key Research and Development Program of China under Grant 2020AAA0106800; the National Science Foundation of China under Grants 61822601 and 61773050; the Beijing Natural Science Foundation under Grant Z180006; and the Fundamental Research Funds for the Central Universities (2019JBZ110).

References