
TechPat: Technical Phrase Extraction for Patent Mining

Published: 15 June 2023

Abstract

In recent years, due to the explosive growth of patent applications, patent mining has drawn extensive attention and interest. An important issue in patent mining is recognizing the technologies contained in patents, which serves as a fundamental preparation for deeper analysis. To this end, in this article, we make a focused study on constructing a technology portrait for each patent, i.e., recognizing the technical phrases it contains, which can summarize and represent patents from a technical perspective. Along this line, a critical challenge is how to analyze the unique characteristics of technical phrases and illustrate them with definite descriptions. Therefore, we first generate detailed descriptions of the technical phrases found in extensive patents based on different criteria, including various previous works, practical experience, and statistical analyses. Then, considering the unique characteristics of technical phrases and the complex structure of patent documents, such as multi-aspect semantics and multi-level relevances, we further propose a novel unsupervised model, namely TechPat, which can not only automatically recognize technical phrases from massive patents but also avoid the need for expensive human labeling. After that, we evaluate the extraction results from various aspects. Specifically, we propose a novel evaluation metric called Information Retrieval Efficiency (IRE) to quantify the performance of extracted technical phrases from a new perspective. Extensive experiments on real-world patent data demonstrate that the TechPat model can effectively discriminate technical phrases in patents and greatly outperform existing methods. We further apply the extracted technical phrases to two practical application tasks, namely patent search and patent classification, where the experimental results confirm the wide application prospects of technical phrases. Finally, we discuss the generalization ability of our proposed methods.

    1 Introduction

According to statistics from the World Intellectual Property Organization (WIPO), patent applications, which contain rich innovative ideas, keep growing rapidly worldwide, with the number reaching 3.3 million in 2020. Indeed, the explosive growth of patents creates a valuable base of data for revealing the inner laws of innovation through patent analysis [27, 43, 64, 70, 74], but at the same time puts forward higher requirements on patent mining techniques [53, 73].
As a matter of fact, patent mining is often highly reliant on text analysis, i.e., how to process, organize, and analyze the key information of patent documents [22, 55]. An effective step here is to construct a technology portrait for each patent, that is, to identify the technical phrases [61] it involves, which greatly aids in summarizing its key information from a technical perspective. Generally speaking, technical phrases refer to phrases that are closely related to specific technologies. For example, a given patent may contain "wireless communication" and "multiplex communication". These technical phrases indicate that the patent is closely related to the "electric network" domain, and their combination can be seen as a technology portrait tagging this patent (a clearer and more detailed description of technical phrases can be found in Section 3.2). To better understand the strength of technical phrases, we present a comparison between technical and non-technical phrases in Table 1. From this table, we can observe that technical phrases differ substantially from non-technical phrases: the former contain rich technical information and may represent a certain technology (e.g., wireless communication), while the latter, relatively speaking, tend to have a more general or common meaning (e.g., wire and cable). Accordingly, compared with non-technical phrases, technical phrases can reveal and represent the technologies contained in patent documents, thereby providing a vital basis for patent mining. Therefore, how to automatically extract these technical phrases from massive patent documents to construct the technology portrait is a meaningful research issue.
Table 1. A Comparison between Technical Phrases and Non-Technical Phrases

Domain                 | Technical phrase                                                           | Non-technical phrase
Electricity            | wireless communication, video encoding, netcentric computer service       | wire and cable, TV signal, power plug
Mechanical Engineering | fluid leak detection, power transmission, rotational speed control system | building materials, steering column, seat back
To the best of our knowledge, there have been few works specifically designed for technical phrase extraction, although some relevant works on phrase extraction have been explored. According to the extraction target, these can be divided into three categories: key phrase extraction [58], Named Entity Recognition (NER) [20], and concept extraction [21]. In more detail, key phrase extraction aims to extract phrases that provide a concise summary of a document [58], with preference given to those that are both frequently occurring and close to the main topics. NER [20] focuses on locating and classifying named entities into pre-defined categories and pays more attention to whether the given phrases are real entities. For its part, concept extraction [21] is somewhat similar to technical phrase extraction and aims to find words or phrases describing a concept in massive texts. However, it is worth noting that a concept here is not equivalent to a technical phrase, as some phrases like "user preference" and "reproductive age" are indeed concepts but are not our focus (i.e., technical phrases). To summarize, although the extraction targets of these works vary, most of them ignore the technical information contained in phrases, which is the key attribute of technical phrases.
Unfortunately, many technical and domain challenges are inherent in designing and implementing an effective technical phrase extraction system in the patent field. First, as the technical meaning in patent documents is difficult to quantify, technical phrases exhibit subtle and elusive characteristics. Two similar phrases may sometimes have completely different implications. For example, "support vector machine (SVM)" is a technical phrase indicating a classification algorithm, while "support machine" is not. Second, technical phrases often appear at different levels of one patent (i.e., "Title", "Abstract", and "Claim"), and these levels of the multi-level patent structure are strongly connected in describing a common technology target. Therefore, how to combine the information from different levels and effectively utilize their relations is another key challenge for recognizing technical phrases in patents. Third, in the text of each level, there are often phrases that describe various aspects of the content, especially in long texts. For instance, as shown in Figure 1, in a patent on "Computer Network", there exist phrases describing the "Transmission" aspect (such as "uplink transmission" and "media stream transmission"), as well as phrases concerning the "Computing" aspect (such as "cloud computing" and "parallel computing"). We refer to this as the multi-aspect semantics structure, which reveals the distribution of numerous phrases in the text and thus provides an entry point for technical phrase recognition.
Fig. 1. The multi-aspect semantics structure in text.
To achieve the primary goal of extracting technical phrases while addressing the first two challenges, our preliminary work [30] proposed an Unsupervised Multi-level Technical Phrase Extraction (UMTPE) model, which explores both the statistical and semantic characteristics of technical phrases and the multi-level structure in patent data. Specifically: (1) we first analyze the key characteristics of technical phrases in patent documents and provide a clear description of them. Then, we design several measurement indicators for technical phrases from statistical and semantic perspectives, which enable us to recognize technical phrases accurately and comprehensively. (2) Considering the relations between different levels in patents, we design components (i.e., Topic Generation, Topic Relevance) to relate adjacent levels, which extensively utilize the information implied in the multi-level structure. With the help of these designs, UMTPE can extract technical phrases from numerous patent documents; however, it neglects the multi-aspect semantics structure in patent texts.
In this article, to better mine the semantics structure in patent texts and improve extraction performance, we develop an enhanced extension of UMTPE, namely TechPat, in which we refine the analysis of technical phrases and patent data and further incorporate the multi-aspect semantics structure into our modeling process. To be more specific, in the TechPat model, we propose the multi-aspect graph to characterize the relations among different phrases in the text, which better matches the real distribution of phrases. Then, we revise the measurement indicators in UMTPE accordingly and design a novel ranking algorithm to help select technical phrases from a pre-generated candidate phrase pool. In this way, TechPat can more accurately recognize technical phrases in massive patent documents.
After the extraction process, it is still a non-trivial task to comprehensively evaluate the results, especially for specialized extraction tasks such as ours. In this article, to supplement traditional evaluation metrics and improve evaluation confidence, we propose a novel metric called Information Retrieval Efficiency (IRE) to evaluate the extracted technical phrases from the perspective of representation ability. Extensive experiments on real-world patent data demonstrate that our proposed methods can effectively discriminate technical phrases in patents and greatly outperform existing baselines. Finally, we apply the extracted technical phrases to two practical application tasks, namely patent search and patent classification, the results of which confirm the application prospects of technical phrases.
Although our proposed methods focus on technical phrase extraction in patent documents, their designs are more general and can be transferred to other types of technical documents. We describe how to apply our methods to scientific article data and discuss their generalization ability to further technical documents.
    Overview. The remainder of this article is organized as follows. In Section 2, we briefly introduce some related works of our study. In Section 3, we introduce some preliminaries pertaining to both patent data and technical phrases, and further present the problem statement and our solution overview. Section 4 contains the details of our proposed TechPat model. In Section 5, we specify two application tasks of our proposed technical phrases. Section 6 presents the experimental results. After that, we further discuss the generalization ability of our proposed methods to more technical documents in Section 7. Finally, conclusions are given in Section 8.

    2 Related Work

    To the best of our knowledge, few existing works have been directly designed to extract technical phrases from patents. However, some relevant works can still be identified, including key phrase extraction, NER and concept extraction.
Key Phrase Extraction. Key phrase extraction aims to extract phrases that provide a concise summary of a document. It has been widely studied in data mining, with both supervised [1, 7, 26, 37, 38, 49, 58, 78] and unsupervised methods [2, 5, 23, 24, 31, 46, 47, 50, 59, 72]. On the one hand, supervised methods often aim to train a complicated model with the help of labeled data or external knowledge bases. For instance, Meng et al. [37, 38] designed an encoder-decoder framework to generate key phrases from the original text. Ahmad et al. [1] designed a novel transformer-based architecture to mine key phrases in long documents, which can extract and generate key phrases from the text simultaneously. Shang et al. [26, 49] proposed an approach that adaptively recognizes phrase occurrences based on quality estimation, which relies to some extent on external knowledge bases (e.g., Wikipedia). On the other hand, unsupervised methods focus on mining the inner connections in documents in response to the lack of labeled data. Bellaachia et al. [2] designed an improved ranking algorithm based on PageRank and TextRank to evaluate the importance of words in documents, which they used to formulate key phrases. Liang et al. [24] proposed to utilize pretrained embeddings to represent both the document and candidate phrases, after which a ranking mechanism considering both local and global similarities in the context is conducted to select key phrases.
Named Entity Recognition. NER focuses on locating and classifying named entities into pre-defined categories and is often regarded as a sequence labeling problem. In the early stages, researchers applied Conditional Random Field (CRF), SVM, and perceptron models with hand-crafted features [11, 19, 32]. In recent years, as deep learning has rapidly developed, NER has tended to be tackled with Recurrent Neural Networks (RNNs) and attention mechanisms [14, 20, 25, 33, 45, 62, 63, 66, 76, 77]. For example, Chiu et al. [33] proposed a hybrid BiLSTM-CNNs-CRF architecture to locate named entities in the original text. Lin et al. [25] put forward the "entity trigger" to improve the traditional LSTM&CRF framework, which increased the model's interpretability and substantially reduced the need for labeled data. On the basis of the LSTM&CRF framework, Wu et al. [62] proposed an automatic annotation method via quote marks, which can help detect entities in Chinese chatbot conversation logs without supervision. Zhou et al. [66] proposed to treat NER as a multi-class classification problem over word pairs and designed a multi-head self-attention mechanism to mine word-level correlations for each entity type.
Some pretrained models have also been developed for this task [8, 17, 34]. For example, Honnibal et al. [17] released Spacy, a library for NER, noun phrase chunking, and other annotation tasks, which achieves good time efficiency and robustness.
Concept Extraction. Concept extraction aims to find words or phrases describing a concept in massive texts and has been studied extensively in previous works [12, 21, 41, 56, 60, 69, 71, 75]. Generally speaking, concept extraction can also be formulated as a sequence labeling problem. Traditional methods often adopt a generation and selection mechanism, which consists of two steps: (1) extracting candidate concepts via hand-crafted rules or syntactic pattern matching; (2) selecting target concepts based on supervised or unsupervised methods. For instance, Li et al. [21] utilized a range of models to generate possible concepts and then designed a novel architecture to evaluate the fitness of extracted concepts relative to the original text. Recently, with the help of strong computing power, deep learning methods have achieved remarkable performance. For example, Yang et al. [69] designed a transformer-based model to directly extract concepts from massive texts. Fang et al. [12] proposed a Guided Attention Network, in which three additional supervision signals are introduced to explore the structured information in raw text, achieving good performance and learning efficiency.
In summary, the above studies focus on their respective target phrases and cannot be transferred directly to technical phrase extraction. First, supervised methods are unsuitable for our task, as there are insufficient labeled technical phrases in massive patent documents. Second, unsupervised approaches are often sensitive to the extraction target, meaning that certain gaps exist between technical phrase extraction and the existing methods. Moreover, the characteristics of patent data are another consideration: the technology relations between different levels in patents (i.e., "Title", "Abstract", and "Claim") provide opportunities to aid technical phrase recognition in patent documents, but cannot be effectively captured by the existing models.

    3 Overview

    In this section, we first introduce the patent data and analyze the multi-aspect semantics in patent text. Then, based on expert experience and statistics, we provide a clear description of technical phrases in patents. Finally, we present the problem statement of technical phrase extraction and specify our solution overview.

    3.1 Patent Data

The patent data we use is provided by the United States Patent and Trademark Office (USPTO) and comprises two domains, i.e., Mechanical Engineering and Electricity. Each patent has a multi-level structure, i.e., "Title", "Abstract", and "Claim", where "Title" and "Abstract" depict the topic and a brief summary of the patent, while "Claim" is a more detailed and lengthy description of the inventor's rights.
Multi-aspect Semantics. As noted in Section 1, at each level of a patent, there are often phrases that describe various aspects of the content, especially in long texts. To facilitate better analysis and illustration, we provide an example of phrases in "Abstract" and "Claim" in Figure 2. This figure reveals the distribution of all phrases in semantic space; each node represents a phrase, while the color indicates the semantic aspect to which the phrase belongs. We can easily observe that phrases in the same aspect often gather together in a sub-semantic space and are much more closely related to each other than to others. We refer to this phenomenon as the multi-aspect semantics structure in the patent text, which provides a vital basis for modeling the technical phrase recognition process.
Fig. 2. (a) Phrases of "Abstract" in semantic space. (b) Phrases of "Claim" in semantic space.

    3.2 Description of Technical Phrase

In this subsection, we hire four experts to manually extract technical phrases from 100 patents in each of the two domains, i.e., Mechanical Engineering and Electricity. After examining the technical phrases extracted from these patents, we can make several specific observations:
    (1)
    Part of Speech. Although the part of speech distribution of technical phrases shows various types, most of them are noun phrases. According to the statistics of extracted phrases, noun phrases account for more than 90%.
    (2)
    Number of Words. As Figure 3 shows, the lengths of technical phrases in different domains are slightly different; however, most of them comprise 2 \(\sim\) 4 words, sometimes reaching 5.
    (3)
    Semantic Context. In a patent document, there often exist similar technical phrases, such as “image encoding” and “image decoding”. It is easy to understand that technical phrases occurring in the same context will be relatively more similar to each other in semantics. Besides, technical phrases are expected to have a relatively independent technical meaning. While some phrases like “system architecture” also frequently occur in conjunction with technical phrases, these are not our focus as they have no specific technical meaning.
    (4)
Local Occurrence. At each level of a patent, technical phrases often appear more than once, especially in long texts, which can be seen as local occurrence. For example, of the technical phrases extracted from "Claim", over 70% appear in the text at least twice.
    (5)
    Global Occurrence. In the same patent document, a common technical phrase tends to appear repeatedly across different levels. That is to say, their global occurrence in the multi-level structure may provide some insights for aiding technical phrase recognition. In order to verify this point, a focused analysis is conducted in the following.
Fig. 3. Number of words in technical phrases.
Figure 4(a) illustrates the average number of technical phrases at different levels. As we can see, the number of technical phrases increases rapidly from "Title" to "Claim" on both datasets. Figure 4(b) shows the average ratio of the number of technical phrases to the number of words at different levels. From "Title" to "Claim", this ratio drops significantly, indicating that more and more non-technical phrases emerge, which greatly increases the difficulty of recognizing technical phrases. This clearly reveals that although technical phrases become increasingly abundant from top to bottom in the multi-level structure, the extraction difficulty rapidly increases as the interfering factors grow.
Fig. 4. (a) The average number of technical phrases in "Title", "Abstract", and "Claim". (b) The average ratio of the number of technical phrases to the number of words in different levels.
Meanwhile, we find that over 35% of "Abstract"s share at least one technical phrase with their "Title"s, and this percentage rises to 80% between "Claim"s and "Abstract"s. In other words, the technical phrases of the current level (e.g., "Title") may play a guiding role in the technical phrase extraction of the next level (e.g., "Abstract"). We can therefore use the phrases extracted at the current level to help guide extraction at the next level, which formulates a multi-level model architecture and effectively utilizes the information between different levels.
Moreover, existing patent classification systems can be an initial driving force for technical phrase recognition, for example, the Cooperative Patent Classification Group (CPC Group), whose descriptions (Table 2) are highly relevant to technologies. Although both the quantity and quality of CPC Group descriptions are limited, we can still regard them as prior knowledge of technical phrases, which can help guide the extraction process at the first level ("Title") of the patent.
Table 2. CPC Group Examples

CPC Group | Description
H02J      | systems for storing electric energy
H04H      | broadcast communication
H04J      | multiplex communication
H04W      | wireless communication networks

    3.3 Problem Statement

    Based on the multi-level structure of patent documents, we attempt to extract technical phrases level by level. The extracted phrases in the current level will be seen as the prior knowledge for guiding the next level, while CPC Group descriptions can be seen as the initial level.
    In more detail, for each level of a patent (i.e., “Title”, “Abstract”, and “Claim”), technical phrase extraction is formulated as a generation and selection problem [16]. That is to say, given the word sequence of a patent document \(\boldsymbol {x} = (x_1, x_2, \ldots , x_n)\) , we first build a large-scale candidate pool \(\boldsymbol {Y} = \lbrace y_i, i = 1, 2, \ldots \rbrace\) , where \(y_i = (x_m, x_{m+1}, \ldots , x_{m+l-1})\) is a possible technical phrase, n represents the length of the patent document, m indicates the starting location of the candidate phrase, while l represents this candidate phrase’s length. Next, from the candidate pool \(\boldsymbol {Y}\) , we design a score and rank mechanism to select the final technical phrases. For convenience, we refer to the extracted phrase list in a certain level as \(P_{level}\) , such as \(P_{title}\) . Finally, with the technical phrases extracted from “Title”, “Abstract”, and “Claim”, we can obtain the technical phrase set \(P_{all} = \lbrace P_{title}, P_{abstract}, P_{claim}\rbrace\) for each patent.

    3.4 Solution Overview

Our solution overview is shown in Figure 5. Specifically, based on the existing CPC Group descriptions and patent documents organized in a multi-level structure ("Title", "Abstract", and "Claim"), we propose our TechPat model. TechPat mines the relations between different levels in the patent documents and follows a generation-and-selection process to recognize technical phrases, which will be introduced in detail in Section 4. After the extraction of technical phrases, we further apply them to two practical application tasks, i.e., searching for relevant patents in the patent database and classifying patents into given categories, which demonstrates the effectiveness and application prospects of technical phrases.
Fig. 5. The overview of our solutions.
    In the following, we will specify the modeling process of our proposed TechPat model.

    4 The TechPat Model

In this section, we introduce the technical details of the TechPat model. As Figure 6 shows, our model deals with patent data level by level, which means we use the CPC Group descriptions or the predicted results of the current level to guide extraction at the next level. At each level, the TechPat model contains five modules: Topic Generation, Candidate Generation, Candidate Graph Construction, Candidate Score, and Technical Phrase Recognition.
Fig. 6. The TechPat framework. The left part illustrates the level-by-level architecture, while the right part describes the model architecture at each level in detail.
Model Overview. As illustrated in Figure 6, in the Topic Generation part, we utilize the extracted results of the previous level to generate several topic centroids in embedding space, which guide extraction at the current level. The Candidate Generation part then builds a large-scale candidate pool from the documents, and the Candidate Graph Construction part further formulates a multi-aspect graph from these candidate phrases. On this multi-aspect graph, dedicated scoring and ranking mechanisms are implemented in the Candidate Score and Technical Phrase Recognition parts, after which we obtain the final technical phrases. The remainder of this section describes our model in detail.

    4.1 Topic Generation

As discussed in Section 3.2, CPC Group descriptions can be seen as the initial prior knowledge of technical phrases to guide the extraction. From this perspective, we first map the content of the CPC Groups into the embedding space. In this article, we use the pretrained model bert-as-service [10, 65] to obtain the representation of each phrase. After that, we cluster the representations into a few centroids, the goal of which is to find several topics of technical phrases. These topic centroids are then utilized to guide phrase extraction in "Title". Subsequently, we select technical phrases with high confidence from the extracted results in "Title" and perform the same operations on "Abstract", which forms a multi-level structure (as illustrated in Figure 6).
Rather than focusing on a particular patent, topic generation concerns all CPC Group descriptions or high-confidence technical phrases of a certain level across the whole dataset. This design overcomes the effect of a few bad cases and improves the robustness of topic generation. As for the choice of clustering method, we use a hierarchical clustering method called Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) [35, 36], which outperforms traditional clustering algorithms in terms of both accuracy and stability and allows us to find suitable clustering centroids.
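To make this step concrete, the following is a minimal sketch of topic generation, assuming a running bert-as-service server and the hdbscan package; the function and variable names are illustrative rather than taken from our released code:

```python
import numpy as np
import hdbscan
from bert_serving.client import BertClient

def generate_topics(phrases, min_cluster_size=3):
    """Embed prior phrases (e.g., CPC Group descriptions) and cluster them
    into topic centroids that guide extraction at the next level."""
    bc = BertClient()                     # connects to the running BERT server
    embeddings = bc.encode(phrases)       # one vector per phrase
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(embeddings)
    centroids = []
    for label in set(labels):
        if label == -1:                   # HDBSCAN marks noise points as -1
            continue
        centroids.append(embeddings[labels == label].mean(axis=0))
    return np.stack(centroids)

# e.g., topic_centroids = generate_topics(cpc_descriptions, min_cluster_size=3)
```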

    4.2 Candidate Generation

Phrase extraction tasks often face the dilemma that there are no authoritative tools for generating phrases from raw texts, especially for a newly defined extraction object like technical phrases. To improve the completeness of the extracted phrases, this module constructs a large-scale candidate phrase pool via various phrase extraction methods. In more detail, we first generate candidate phrases using different methods. Next, after filtering out single words and removing duplicates, we merge the phrases from these methods to obtain the candidate pool for every patent document (a minimal sketch follows the method list below). These phrase extraction methods are as follows:
    DBpedia [8]. This is a tool for automatically annotating mentions of DBpedia resources in texts. With the help of the external database, phrases or entities labeled by DBpedia are of high quality and confidence.
    Spacy [17]. This is a library for advanced NLP tasks that provides phrase extraction tools. We use the entity and noun phrase chunking parts to generate candidate phrases.
    Noun Phrase Extraction. As noted in Section 3.2, most technical phrases are noun phrases. In order to avoid missing some candidates, we extract more noun phrases to complement this pool using grammar tagging [3].
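As a minimal illustration, the following sketch merges Spacy's entities and noun chunks into a candidate pool (the DBpedia annotations and extra noun phrases would be merged in the same way); it assumes the en_core_web_sm model is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def generate_candidates(text):
    """Build a candidate pool from entities and noun chunks."""
    doc = nlp(text)
    pool = set()
    for span in list(doc.ents) + list(doc.noun_chunks):
        phrase = span.text.lower().strip()
        if len(phrase.split()) >= 2:   # filter out single words
            pool.add(phrase)           # the set removes duplicates
    return sorted(pool)
```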

    4.3 Candidate Graph Construction

    The interconnections of these candidate phrases in a local and global context play an important role in mining the differences and relations among candidates, which are crucial for technical phrase discrimination. Considering this and the multi-aspect semantics structure in patent texts, we develop a multi-aspect graph to model the relation between phrases in the text, and further propose a novel ranking algorithm in Section 4.5. In more detail, as illustrated in Figure 7, we construct a multi-aspect graph \(G = (V, S, E, W)\) based on all candidates; here each node \(v_i \in V\) represents a candidate phrase, each subgraph \(s_i \in S\) contains the candidates in the same semantic aspect, and each edge \(e_{ij} \in E\) represents the relation between nodes \(v_i\) and \(v_j\) in a subgraph. Finally, the weight \(w_{ij} \in W\) indicates the relation degree of edge \(e_{ij}\) .
Fig. 7. Multi-aspect graph example.
    To define the subgraph in the multi-aspect graph, we utilize the HDBSCAN clustering algorithm [35, 36] to divide all candidates into several semantic aspects, each of which can be seen as a subgraph. In a given subgraph \(s_k\) , we define the weight \(w_{ij}\) of edge \(e_{ij}\) as the cosine similarity between candidates \(v_i\) and \(v_j\) in embedding space. We can also calculate the central point \(C(s_k)\) of subgraph \(s_k\) by averaging the embedding of all candidates it contains. After the multi-aspect graph is constructed, we will score every candidate from three perspectives (i.e., statistical, inter-level, and semantic), as outlined in the next subsection.
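Under the same embedding and clustering assumptions as in Section 4.1, a sketch of the graph construction could look as follows; the graph is kept as plain numpy arrays rather than a dedicated graph library, and the min_cluster_size value here is illustrative:

```python
import numpy as np
import hdbscan

def build_multi_aspect_graph(embeddings, min_cluster_size=2):
    """Cluster candidates into semantic aspects (subgraphs), compute cosine
    edge weights inside each subgraph, and the central point of each subgraph."""
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(norm)
    weights = norm @ norm.T            # cosine similarity for every phrase pair
    subgraphs, centers = {}, {}
    for label in set(labels) - {-1}:   # -1 marks noise points
        members = np.where(labels == label)[0]
        subgraphs[label] = members
        centers[label] = norm[members].mean(axis=0)   # central point C(s_k)
    return labels, weights, subgraphs, centers
```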

    4.4 Candidate Score

In this subsection, we score every candidate phrase in the candidate graph from statistical, inter-level, and semantic perspectives. These scores comprehensively measure the possibility that a candidate node is a technical phrase.

    4.4.1 Statistical Measurement Indicators.

    With the observations in Section 3, we design two intuitive statistical measurement indicators to restrict the scope from which we select technical phrases.
    Self Length is a simple indicator to count the number of words in the phrase. From the analysis results in Figure 3, we can observe that most technical phrases are composed of 2 \(\sim\) 4 words, and sometimes the number reaches 5. According to this finding, we define Self Length as follows:
    \(\begin{equation} \boldsymbol {SL_i}=\left\lbrace \begin{array}{rcl} 1 & & {len(v_i) = 2,3,4},\\ 0.5 & & {len(v_i) = 5},\\ 0 & & otherwise, \end{array} \right. \end{equation}\)
    (1)
    where \(len(v_i)\) represents the number of words in the candidate phrase \(v_i\) .
Influence Sphere measures the influence scope of a candidate phrase. At each level of a patent, technical phrases often appear in more than one sentence, as they are crucial for relating different parts of a paragraph, especially in long texts like "Claim". From this perspective, we define Influence Sphere as the number of sentences in the current document that include the candidate phrase:
    \(\begin{equation} \boldsymbol {IS_i} = \sum _k \mathbb {I}(v_i \in sentence_k). \end{equation}\)
    (2)
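In code, the two statistical indicators are straightforward; the sentence splitting below is deliberately naive and for illustration only:

```python
def self_length(phrase):
    """Self Length, Eq. (1): favor phrases of 2-4 words."""
    n = len(phrase.split())
    if 2 <= n <= 4:
        return 1.0
    return 0.5 if n == 5 else 0.0

def influence_sphere(phrase, document):
    """Influence Sphere, Eq. (2): sentences containing the phrase."""
    sentences = document.lower().split(". ")
    return sum(phrase.lower() in s for s in sentences)
```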

    4.4.2 Inter-Level Measurement Indicators.

    On the basis of the multi-level structure of patent data, we observe that the relations between adjacent levels also play a crucial role in the recognition of technical phrases. To better utilize this characteristic, in this part, we design two inter-level indicators from both explicit and implicit perspectives.
Explicit Topic measures the extent of overlap between candidate phrases and the technical phrases extracted at the last level. In other words, if a candidate has been recognized as a technical phrase at the last level, it is much more likely to be a technical phrase at the current level. We define Explicit Topic as follows:
    \(\begin{equation} \boldsymbol {ET_i}=\left\lbrace \begin{array}{rcl} 1 & & {v_i \in R},\\ 0 & & otherwise, \end{array} \right. \end{equation}\)
    (3)
    where R represents the extracted technical phrase list of the last level in the patent.
Implicit Topic focuses on the degree of relevance between candidate phrases and the existing technical topics from the last level. These technical topics are generated in the Topic Generation part and represent the technical centers in embedding space. A high Implicit Topic score means the candidate is more closely associated with a specific technology topic. We define it as the largest cosine similarity between the candidate and the topic centroids in embedding space:
    \(\begin{equation} \boldsymbol {IT_i} = \max _{k} \cos (v_i, Topic_k). \end{equation}\)
    (4)
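Both inter-level indicators translate directly into code; the phrase list and embeddings are assumed to come from the earlier steps:

```python
import numpy as np

def explicit_topic(phrase, last_level_phrases):
    """Explicit Topic, Eq. (3): membership in the last level's list R."""
    return 1.0 if phrase in last_level_phrases else 0.0

def implicit_topic(phrase_vec, topic_centroids):
    """Implicit Topic, Eq. (4): max cosine similarity to a topic centroid."""
    v = phrase_vec / np.linalg.norm(phrase_vec)
    t = topic_centroids / np.linalg.norm(topic_centroids, axis=1, keepdims=True)
    return float(np.max(t @ v))
```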

    4.4.3 Semantic Measurement Indicators.

    Based on the multi-aspect graph and the observations of technical phrase semantics, we design two measurement indicators from the semantic perspective.
Semantic Relation measures the linking ability of technical phrases. In general, similar technologies tend to appear in the same context, such as the closely associated technical phrases "image encoding" and "image decoding". Using the multi-aspect graph, we design Semantic Relation from both subgraph and whole-graph perspectives, which focus on relations with individual phrases and with the whole document, respectively. In the subgraph, we first cut edges with weight smaller than a threshold T and calculate the normalized degree of each candidate node as the subgraph score (Equation (5)). For the whole graph, we regard the central point of each subgraph as a node and conduct the same series of operations on the whole graph to obtain a whole-graph score for each candidate phrase (Equation (6)). After that, we integrate the two scores with a hyper-parameter \(\alpha\) (Equation (7)):
\(\begin{equation} sub_i = \frac{\sum _{j \in s_k, j\ne i} \mathbb {I}(\cos (v_i, v_j) \ge T)}{\sum _{j \in s_k, j\ne i}\mathbb {I}(1)}, \end{equation}\)
    (5)
    where \(v_i, v_j\) are the nodes in the same subgraph \(s_k\) ,
\(\begin{equation} whole_i = \frac{\sum _{k | i \notin s_k} \mathbb {I}(\cos (v_i, C(s_k)) \ge T)}{\sum _{k | i \notin s_k}\mathbb {I}(1)}, \end{equation}\)
    (6)
    where \(C(s_k)\) represents the center point of subgraph \(s_k\) ,
    \(\begin{equation} \boldsymbol {SR_i} = \alpha \cdot sub_i + (1-\alpha) \cdot whole_i. \end{equation}\)
    (7)
Semantic Independence focuses on the independence of technical phrases in the semantic embedding space. As discussed in Section 3, technical phrases also need to have a relatively independent meaning. For example, "system architecture" often occurs alongside technical phrases (e.g., "image processing system architecture"), but it does not have an independent meaning and is not a technical phrase. For each node, we also design Semantic Independence from both subgraph and whole-graph perspectives, which emphasize the differences from individual phrases and from the whole document, respectively. In the subgraph, we calculate the node's smallest cosine distance to the other nodes (Equation (8)). In the whole graph, we calculate its smallest cosine distance to the central point of each other subgraph (Equation (9)). Subsequently, with a hyper-parameter \(\beta\), we integrate the subgraph score and the whole-graph score to obtain the final Semantic Independence (Equation (10)):
\(\begin{equation} sub_i = \min _{j \in s_k, j\ne i} (1 - \cos (v_i, v_j)), \end{equation}\)
    (8)
\(\begin{equation} whole_i = \min _{k | i \notin s_k} (1-\cos (v_i, C(s_k))), \end{equation}\)
    (9)
    \(\begin{equation} \boldsymbol {SI_i} = \beta \cdot sub_i + (1-\beta) \cdot whole_i, \end{equation}\)
    (10)
where \(v_i, v_j, s_k, C(s_k)\) are the same as in Equations (5)–(7). A low \(\boldsymbol {SI}\) score means that the phrase is relatively common or readily forms a longer phrase with arbitrary words or phrases.
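The two semantic indicators can be computed directly on the multi-aspect graph built above; the sketch below uses pre-normalized embeddings and, for brevity, ignores edge cases such as singleton subgraphs or documents with a single aspect:

```python
import numpy as np

def semantic_scores(i, norm, labels, centers, T=0.5, alpha=0.5, beta=0.5):
    """Semantic Relation (Eqs. (5)-(7)) and Independence (Eqs. (8)-(10))."""
    members = np.where(labels == labels[i])[0]
    others = members[members != i]                     # nodes v_j in the same subgraph
    other_centers = np.stack([c for k, c in centers.items() if k != labels[i]])
    sims_sub = norm[others] @ norm[i]                  # cos(v_i, v_j)
    sims_whole = other_centers @ norm[i]               # cos(v_i, C(s_k)), i not in s_k
    # Semantic Relation: normalized degree after cutting edges below T
    sr = alpha * (sims_sub >= T).mean() + (1 - alpha) * (sims_whole >= T).mean()
    # Semantic Independence: smallest cosine distances
    si = beta * (1 - sims_sub).min() + (1 - beta) * (1 - sims_whole).min()
    return sr, si
```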
    Among these three kinds of measurement indicators, the statistical measurement indicators mentioned above are designed more simply and intuitively; inter-level measurement indicators play a bridging role in relating different levels in patents, while semantic measurement indicators focus on the interconnections of candidate phrases in the multi-aspect graph. Based on these measurement indicators, we can comprehensively evaluate each candidate, and the normalized sum of these scores will be set as the weight \(I(v_i)\) of each node \(v_i\) in the multi-aspect graph.

    4.5 Technical Phrase Recognition

    On the basis of the multi-aspect graph and traditional ranking algorithms [2, 4, 6, 28, 39], we propose the Multi-aspect Tech-Rank (MTR) algorithm, which ranks all candidate phrases from the technical phrase perspective. After that, considering the number of technical phrases in different levels, we further design a truncation strategy for the selection of technical phrases.

    4.5.1 Candidate Rank.

The classic ranking algorithm PageRank [6] was first proposed to rank webpages by measuring their importance to the entire web (graph). In recent years, this ranking algorithm has been applied to text processing, particularly in the key word/phrase extraction field [2, 4, 39]. In this part, we propose the MTR algorithm to rank all candidate phrases from the technical phrase perspective, which effectively utilizes the multi-aspect graph structure (Algorithm 1).
In this algorithm, given the multi-aspect graph \(G = (V, S, E, W)\) and the normalized score \(I(v_i)\) for each node \(v_i \in V\), the goal is to obtain a ranked list of all candidate phrases. First, the nodes are set to a uniform default ranking value (Line 1); subsequently, we update the ranking value list until the algorithm converges or exceeds the maximum number of iterations (Lines 2–5). In more detail, we first calculate a ranking value for each subgraph (Line 3):
    \(\begin{equation} R(s_i) = \sum _{v_j \in s_i} \frac{\cos (v_j, C(s_i)) \cdot R(v_j)}{\sum _j \cos (v_j, C(s_i))}, \end{equation}\)
    (11)
    where \(C(s_i)\) represents the center point of subgraph \(s_i\) , while \(R(v_j)\) represents the ranking value of node \(v_j \in V\) . After that, we update the ranking value list \(R_{list}\) from both local and global perspectives (Line 4):
\(\begin{equation} R(v_i)_{local} = \sum _{j:v_j \rightarrow v_i}\frac{w_{ji}}{\sum _{k:v_j}w_{jk}}R(v_j)\cdot Pena(v_j, v_i), \end{equation}\)
    (12)
    where \(w_{ji}\) represents the weight of edge \(e_{ji}\) between nodes \(v_j\) and \(v_i\) , \(Pena(,)\) is a penalty function and \(Pena(v_j, v_i) = 1 - overlap(v_j, v_i) / num(v_j)\) . In \(Pena(v_j, v_i)\) , \(num(v_j)\) represents the number of words in node \(v_j\) , while \(overlap(v_j, v_i)\) is the number of words appearing in both \(v_j\) and \(v_i\) .
\(\begin{equation} R(v_i)_{global} = \sum _{j:s_j \rightarrow s_i}\frac{w(s_j, s_i)}{\sum _{k:s_j}w(s_j, s_k)}R(s_j) \cdot \cos (C(s_i), v_i), \end{equation}\)
    (13)
    where \(w(s_j, s_i) = \cos (C(s_j), C(s_i))\) , which represents the relation degree between two subgraphs.
    \(\begin{equation} R(v_i) = (1-d)\cdot I(v_i) + d \cdot I(v_i) \cdot ((1-\gamma) \cdot R(v_i)_{local} + \gamma \cdot R(v_i)_{global}). \end{equation}\)
    (14)
In Equation (12), we calculate the local influence passed from nodes in the same subgraph, while Equation (13) computes the global influence from other subgraphs. Moreover, with the damping factor d and harmonic factor \(\gamma\), we obtain the revised ranking value \(R(v_i)\) in Equation (14). Specifically, in the local propagation process (Equation (12)), inspired by [41], we design a penalty mechanism to avoid the damage caused by overlap between different candidates. For example, phrases like "wireless communication system" and "some system" obtain a relatively high relation score due to the overlapping word "system". However, "wireless communication system" is a technical phrase in the Electricity domain while "some system" is certainly not. Therefore, this type of relation is less reliable and will hinder the propagation process. With the penalty function \(Pena(v_j, v_i)\), the relation between two candidates is penalized according to their degree of overlap, which ensures the effectiveness of the propagation.
    After the propagation process is complete, we rank all candidate phrases according to the ranking values and obtain the final ranked phrase list (Line 6).
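For concreteness, one propagation step of MTR can be sketched as follows, reusing the arrays from the graph construction above; the convergence test, the iteration cap, and the handling of clustering noise points are omitted for brevity:

```python
import numpy as np

def pena(phrases, j, i):
    """Pena(v_j, v_i) = 1 - overlap(v_j, v_i) / num(v_j)."""
    words_j, words_i = phrases[j].split(), set(phrases[i].split())
    return 1.0 - sum(w in words_i for w in words_j) / len(words_j)

def mtr_step(R, I, norm, labels, centers, phrases, d=0.85, gamma=0.5):
    # Eq. (11): ranking value of each subgraph
    R_sub = {}
    for k, c in centers.items():
        members = np.where(labels == k)[0]
        sims = norm[members] @ c
        R_sub[k] = float(np.sum(sims * R[members]) / np.sum(sims))
    R_new = np.zeros_like(R)
    for i in range(len(R)):
        members = np.where(labels == labels[i])[0]
        # Eq. (12): local influence from the other nodes in the same subgraph
        local = 0.0
        for j in members[members != i]:
            w_ji = float(norm[j] @ norm[i])
            out_j = float(np.sum(norm[members[members != j]] @ norm[j]))
            local += w_ji / out_j * R[j] * pena(phrases, j, i)
        # Eq. (13): global influence propagated from the other subgraphs
        glob, c_i = 0.0, centers[labels[i]]
        for k, c in centers.items():
            if k == labels[i]:
                continue
            out_k = sum(float(c @ centers[m]) for m in centers if m != k)
            glob += float(c @ c_i) / out_k * R_sub[k] * float(c_i @ norm[i])
        # Eq. (14): damped combination with the indicator score I(v_i)
        R_new[i] = (1 - d) * I[i] + d * I[i] * ((1 - gamma) * local + gamma * glob)
    return R_new
```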

    4.5.2 Candidate Selection.

Given the ranking result of Candidate Rank, we select the top-K candidates as technical phrases, while the candidates with the highest confidence (top-1) are put into a new prior knowledge set to be sent to the next level. As for the setting of K, considering that the contents of different documents vary significantly, we set it according to the number of sentences in the document ( \(N_{sen}\) ). From the labeled data, we calculate the statistical relation between K and \(N_{sen}\) as follows:
    \(\begin{equation} \frac{K}{N_{sen}} \approx \left\lbrace \begin{array}{rcl} 1\sim 2 & & Title,\\ 2 & & Abstract,\\ 1 & & Claim. \end{array} \right. \end{equation}\)
    (15)
    Based on this observation, we set \(K = 2 N_{sen}\) for the patent “Title” and “Abstract”, and \(K = N_{sen}\) for the patent “Claim”. Once this is complete, we can obtain the technical phrase set for each patent.
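The truncation strategy of Equation (15), together with the top-1 feedback to the next level, amounts to a few lines of code (names are illustrative):

```python
def select_phrases(ranked, level, n_sen):
    """Pick top-K phrases; K follows the per-level ratios of Eq. (15)."""
    k = 2 * n_sen if level in ("title", "abstract") else n_sen
    selected = ranked[:k]
    high_confidence = ranked[:1]   # top-1 feeds the next level's prior set
    return selected, high_confidence
```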

    5 Applications

As discussed in Section 1, compared with other phrases, technical phrases carry more essential and distinctive technical information in patents, giving them a great advantage in summarizing and representing patents with limited features. In this section, to further verify the effectiveness and application prospects of technical phrases, we introduce two practical application tasks, i.e., patent search and patent classification.

    5.1 Patent Search

Patent search is the task of finding relevant existing patents, which is an important part of the patent examiner's process of validating a patent application [13, 48, 67, 68]. As the information in the query patent (patent application) is often redundant and complicates the search for relevant patents, existing methods tend to extract salient features to construct an effective query for the search process. From this perspective, technical phrases, which carry much of the distinctive technical information in patents, have a great advantage in constructing short and effective queries.
Following the traditional strategy [67, 68], we formulate patent search as the task of finding the citation patents of a given patent. Specifically, as shown in Figure 8, given a query patent, we first extract the technical phrases from its "Title", "Abstract", and "Claim", and then utilize the extracted technical phrases to compose a query. After that, search algorithms can be applied to find relevant patents in the patent corpus. As our purpose is to verify the advantages of technical phrases, we directly utilize the traditional BM25 algorithm rather than more complex methods to conduct the search, where candidate patents are ranked by their relevance scores. Finally, the top-k patents are taken as the results of the patent search.
Fig. 8. The process of patent search via technical phrases.
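As a minimal sketch, the search step might be implemented with the rank_bm25 package, with whitespace tokenization kept deliberately simple:

```python
from rank_bm25 import BM25Okapi

def search_patents(query_phrases, corpus_texts, top_k=10):
    """Rank corpus patents against a query built from technical phrases."""
    tokenized_corpus = [doc.lower().split() for doc in corpus_texts]
    bm25 = BM25Okapi(tokenized_corpus)
    query = " ".join(query_phrases).lower().split()
    scores = bm25.get_scores(query)
    ranked = sorted(range(len(corpus_texts)), key=lambda i: -scores[i])
    return ranked[:top_k]   # indices of the top-k relevant patents
```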

    5.2 Patent Classification

Patent classification, which aims to classify each patent into one or more categories, is regarded as a basic task in the field of patent management [15, 18, 40, 54]. Over the past few decades, large efforts have been made to deal with this task, and in the process of patent classification, effectively mining informative features (e.g., keywords, phrases) is of vital importance [18, 40, 51, 57]. Naturally, we can utilize the extracted technical phrases to help classify patents, which verifies the effectiveness of our proposed technical phrases.
More specifically, as illustrated in Figure 9, given the patents to classify, we first extract technical phrases from each patent. From a macro point of view, the extracted technical phrases are treated as features that represent the overall patent text. In this respect, we utilize these technical phrases as input features and conduct representation and classification algorithms. Since our focus is on verifying the effect of technical phrases rather than designing complex classification algorithms, we choose the traditional TF-IDF representation and the Linear Support Vector Classification (Linear SVC) algorithm to implement this process. Moreover, considering that one patent may be classified into multiple categories, we adopt the OneVsRest strategy [44] to achieve multi-label patent classification.
Fig. 9. The process of patent classification via technical phrases.
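This pipeline maps naturally onto scikit-learn; in the sketch below, the phrase sets and labels are toy placeholders rather than our real data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy placeholder data: technical phrase sets and category labels per patent.
train_phrase_sets = [["wireless communication", "video encoding"],
                     ["fluid leak detection", "power transmission"]]
train_label_sets = [["Electricity"], ["Mechanical Engineering"]]
test_phrase_sets = [["multiplex communication"]]

# Each example is the set of technical phrases of one patent, joined as text.
X_train = [" ".join(p) for p in train_phrase_sets]
mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(train_label_sets)   # multi-label targets

# TF-IDF features + Linear SVC wrapped in the OneVsRest strategy
clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
clf.fit(X_train, Y_train)

X_test = [" ".join(p) for p in test_phrase_sets]
predicted_labels = mlb.inverse_transform(clf.predict(X_test))
```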

    6 Experiment

In this section, we conduct extensive experiments to demonstrate the effectiveness of our proposed framework and its implementations. Specifically, we first introduce the experimental setup (Section 6.1). Then, we demonstrate our models' effectiveness in extracting technical phrases from three evaluation perspectives (Section 6.2). After that, we provide detailed analyses of our models (Section 6.3) and further discuss the application prospects of technical phrases (Section 6.4).
    Our code is available via https://github.com/liuyeah/TechPat.

    6.1 Experimental Setup

    6.1.1 Datasets.

The experiments are performed on USPTO patent data from two domains, i.e., Mechanical Engineering and Electricity. The former collects patents relating to mechanical engineering, lighting, heating, weapons, and blasting engines or pumps, while the latter relates to the electric field. In more detail, we randomly sample 11k and 84k patents for the Mechanical Engineering and Electricity datasets, respectively. More statistics of our datasets are presented in Table 3.
Table 3. The Statistics of the Datasets

Dataset                | Num. patents | Avg. sentences of Title | Avg. sentences of Abstract | Avg. sentences of Claim
Mechanical Engineering | 11,186       | 1.00                    | 3.85                       | 13.58
Electricity            | 84,069       | 1.00                    | 3.89                       | 16.58

    6.1.2 Implementation Details.

    In this part, we describe the implementation details and parameters of the TechPat model. We run all experiments on a Linux server with two 2.20 GHz Intel Xeon E5-2650 CPUs and four Tesla K80 GPUs.
    Topic Generation Hyperparameters: In the HDBSCAN clustering procedure, we have to set the min_cluster_size, i.e., the minimum size of clusters. According to the size of data at different levels, we set it as 3 for the level of CPC Group, while 100 for others (i.e., “Title” and “Abstract”).
    Candidate Score Hyperparameters: There is a threshold in the Semantic Relation measurement indicator, which we set to \(T = 0.5\) in Equations (5) and (6). In addition, we set the weights \(\alpha = 0.5\) , \(\beta = 0.5\) in Equations (7), (10) to calculate the final semantic indicators.
    Technical Phrase Recognition Hyperparameters: In the MTR algorithm, we set the damping factor \(d=0.85\) according to the traditional propagation algorithm [6, 39]. As for the harmonic factor, we set it to \(\gamma =0.5\) to regulate local and global propagation.

    6.1.3 Comparison Baselines.

    We compare our model with a wide range of state-of-the-art approaches, as described below:
DBpedia [8] and Spacy [17]. In the Candidate Generation part, we introduce two effective phrase extraction models to construct the large candidate pool. These two models are naturally included in our baseline group.
    Autophrase [26, 49] combines the quality estimation and occurrence identification to extract salient phrases from documents, which is domain-independent and free of human labeling.
    Rake [47] proposes extracting key words using graph-based importance measurement. It then uses these words to form key phrases based on their adjacency in text.
    NE-rank [2] first scores every word based on the occurrence frequency in the original text, then uses them to form possible phrases and provide a ranking order.
    ECON [21] aims to extract concept words/phrases based on embedding and probability theory. It utilizes many models to generate possible phrases and designs a novel architecture to evaluate the fitness of extracted concepts to the original text.
    MultipartiteRank [5] considers the topic information and further encodes this as a multipartite graph structure. Upon the multipartite graph, TextRank [39] is conducted to rank candidate phrases according to their importance.
    JMLGC [24] is the state-of-the-art unsupervised phrase recognition method, which is based on the pretrained language model (i.e., BERT [10]) and employs the local and global context in the document. We name it JMLGC in the experiment.
UMTPE [30] is the model proposed in our preliminary work. The hyper-parameters are the same as those described in the original article.
    Among these baselines, some models (e.g., ECON) extract both phrases and words. To facilitate fair comparison, we adopt the same filtering strategy used by our methods (Section 4.2) before evaluation. For baselines that give predicted phrases in a certain ranking order, we select top-K phrases, just like our TechPat model. As for baselines without ranking order, we randomly select K phrases as the extraction results.

    6.2 Experimental Results

    In this subsection, we compare the performance from three evaluation perspectives, i.e., overall performance evaluation, representation ability evaluation, and expert evaluation, from which we can demonstrate the effectiveness of our proposed UMTPE and TechPat models.

    6.2.1 Overall Performance Evaluation.

The technical phrase extraction experiments are carried out on the two complete datasets using the different methods. In this part, to quantify their extraction performance, we calculate three basic evaluation metrics (i.e., Precision, Recall, and F1-score) on 100 patents labeled with technical phrases from each dataset. To be specific, we first utilize the lemmatization tool in the pattern package to obtain the base form of every word in the reference and predicted phrases. Subsequently, we calculate the evaluation results for each level of the patents ("Title", "Abstract", and "Claim"). For a more comprehensive evaluation, we also merge the extracted technical phrases across the three levels to obtain the whole technical phrase set of each patent, on which we calculate the overall evaluation results. The overall performance on the whole technical phrase set is listed in Table 4, while the results for each level are shown in Tables 5 and 6.
Table 4. Overall Performance Evaluation (%)

Method           | Mechanical Engineering         | Electricity
                 | Precision | Recall | F1-score  | Precision | Recall | F1-score
ECON             | 26.70     | 10.43  | 14.01     | 23.76     | 8.19   | 11.35
DBpedia          | 43.13     | 11.49  | 16.80     | 35.08     | 10.29  | 14.99
Autophrase       | 28.18     | 26.83  | 25.47     | 27.49     | 31.83  | 27.27
NE-rank          | 20.01     | 31.05  | 22.81     | 21.53     | 33.23  | 24.11
Rake             | 16.17     | 26.89  | 18.78     | 14.03     | 24.53  | 16.53
Spacy            | 32.42     | 48.83  | 36.41     | 32.37     | 49.27  | 36.20
MultipartiteRank | 37.80     | 51.21  | 40.66     | 36.37     | 49.15  | 38.84
JMLGC            | 34.86     | 48.58  | 37.92     | 37.67     | 50.05  | 39.92
UMTPE            | 37.04     | 54.58  | 41.28     | 38.49     | 54.93  | 41.66
TechPat          | 39.83     | 55.32  | 43.10     | 38.98     | 55.10  | 41.89

The bold fonts represent the optimal results.
Table 5. Overall Performance on Different Levels on the Mechanical Engineering Dataset (%)

Method           | Title                 | Abstract              | Claim
                 | P     | R     | F1    | P     | R     | F1    | P     | R     | F1
ECON             | 15.00 | 9.00  | 10.90 | 18.58 | 7.52  | 9.49  | 18.54 | 7.88  | 10.42
DBpedia          | 14.50 | 9.67  | 11.17 | 34.19 | 12.95 | 17.09 | 33.42 | 10.46 | 14.80
Autophrase       | 8.00  | 4.50  | 5.50  | 24.34 | 13.85 | 15.98 | 23.76 | 28.69 | 23.68
NE-rank          | 34.00 | 38.17 | 35.17 | 22.53 | 36.76 | 25.05 | 14.08 | 19.73 | 14.73
Rake             | 59.00 | 59.00 | 57.47 | 21.89 | 35.72 | 25.26 | 5.10  | 7.01  | 5.08
Spacy            | 61.50 | 58.50 | 58.43 | 37.29 | 53.29 | 40.28 | 22.82 | 28.75 | 22.33
MultipartiteRank | 58.50 | 53.67 | 54.83 | 39.06 | 51.93 | 41.36 | 33.63 | 44.16 | 33.58
JMLGC            | 63.50 | 59.17 | 59.50 | 41.27 | 55.08 | 43.48 | 28.48 | 36.19 | 27.72
UMTPE            | 61.50 | 65.67 | 61.57 | 39.82 | 59.47 | 43.76 | 30.43 | 43.29 | 31.95
TechPat          | 61.00 | 64.67 | 60.90 | 43.01 | 61.71 | 46.10 | 34.28 | 45.32 | 34.14

The bold fonts represent the optimal results.
Table 6. Overall Performance on Different Levels on the Electricity Dataset (%)

Method           | Title                 | Abstract              | Claim
                 | P     | R     | F1    | P     | R     | F1    | P     | R     | F1
ECON             | 7.50  | 6.00  | 6.47  | 18.92 | 5.75  | 7.43  | 16.66 | 6.40  | 8.59
DBpedia          | 14.50 | 11.17 | 12.30 | 28.62 | 11.83 | 15.21 | 28.53 | 10.30 | 14.18
Autophrase       | 18.50 | 11.50 | 13.63 | 28.41 | 25.18 | 22.82 | 22.28 | 32.63 | 23.90
NE-rank          | 27.00 | 28.83 | 27.13 | 23.07 | 33.57 | 22.64 | 15.38 | 26.73 | 17.61
Rake             | 58.00 | 59.58 | 57.10 | 13.57 | 22.72 | 14.79 | 7.40  | 12.24 | 8.56
Spacy            | 62.50 | 62.08 | 60.60 | 32.80 | 43.93 | 32.01 | 25.00 | 39.03 | 27.65
MultipartiteRank | 53.00 | 53.08 | 51.00 | 32.78 | 41.15 | 31.48 | 32.80 | 47.15 | 35.62
JMLGC            | 65.50 | 66.17 | 64.00 | 32.78 | 41.69 | 31.42 | 31.16 | 44.87 | 33.48
UMTPE            | 63.00 | 69.92 | 64.13 | 36.33 | 51.37 | 35.92 | 32.20 | 49.89 | 35.04
TechPat          | 58.00 | 64.83 | 59.23 | 36.70 | 54.16 | 37.11 | 32.91 | 52.39 | 36.53

The bold fonts represent the optimal results.
As we can see from Table 4, our TechPat model together with UMTPE outperforms all baselines on all metrics, except for DBpedia in Precision on the Mechanical Engineering dataset, which proves the effectiveness of the multi-aspect graph and multi-level architecture coupled with the three kinds of measurement indicators. Moreover, although DBpedia achieves excellent Precision, it performs poorly on Recall and F1-score, as it relies entirely on an external database and can only extract a few phrases. Across different levels, we find some interesting phenomena: (1) Rake, Spacy, and JMLGC all attain good performance on "Title"; however, their performance drops substantially on longer texts compared with our methods. This is because the majority of "Titles" are short texts with only one sentence, which greatly reduces the extraction difficulty for these methods. (2) At the same time, Autophrase and MultipartiteRank both achieve more competitive performance on "Claim" than on "Title" and "Abstract". The extraction process of Autophrase relies greatly on the frequency of phrases, while MultipartiteRank proposes a multipartite graph to exploit the topic information in the document, both of which are more suitable for longer documents.
When we examine the difference between TechPat and UMTPE, we can see that TechPat achieves obvious improvements in the overall performance evaluation with the help of the multi-aspect graph structure and the newly revised measurement indicators. As for the detailed performance across different levels, TechPat substantially exceeds UMTPE on "Abstract" and "Claim", but exhibits a slight reduction on "Title". There are two main reasons for this: (1) technical phrases in "Title" are much rarer than at other levels, which means a slight disturbance will greatly affect the results; (2) in TechPat, we strengthen the relation between different levels via the revised inter-level indicators, which leads to effective performance on "Abstract" and "Claim" but may introduce some noise to "Title", as the CPC Group descriptions at the initial level are of limited quality. However, this is acceptable, as our purpose is to obtain the whole technical phrase set of each patent, and a small disturbance at the "Title" level is not the focus.

    6.2.2 Representation Ability Evaluation.

To supplement traditional evaluation metrics, we propose a new metric called IRE to evaluate the predicted technical phrases from the perspective of representation ability. As discussed in Section 1, the combination of technical phrases forms a technology portrait of a patent, which carries its essential and distinctive technical information. From this perspective, technical phrases are expected to have more powerful representation ability than general phrases, so an Information Retrieval (IR) task on patent documents can effectively verify the extraction results.
Therefore, we conduct an IR task on 1,000 patent documents, including 100 patents with labeled technical phrases and 900 randomly selected patents. For every predicted phrase in a document, we use it as a query to rank all documents according to the matching degree.10 If the document from which the phrase was obtained appears in the top-10 retrieved documents, the phrase scores 1; otherwise, it scores 0. We then compute the score of a document by averaging the scores of all its extracted phrases, which comprehensively evaluates the performance of a model on that document.11 However, we notice that if a model extracts only one or two high-quality phrases from a document containing ten technical phrases, the score on this document still tends to be very high. This seriously undermines the metric, because it ignores the completeness of the extracted phrases. Inspired by BLEU [42], an evaluation metric for machine translation that includes a penalty component to counteract the effect of short translations, we introduce a penalty factor PF, as follows:
\(\begin{equation} PF=\left\lbrace \begin{array}{ll} 1 & r \le p,\\ e^{1-r/p} & r \gt p, \end{array} \right. \end{equation}\)  (16)
    where r is the number of reference technical phrases in the document and p is the number of phrases extracted by the model. With PF, we can revise the score for every document:
\(\begin{equation} score_{revise} = PF \cdot score. \end{equation}\)  (17)
    Finally, we average the revised score of these 100 labeled documents to get the final value of IRE. The results on the three levels (“Title”, “Abstract”, “Claim”) are listed in Table 7.
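To make the computation concrete, the following is a minimal sketch of the per-document IRE scoring. Footnote 10 only states that an LSI model is used; the gensim-based ranker and all function and variable names below are illustrative assumptions, not the authors' actual implementation.
import math
from gensim import corpora, models, similarities

def build_lsi_index(tokenized_docs, num_topics=200):
    """Index the 1,000-document corpus (100 labeled + 900 random patents)."""
    dictionary = corpora.Dictionary(tokenized_docs)
    bow = [dictionary.doc2bow(d) for d in tokenized_docs]
    lsi = models.LsiModel(bow, id2word=dictionary, num_topics=num_topics)
    return dictionary, lsi, similarities.MatrixSimilarity(lsi[bow])

def ire_score(doc_id, phrases, n_reference, dictionary, lsi, index):
    """Average top-10 hit scores of one document's phrases, revised by PF."""
    if not phrases:                        # footnote 11: no phrases -> score 0
        return 0.0
    hits = 0
    for phrase in phrases:                 # each predicted phrase acts as a query
        query = lsi[dictionary.doc2bow(phrase.lower().split())]
        top10 = sorted(enumerate(index[query]), key=lambda x: -x[1])[:10]
        hits += int(doc_id in {i for i, _ in top10})
    score = hits / len(phrases)
    r, p = n_reference, len(phrases)
    pf = 1.0 if r <= p else math.exp(1 - r / p)   # penalty factor, Eq. (16)
    return pf * score                             # revised score, Eq. (17)

# The final IRE value is the mean of ire_score over the 100 labeled documents.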
Method           | Mechanical Engineering (Title / Abstract / Claim) | Electricity (Title / Abstract / Claim)
ECON             | 12.38 / 11.29 / 10.96 | 10.72 / 9.62 / 12.52
DBpedia          | 20.44 / 10.41 / 6.15  | 17.19 / 12.69 / 9.51
Autophrase       | 8.98 / 22.93 / 27.83  | 25.55 / 36.34 / 34.94
NE-rank          | 82.62 / 50.60 / 29.35 | 83.21 / 51.58 / 37.29
Rake             | 82.82 / 64.54 / 39.74 | 84.75 / 64.68 / 49.06
Spacy            | 78.61 / 55.05 / 36.30 | 82.56 / 54.56 / 42.17
MultipartiteRank | 70.87 / 56.87 / 40.85 | 76.52 / 54.85 / 46.58
JMLGC            | 75.74 / 56.65 / 38.85 | 81.76 / 54.73 / 47.07
UMTPE            | 82.87 / 55.97 / 45.88 | 87.28 / 55.73 / 49.95
TechPat          | 84.37 / 58.20 / 46.53 | 86.85 / 56.51 / 50.33
Table 7. Representation Ability Evaluation (%)*
* Bold font represents the optimal result, while underlined font represents the sub-optimal result.
From the results in Table 7, we can find that the phrases extracted by our models, i.e., TechPat and UMTPE, together with those from Rake, all have excellent representation ability and outperform the other models. However, the characteristics of the phrases extracted by TechPat, UMTPE, and Rake are quite different. Figure 10 presents the number of words in the phrases predicted by the three models. As we can see, the phrases extracted by TechPat and UMTPE are consistent with the reference phrases in Figure 3, while Rake tends to extract phrases with more words. In general, more words naturally carry more information and thus yield better performance on IR tasks, so the high performance of Rake likely benefits from its lengthy phrases. By contrast, our methods not only extract technical phrases in line with the actual situation, but also outperform Rake on "Title" and "Claim", which indicates that the phrases extracted by TechPat and UMTPE have a great advantage in representing the technical information of patent documents. Meanwhile, in this evaluation, TechPat achieves certain improvements over UMTPE, which is consistent with the overall performance evaluation and proves the effectiveness of our newly designed modules.
Fig. 10. Number of words of phrases extracted by TechPat, UMTPE, and Rake.

    6.2.3 Expert Evaluation.

To prove the effectiveness of our models more clearly, we design a manual evaluation to compare the performance of different models. First, we extract technical phrases from a patent via the different methods. For each model's result, we randomly sample six predicted phrases: one from "Title", two from "Abstract", and three from "Claim".12 We then build an evaluation pool for the patent, which contains the predicted phrases from all ten models. After that, we hire experts to manually label the effective technical phrases in this pool. These picked phrases are regarded as the most likely technical phrases, and we call them the Gold Standard [29, 52].
    Taking the gold standard as ground truth, we compute hit rate (HR) [9] for each method. Here, this metric measures how close the output of a method is to the gold standard and is defined as
\(\begin{equation} HR_i = \frac{|P_i \cap GS|}{|GS|}, \end{equation}\)  (18)
    where \(HR_i\) is the HR of the ith phrase extraction method, \(P_i\) represents the phrases extracted by the ith method, and GS means the phrases picked by the gold standard. We conduct this evaluation on 100 randomly sampled patents and average the results on two datasets, respectively. As shown in Table 8, the results of this evaluation are consistent with the above overall performance evaluation and representation ability evaluation.
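As a concrete reference, a minimal sketch of this computation follows; the lowercasing and whitespace normalization are our own assumptions, added to make the set comparison robust rather than taken from the original protocol.
def hit_rate(predicted_phrases, gold_standard):
    """HR_i = |P_i ∩ GS| / |GS| for one patent and one extraction method."""
    p_i = {ph.lower().strip() for ph in predicted_phrases}
    gs = {ph.lower().strip() for ph in gold_standard}
    return len(p_i & gs) / len(gs) if gs else 0.0

# The reported value averages hit_rate over the 100 sampled patents
# of each dataset.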
HR               | Mechanical Engineering | Electricity
ECON             | 11.41 | 7.26
DBpedia          | 16.83 | 16.73
Autophrase       | 15.75 | 16.28
NE-rank          | 24.79 | 19.19
Rake             | 20.56 | 18.40
Spacy            | 27.01 | 23.70
MultipartiteRank | 29.26 | 24.24
JMLGC            | 28.77 | 24.04
UMTPE            | 32.10 | 27.33
TechPat          | 33.45 | 29.35
Table 8. Expert Evaluation (%)
The bold fonts represent the optimal results.
Among the three evaluations, the overall performance evaluation is the most basic and essential part. The representation ability evaluation uses an IR task to assess the representation ability of the predicted phrases, with a penalty factor designed to account for the completeness of extraction. Moreover, the expert evaluation judges the extracted results against expert decisions, which plays an important role in evaluating such unsupervised tasks. Based on these evaluations from different perspectives, the effectiveness of our proposed methods is extensively verified.

    6.3 Model Analysis

In this subsection, we further analyze the important properties of the TechPat model. More specifically, we discuss them from three aspects: component effectiveness analysis, visualization analysis, and running time analysis.

    6.3.1 Component Effectiveness Analysis.

    In this part, we conduct ablation experiments on the Mechanical Engineering dataset to prove the effectiveness of different components of our model. To be specific, we verify it from two aspects: Structure Design (Figure 11(a)) and Indicator Design (Figure 11(b)).
Fig. 11. Component effectiveness analysis.
Structure Design. In this part, we validate the structure design of our TechPat model. In more detail, we remove the multi-aspect graph design from the model, so that the multi-aspect graph degenerates to the normal fully-connected graph in UMTPE [30] and the MTR algorithm degenerates accordingly [2]. Meanwhile, we also verify the Penalty Mechanism in the MTR algorithm by removing it. From the results in Figure 11(a), the obvious performance drops of both variants prove the effectiveness of the multi-aspect graph and the penalty mechanism.
Indicator Design. In this part, we aim to prove the effect of the measurement indicators (i.e., statistical, inter-level, and semantic) designed in the model architecture (Section 4.4). Specifically, we compare the results extracted by TechPat with no indicators, with each single kind of indicator (statistical, inter-level, or semantic), and with all indicators, respectively. From the results in Figure 11(b), we find that, compared with the no-indicator model, all three kinds of indicators significantly improve the extraction results. Among them, the statistical and inter-level indicators are more effective, as they rely on the intuitive statistical laws or inter-level relations of technical phrases and thus play a "coarse adjustment" role in the recognition, while the semantic indicators, based on the semantic findings in the embedding space, play a "fine adjustment" role. Furthermore, the combination of these indicators (i.e., full TechPat) brings an evident improvement over any single kind of indicator, which proves the necessity of all indicators in our model.
    Through the detailed effectiveness analysis from two perspectives, we thoroughly prove the validity and non-redundancy of our TechPat model.

    6.3.2 Visualization Analysis.

    In this part, we conduct a visualization analysis to further explain and demonstrate the results of the TechPat model.
Considering the effect of illustration, we show the extraction results from one "Title" and "Abstract" in the left part of Figure 12. In this case, "matching phrase" means the phrase is extracted by both the TechPat model and expert labeling, while "reference phrase" and "predicted phrase" denote phrases extracted only by expert labeling or only by the TechPat model, respectively. From this illustration, we can see that our TechPat model accurately recognizes technical phrases in both "Title" and "Abstract", such as "wireless communication". Moreover, the technical phrases in "Title" are included in those of "Abstract", which also verifies the effectiveness of the multi-level design in TechPat.
Fig. 12. Visualization analysis.
    The right part of Figure 12 lists the ranking result of candidate phrases in this “Abstract”. We can easily find that high-scoring phrases are better than low-scoring ones. For example, “wireless communication” is better than “first group” from the perspective of technical phrases. This also proves the effectiveness of our score and rank components.

    6.3.3 Running Time Analysis.

    We conduct the following experiments to understand the efficiency of our proposed models compared with all baselines. To better compare the differences between these methods, we divide them into three categories according to their modeling characteristics:
    (1)
Traditional feature engineering methods (Rake, Autophrase, NE-rank, and MultipartiteRank). They mainly rely on the statistical features of the text to recognize possible phrases.
    (2)
    Database-based or pre-trained methods (DBpedia, Spacy).13 These methods rely heavily on the external database (DBpedia) or train the extraction pipeline well in advance (Spacy).
    (3)
    Embedding-based methods (JMLGC, ECON, UMTPE, and TechPat). This kind of method adopts large-dimension embedding vectors to represent words or phrases for better recognition.
For a fair comparison, we run all of them on the same platform and report their time consumption on the Mechanical Engineering dataset. As illustrated in Table 9, "Total Document" means the time consumed on the whole dataset, while "Per Document" denotes the average time spent on each document. In more detail, "Title", "Abstract", and "Claim" report the time spent on each level, while "Patent" refers to their sum. From the results, we can draw several conclusions. First, for all methods, the running time increases gradually from "Title" to "Claim", which is reasonable as the text length and extraction difficulty grow from top to bottom in the multi-level structure. Second, most methods based on feature engineering, a database, or a pre-trained pipeline are faster than embedding-based methods, because large-dimension embedding vectors significantly increase the computation cost of the latter. Third, among the feature engineering methods, MultipartiteRank performs poorly, as it exploits the topic information in the text and builds a complex directed multipartite graph, which greatly increases its computational burden. Fourth, ECON has the worst efficiency. It identifies possible phrases based on both their individual qualities and their fitness to the whole context; the latter consumes a large amount of time, especially for long documents such as "Abstract" and "Claim". Last, our proposed methods (UMTPE and TechPat) achieve the best efficiency among the embedding-based methods. Note, furthermore, that our methods involve a candidate generation process based on several other methods, such as DBpedia and Spacy.14 To analyze our methods more clearly, we separate the candidate generation time from the recognition time, as illustrated in Table 10.
Category (Method)           | Total Document (Min): Title / Abstract / Claim / Patent | Per Document (s): Patent
(1) Feature Engineering
  Rake                      | 0.017 / 0.108 / 0.524 / 0.649         | 0.003
  Autophrase                | 0.282 / 0.572 / 2.558 / 3.412         | 0.018
  NE-rank                   | 1.549 / 35.782 / 119.944 / 157.275    | 0.844
  MultipartiteRank          | 118.111 / 126.113 / 145.703 / 389.927 | 2.092
(2) Database or Pre-trained
  DBpedia                   | 0.685 / 1.516 / 8.501 / 10.702        | 0.057
  Spacy                     | 3.502 / 10.066 / 56.425 / 69.993      | 0.375
(3) Embedding-based
  JMLGC                     | 24.341 / 35.512 / 140.354 / 200.207   | 1.074
  ECON                      | 10.531 / >200 / >500 / >700           | >4
  UMTPE                     | 5.076 / 21.511 / 134.963 / 161.550    | 0.867
  TechPat                   | 6.063 / 25.735 / 137.847 / 169.645    | 0.910
Table 9. The Running Time on the Mechanical Engineering Dataset of Different Models
Method                 | Total Document (Min): Title / Abstract / Claim / Patent | Per Document (s): Patent
Candidate Generation   | 4.473 / 14.544 / 87.312 / 106.329 | 0.570
UMTPE Recognition      | 0.603 / 6.967 / 47.651 / 55.221   | 0.296
TechPat Recognition    | 1.590 / 11.191 / 50.535 / 63.316  | 0.340
Table 10. The Running Time on the Mechanical Engineering Dataset of Our Models
In this table, Candidate Generation refers to the candidate preparation process introduced in Section 4.2, which is shared by UMTPE and TechPat, while UMTPE/TechPat Recognition covers the remaining modules of each model. From the results, we find that over half of the time is spent on candidate generation, while the recognition process only takes about one third to one half of the total. Even so, our methods still achieve competitive efficiency. Besides, TechPat takes slightly more time than UMTPE as it employs more refined designs, such as the multi-aspect graph structure and the corresponding propagation algorithm; this extra cost is acceptable given the improved extraction performance.
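For reference, the timing protocol behind Tables 9 and 10 can be sketched roughly as follows; the `extract(doc, level)` interface is an assumption standing in for any of the compared methods.
import time

def time_method(extract, docs_by_level):
    """Return minutes per level, total minutes per dataset, and s/patent."""
    minutes = {}
    for level, docs in docs_by_level.items():   # "Title", "Abstract", "Claim"
        start = time.perf_counter()
        for doc in docs:
            extract(doc, level)
        minutes[level] = (time.perf_counter() - start) / 60.0
    total = sum(minutes.values())               # the "Patent" column
    n = len(next(iter(docs_by_level.values()))) # one entry per patent and level
    return minutes, total, total * 60.0 / n     # the "Per Document (s)" column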

    6.4 Application Prospects

    In this subsection, we report the performance on two practical application tasks introduced in Section 5 to demonstrate the effect and application prospects of our work on technical phrases.

    6.4.1 Patent Search.

As noted in Section 5.1, we utilize the extracted technical phrases to construct queries for better patent search results. In the detailed experiments, we randomly sample 100 patents published in 2000 as the query set and use their cited patent sets as the search corpus. We further restrict the query patents to those with at least 20 citations, following the strategy adopted in [67, 68]. The data statistics are listed in Table 11.
Dataset                | Num. Query Set | Num. Patent Corpus
Mechanical Engineering | 100            | 2,818
Electricity            | 100            | 3,312
Table 11. Data Statistics for Patent Search
After the search process, we use \(Recall@100\) [67] to evaluate the patent search performance; a minimal sketch of this computation is given below. To overcome the effect of randomness, we run the search experiment five times and report the mean value of the results. Table 12 lists the experimental results for the two domains (i.e., Mechanical Engineering and Electricity). Our TechPat model, together with UMTPE, shows excellent performance compared with the other baselines, which demonstrates the effectiveness of the extracted technical phrases in searching patents. The improvements of TechPat over UMTPE further confirm the necessity of our newly designed modules.
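The sketch below evaluates one query patent; `rank_corpus` is an assumed black-box retrieval function over the patent corpus, and all names are illustrative.
def recall_at_k(query_phrases, cited_ids, rank_corpus, k=100):
    """Fraction of the query patent's cited set found in the top-k results."""
    query = " ".join(query_phrases)        # at most 10 words, see footnote 5
    retrieved = set(rank_corpus(query)[:k])
    return len(retrieved & set(cited_ids)) / len(cited_ids)

# Reported Recall@100 averages this value over the 100 query patents and
# over the five runs with different random phrase selections.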
Recall@100       | Mechanical Engineering | Electricity
ECON             | 37.79 | 33.83
DBpedia          | 38.89 | 35.80
Autophrase       | 41.72 | 36.65
NE-rank          | 40.68 | 36.30
Rake             | 37.94 | 33.42
Spacy            | 39.76 | 35.60
MultipartiteRank | 41.97 | 37.04
JMLGC            | 41.73 | 37.38
UMTPE            | 42.41 | 37.72
TechPat          | 43.25 | 38.26
Table 12. Patent Search Performance (%)
The bold fonts represent the optimal results.

    6.4.2 Patent Classification.

    As discussed in Section 5.2, we extract technical phrases from each patent document and regard them as input features to drive the classification of patents. We utilize the patent data from USPTO15 to construct the patent classification datasets and the statistics of the datasets are shown in Table 13.
Dataset                | Num. Class | Num. Train | Num. Test
Mechanical Engineering | 5          | 3,956      | 990
Electricity            | 5          | 3,999      | 1,000
Table 13. Data Statistics for Patent Classification
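Given these datasets, a hedged sketch of the classification setup described above (extracted phrases as input features) is shown below; the TF-IDF plus one-vs-rest logistic regression pipeline is our illustrative assumption, since the actual classifier is defined in Section 5.2.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

def train_phrase_classifier(train_phrases, train_labels):
    """train_phrases: list of phrase lists; train_labels: list of label lists."""
    docs = [" ".join(phrases) for phrases in train_phrases]  # footnote 6 caps length
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(train_labels)   # 5 classes per dataset, see Table 13
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
    return vec, mlb, clf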
    After the classification process, we adopt two widely used multi-label classification metrics, F1-score and Hamming Loss, to evaluate the performance:
\(\begin{equation} F_1(y_s, \hat{y_s}) = \frac{1}{|S|}\sum _{s \in S} \frac{2 \cdot |y_s \cap \hat{y_s}|}{|\hat{y_s}| + |y_s|}, \end{equation}\)  (19)
\(\begin{equation} HammingLoss(y_s, \hat{y_s}) = \frac{1}{|S|}\sum _{s \in S} \frac{xor(y_s, \hat{y_s})}{|L|}, \end{equation}\)  (20)
where S represents the test set, \(y_s\) is the reference label set for sample s, \(\hat{y_s}\) is the predicted label set for s, and \(|L|\) is the number of categories in this multi-label problem. A more effective classification method is expected to achieve a higher F1-score and a lower Hamming Loss.
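To make the two metrics concrete, here is a direct NumPy rendering of Eqs. (19) and (20); the binary label-indicator matrices (one row per test sample, one column per class) are our own assumption for illustration.
import numpy as np

def sample_f1(y_true, y_pred):
    """Eq. (19): per-sample F1, averaged over the test set S."""
    inter = (y_true & y_pred).sum(axis=1)
    denom = y_true.sum(axis=1) + y_pred.sum(axis=1)
    return float(np.mean(np.where(denom > 0, 2 * inter / np.maximum(denom, 1), 0.0)))

def hamming(y_true, y_pred):
    """Eq. (20): fraction of the |S| x |L| label positions that disagree."""
    return float(np.mean(y_true ^ y_pred))

# On 0/1 integer matrices, these agree with sklearn.metrics.f1_score with
# average="samples" and with sklearn.metrics.hamming_loss, respectively.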
    In order to overcome the effect of randomness, we run this classification experiment five times and report the mean value of the results. Table 14 reports the performance of different phrase extraction methods. We could find that our proposed UMTPE and TechPat methods achieve the best performance,16 which is consistent with the results of patent search.
Method           | Mechanical Engineering (F1-score / Hamming Loss) | Electricity (F1-score / Hamming Loss)
ECON             | 55.33 / 22.64 | 57.83 / 16.58
DBpedia          | 60.27 / 19.92 | 61.97 / 15.68
Autophrase       | 61.91 / 18.99 | 64.89 / 15.03
NE-rank          | 63.18 / 18.83 | 66.21 / 14.29
Rake             | 57.60 / 21.08 | 61.93 / 15.42
Spacy            | 60.94 / 19.68 | 65.53 / 14.62
MultipartiteRank | 63.83 / 18.60 | 67.29 / 14.09
JMLGC            | 61.11 / 19.82 | 67.01 / 14.34
UMTPE            | 64.30 / 18.07 | 67.82 / 13.56
TechPat          | 65.00 / 17.75 | 68.70 / 13.30
Table 14. Patent Classification Performance (%)
The bold fonts represent the optimal results.
    To summarize, in this subsection, we conduct two practical application tasks to verify the effect of technical phrases. These promising results provide a guarantee for the wider application prospects of technical phrases.

    7 Generalization Ability of TechPat

In this section, we discuss the generalization ability of our proposed technical phrase extraction methods to more types of technical documents. In fact, besides patent documents, technical phrases also appear in other documents that contain rich technical information, such as scientific articles and papers. Moreover, these technical documents often have a multi-level structure similar to that of patents (e.g., "Title", "Abstract"). With few adaptations, our proposed UMTPE and TechPat can be employed to recognize technical phrases from such documents directly. To verify their generalization ability, we apply these methods to a scientific article dataset and compare their extraction performance.
Specifically, we utilize the KP20k dataset [38], which contains the titles and abstracts of scientific articles in computer science. We sample 100,000 articles from the original KP20k dataset for the experiments; more statistics of the dataset are presented in Table 15. The experiments follow the same setup stated in Section 6.1. As for the initial level in UMTPE and TechPat (i.e., the CPC Group descriptions in the patent datasets), we replace it with the key phrases provided with these articles.17 Besides, we recalculate the statistical relation between the number of technical phrases (K) and the number of sentences ( \(N_{sen}\) ) in each scientific article:
\(\begin{equation} \frac{K}{N_{sen}} \approx \left\lbrace \begin{array}{ll} 2 & Title,\\ 1 & Abstract. \end{array} \right. \end{equation}\)  (21)
    According to this observation, we set \(K=2N_{sen}\) for “Title” and \(K=N_{sen}\) for “Abstract”, respectively.18 After that, we conduct the extraction experiments on the scientific article dataset, and evaluate the overall performance (i.e., Precision, Recall, and F1-score) on 100 scientific articles labeled with technical phrases.19 The results on the whole technical phrase set and two levels (i.e., “Title” and “Abstract”) are listed in Table 16.
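A minimal sketch of this level-dependent selection rule is given below; the naive sentence counter is an assumption added only for illustration.
def num_technical_phrases(level, text):
    """Phrase budget K per document, following Eq. (21)."""
    n_sen = max(1, text.count(". ") + 1)              # crude sentence count
    return 2 * n_sen if level == "Title" else n_sen   # else: "Abstract"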
Dataset            | Num. Articles | Avg. Sentences of Title | Avg. Sentences of Abstract
Scientific Article | 100,000       | 1.02                    | 7.01
Table 15. Statistics of the Scientific Article Dataset
Method           | Whole P / R / F1      | Title P / R / F1      | Abstract P / R / F1
ECON             | 17.96 / 6.95 / 9.45   | 5.00 / 2.66 / 3.33    | 15.64 / 6.78 / 8.74
DBpedia          | 37.74 / 16.48 / 21.59 | 29.50 / 18.49 / 21.89 | 32.02 / 15.28 / 19.33
Autophrase       | 30.29 / 21.67 / 24.08 | 23.50 / 15.83 / 18.13 | 26.34 / 21.63 / 22.69
NE-rank          | 21.49 / 25.35 / 22.32 | 35.00 / 38.66 / 35.40 | 15.74 / 17.26 / 15.37
Rake             | 16.42 / 18.22 / 16.59 | 25.33 / 25.83 / 24.69 | 12.63 / 13.41 / 12.35
Spacy            | 27.60 / 30.72 / 27.83 | 41.83 / 42.25 / 40.60 | 22.55 / 23.70 / 21.59
MultipartiteRank | 30.09 / 32.95 / 30.18 | 37.83 / 33.41 / 34.36 | 27.29 / 29.58 / 26.67
JMLGC            | 31.91 / 35.62 / 32.09 | 43.33 / 43.33 / 41.80 | 27.16 / 28.72 / 25.88
UMTPE            | 33.88 / 37.04 / 34.12 | 42.75 / 46.08 / 42.90 | 29.79 / 30.96 / 28.83
TechPat          | 35.84 / 37.78 / 35.08 | 43.50 / 45.50 / 43.03 | 32.25 / 34.67 / 31.50
Table 16. Overall Performance on the Scientific Article Dataset (%)
The bold fonts represent the optimal results.
From this table, we find that our proposed UMTPE and TechPat methods achieve the best performance among all baselines, which demonstrates their effectiveness and superiority. Moreover, similar to the performance on the patent datasets, DBpedia achieves excellent Precision but performs poorly on Recall and F1-score, because it relies entirely on the external database and can only extract a few phrases. As for the performance on different levels, NE-rank, Rake, Spacy, and JMLGC all perform well on "Title" but poorly on "Abstract", probably because the extraction difficulty increases as documents get longer. Finally, when we investigate the difference between UMTPE and TechPat, we find that TechPat achieves obvious improvements with the help of the multi-aspect graph structure and the newly revised measurement indicators, which is consistent with the experimental analysis on the patent datasets presented in Section 6.2.1.
In a nutshell, we applied our proposed technical phrase extraction methods, i.e., UMTPE and TechPat, to a scientific article dataset and achieved the best performance, which proves the strong generalization ability of our proposed methods.

    8 Conclusion and Future Work

In this article, we explored a promising direction for technical phrase extraction from patent data. Specifically, we first presented a clear and detailed description of technical phrases in patents based on various prior works, practical experience, and statistical analyses. Then, by analyzing the characteristics of technical phrases and effectively modeling the complex structure of patent documents (such as multi-aspect semantics and multi-level relevances), we developed a novel unsupervised model, namely TechPat, which can recognize technical phrases from massive patent texts without expensive human labeling. Subsequently, we designed a novel metric called IRE to evaluate extracted phrases from the perspective of representation ability, which supplements traditional evaluation metrics like Precision and Recall. Extensive experiments on real-world patent data demonstrated that the TechPat model can effectively discriminate technical phrases in patents and greatly outperform existing methods. We further applied the extracted technical phrases to two practical application tasks, where the experimental results confirmed the effect and application prospects of technical phrases. Finally, we transferred our proposed extraction methods to a scientific article dataset and demonstrated their strong generalization ability to other technical documents.
In future work, we would like to explore more applications of technical phrases in the patent field, such as patent similarity prediction and patent valuation. Moreover, we also plan to extract technical phrases from more types of technical documents, such as technical news.

    Footnotes

    3
    Based on expert experience and statistics, we also remove the determiners from these phrases and filter out some more distracting phrases, i.e., phrases ending with adjectives, prepositions, or adverbs.
    4
    This algorithm converges when the change of each ranking value is less than \(10^{-4}\) , and we set the maximum number of iterations as 100.
    5
    To facilitate fair comparison, we randomly select the extracted phrases to compose the query and ensure the query is composed of 10 words at most.
    6
    To facilitate fair comparison, we randomly select the extracted phrases and ensure the combination of technical phrases for each patent comprises 20 words at most.
    7
    In the Candidate Generation part, we remove the Autophrase method, as over 90% of technical phrases extracted by Autophrase are contained in the candidate phrase set generated by other methods.
    8
    Recall Section 3.2 for the labeling details.
    10
    We use LSI (Latent Semantic Indexing) model to perform this task.
    11
    For models that extract no phrases from the document, we set the score as 0.
    12
The setting of this number is relatively arbitrary and does not have a significant influence on this evaluation. If fewer phrases than the threshold are extracted, we select them all.
    13
    It has to be noted that here the pre-trained methods indicate the pre-trained phrase extraction models rather than pre-trained language models (e.g., BERT [10]).
    14
    Recall more details in Section 4.2.
    16
Note that the extraction process of UMTPE and TechPat never leaks the category information: (1) CPC Group descriptions are applied as the initial level to drive the extraction for the whole dataset, while for each specific patent, UMTPE and TechPat have no access to its category information; (2) For patent classification, we adopt CPC Subsection labels instead of CPC Group, avoiding the data leakage problem.
    17
    In detail, we randomly sample 1,000 key phrases from the whole scientific article dataset as the initial level, which can satisfy the requirement for prior knowledge to some extent.
    18
    Recall Section 4.5.2 for the selection details.
    19
    Recall Section 3.2 for the labeling details and Section 6.2.1 for the evaluation details, respectively.

    References

    [1]
    Wasi Ahmad, Xiao Bai, Soomin Lee, and Kai-Wei Chang. 2021. Select, extract and generate: Neural keyphrase generation with layer-wise coverage attention. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 1389–1404.
    [2]
    Abdelghani Bellaachia and Mohammed Al-Dhelaan. 2012. Ne-rank: A novel graph-based keyphrase extraction in twitter. In Proceedings of the 2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, Vol. 1. IEEE, 372–379.
    [3]
Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.
    [4]
    Saroj Kr Biswas, Monali Bordoloi, and Jacob Shreya. 2018. A graph based keyword extraction model using collective node weight. Expert Systems with Applications 97 (2018), 51–59.
    [5]
    Florian Boudin. 2018. Unsupervised keyphrase extraction with multipartite graphs. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 667–672.
    [6]
    Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems 30, 1–7 (1998), 107–117.
    [7]
Wang Chen, Yifan Gao, Jiani Zhang, Irwin King, and Michael R. Lyu. 2019. Title-guided encoding for keyphrase generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 6268–6275.
    [8]
    Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N. Mendes. 2013. Improving efficiency and accuracy in multilingual entity extraction. In Proceedings of the 9th International Conference on Semantic Systems. 121–124.
    [9]
    Mukund Deshpande and George Karypis. 2004. Item-based top-n recommendation algorithms. ACM Transactions on Information Systems (TOIS) 22, 1 (2004), 143–177.
    [10]
    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, 4171–4186.
    [11]
    Doug Downey, Matthew Broadhead, and Oren Etzioni. 2007. Locating complex named entities in web text. In IJCAI, Vol. 7. 2733–2739.
    [12]
    Songtao Fang, Zhenya Huang, Ming He, Shiwei Tong, Xiaoqing Huang, Ye Liu, Jie Huang, and Qi Liu. 2021. Guided attention network for concept extraction. In Proceedings of the 30th International Joint Conference on Artificial Intelligence. 1449–1455.
    [13]
    Atsushi Fujii, Makoto Iwayama, and Noriko Kando. 2007. Overview of the patent retrieval task at the NTCIR-6 workshop. In NTCIR.
    [14]
    Suyu Ge, Fangzhao Wu, Chuhan Wu, Tao Qi, Yongfeng Huang, and Xing Xie. 2020. Fedner: Privacy-preserving medical named entity recognition with federated learning. arXiv:2003.09288. Retrieved from https://arxiv.org/abs/2003.09288.
    [15]
    Juan Carlos Gomez and Marie-Francine Moens. 2014. A survey of automated hierarchical classification of patents. In Proceedings of the Professional Search in the Modern World. Springer, 215–249.
    [16]
    Kazi Saidul Hasan and Vincent Ng. 2014. Automatic keyphrase extraction: A survey of the state of the art. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1262–1273.
    [17]
    Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear 7, 1 (2017), 411–420.
    [18]
    Jie Hu, Shaobo Li, Yong Yao, Liya Yu, Guanci Yang, and Jianjun Hu. 2018. Patent keyword extraction algorithm based on distributed representation for patent classification. Entropy 20, 2 (2018), 104.
    [19]
    Mi-Young Kim, Ying Xu, Osmar R. Zaiane, and Randy Goebel. 2015. Recognition of patient-related named entities in noisy tele-health texts. ACM Transactions on Intelligent Systems and Technology (TIST) 6, 4 (2015), 1–23.
    [20]
    Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 260–270.
    [21]
    Keqian Li, Hanwen Zha, Yu Su, and Xifeng Yan. 2018. Concept mining via embedding. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 267–276.
    [22]
    Peipei Li, Haixun Wang, Hongsong Li, and Xindong Wu. 2018. Employing semantic context for sparse information extraction assessment. ACM Transactions on Knowledge Discovery from Data (TKDD) 12, 5 (2018), 1–36.
    [23]
    Tuohang Li, Liang Hu, Hongtu Li, Chengyu Sun, Shuai Li, and Ling Chi. 2021. TripleRank: An unsupervised keyphrase extraction algorithm. Knowledge-Based Systems 219 (2021), 106846.
    [24]
    Xinnian Liang, Shuangzhi Wu, Mu Li, and Zhoujun Li. 2021. Unsupervised keyphrase extraction by jointly modeling local and global context. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 155–164.
    [25]
    Bill Yuchen Lin, Dong-Ho Lee, Ming Shen, Ryan Moreno, Xiao Huang, Prashant Shiralkar, and Xiang Ren. 2020. TriggerNER: Learning with Entity Triggers as Explanations for Named Entity Recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 8503–8511.
    [26]
    Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, and Jiawei Han. 2015. Mining quality phrases from massive text corpora. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 1729–1744.
    [27]
    Qi Liu, Han Wu, Yuyang Ye, Hongke Zhao, Chuanren Liu, and Dongfang Du. 2018. Patent litigation prediction: A convolutional tensor factorization approach. In IJCAI. 5052–5059.
    [28]
    Qi Liu, Biao Xiang, Nicholas Jing Yuan, Enhong Chen, Hui Xiong, Yi Zheng, and Yu Yang. 2017. An influence propagation view of pagerank. ACM Transactions on Knowledge Discovery from Data (TKDD) 11, 3 (2017), 1–30.
    [29]
    Yuping Liu, Qi Liu, Runze Wu, Enhong Chen, Yu Su, Zhigang Chen, and Guoping Hu. 2016. Collaborative learning team formation: a cognitive modeling perspective. In Proceedings of the International Conference on Database Systems for Advanced Applications. Springer, 383–400.
    [30]
    Ye Liu, Han Wu, Zhenya Huang, Hao Wang, Jianhui Ma, Qi Liu, Enhong Chen, Hanqing Tao, and Ke Rui. 2020. Technical phrase extraction for patent mining: A multi-level approach. In Proceedings of the 2020 IEEE International Conference on Data Mining (ICDM). IEEE, 1142–1147.
    [31]
    Zhiyuan Liu, Peng Li, Yabin Zheng, and Maosong Sun. 2009. Clustering to find exemplar terms for keyphrase extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1. Association for Computational Linguistics, 257–266.
    [32]
    Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint entity recognition and disambiguation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 879–888.
    [33]
    Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1064–1074.
    [34]
    Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. 55–60.
    [35]
    Leland McInnes and John Healy. 2017. Accelerated hierarchical density based clustering. In Proceedings of the 2017 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 33–42.
    [36]
    Leland McInnes, John Healy, and Steve Astels. 2017. hdbscan: Hierarchical density based clustering. Journal of Open Source Software 2, 11 (2017), 205.
    [37]
    Rui Meng, Xingdi Yuan, Tong Wang, Sanqiang Zhao, Adam Trischler, and Daqing He. 2021. An empirical study on neural keyphrase generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4985–5007.
    [38]
    Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky, and Yu Chi. 2017. Deep keyphrase generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 582–592.
    [39]
    Rada Mihalcea and Paul Tarau. 2004. Textrank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. 404–411.
    [40]
    Heeyong Noh, Yeongran Jo, and Sungjoo Lee. 2015. Keyword selection and processing strategy for applying text mining to patent analysis. Expert Systems with Applications 42, 9 (2015), 4348–4360.
    [41]
    Liangming Pan, Xiaochen Wang, Chengjiang Li, Juanzi Li, and Jie Tang. 2017. Course concept extraction in moocs via embedding-based graph propagation. In Proceedings of the 8th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 875–884.
    [42]
    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 311–318.
    [43]
    Youngjin Park and Janghyeok Yoon. 2017. Application technology opportunity discovery from technology portfolios: Use of patent classification and collaborative filtering. Technological Forecasting and Social Change 118 (2017), 170–183.
    [44]
    Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, and others. 2011. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research 12 (2011), 2825–2830.
    [45]
    Qi Peng, Changmeng Zheng, Yi Cai, Tao Wang, Haoran Xie, and Qing Li. 2021. Unsupervised cross-domain named entity recognition using entity-aware adversarial training. Neural Networks 138 (2021), 68–77.
    [46]
    Gollam Rabby, Saiful Azad, Mufti Mahmud, Kamal Z. Zamli, and Mohammed Mostafizur Rahman. 2020. Teket: a tree-based unsupervised keyphrase extraction technique. Cognitive Computation 12, 4 (2020), 811–833.
    [47]
    Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. 2010. Automatic keyword extraction from individual documents. Text Mining: Applications and Theory 1 (2010), 1–20.
    [48]
    Walid Shalaby and Wlodek Zadrozny. 2019. Patent retrieval: a literature review. Knowledge and Information Systems 61 (2019), 631–660.
    [49]
    Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R. Voss, and Jiawei Han. 2018. Automated phrase mining from massive text corpora. IEEE Transactions on Knowledge and Data Engineering 30, 10 (2018), 1825–1837.
    [50]
    Xianjie Shen, Yinghan Wang, Rui Meng, and Jingbo Shang. 2022. Unsupervised deep keyphrase generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 11303–11311.
    [51]
    Tian Shi, Xuchao Zhang, Ping Wang, and Chandan K. Reddy. 2021. Corpus-level and concept-based explanations for interpretable document classification. ACM Transactions on Knowledge Discovery from Data (TKDD) 16, 3 (2021), 1–17.
    [52]
    Tadej Štajner, Bart Thomee, Ana-Maria Popescu, Marco Pennacchiotti, and Alejandro Jaimes. 2013. Automatic selection of social media responses to news. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 50–58.
    [53]
    Jie Tang, Bo Wang, Yang Yang, Po Hu, Yanting Zhao, Xinyu Yan, Bo Gao, Minlie Huang, Peng Xu, Weichang Li, et al. 2012. Patentminer: Topic-driven patent analysis and mining. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1366–1374.
    [54]
    Pingjie Tang, Meng Jiang, Bryan Ning Xia, Jed W. Pitera, Jeffrey Welser, and Nitesh V. Chawla. 2020. Multi-label patent categorization with non-local attention-based graph convolutional network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 9024–9031.
    [55]
    Yuen-Hsien Tseng, Chi-Jen Lin, and Yu.-I. Lin. 2007. Text mining techniques for patent analysis. Information Processing & Management 43, 5 (2007), 1216–1247.
    [56]
    Jorge Villalon and Rafael A. Calvo. 2009. Concept extraction from student essays, towards concept map mining. In Proceedings of the 2009 9th IEEE International Conference on Advanced Learning Technologies. IEEE, 221–225.
    [57]
    Yuhui Wang, Junping Du, Yingxia Shao, Ang Li, and Xin Xu. 2022. A patent text classification method based on phrase-context fusion feature. In Proceedings of the 2021 Chinese Intelligent Automation Conference. Springer, 157–164.
    [58]
    Yanan Wang, Qi Liu, Chuan Qin, Tong Xu, Yijun Wang, Enhong Chen, and Hui Xiong. 2018. Exploiting topic-based adversarial neural network for cross-domain keyphrase extraction. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 597–606.
    [59]
    Zhijuan Wang, Yinghui Feng, and Fuxian Li. 2016. The improvements of text rank for domain-specific key phrase extraction. International Journal of Simulation Systems, Science & Technology 17, 20 (2016), 1–11.
    [60]
    Zhichun Wang, Juanzi Li, and Jie Tang. 2013. Boosting cross-lingual knowledge linking via concept annotation. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence.
    [61]
    Yuzana Win and Tomonari Masada. 2015. Exploring technical phrase frames from research paper titles. In Proceedings of the 2015 IEEE 29th International Conference on Advanced Information Networking and Applications Workshops. IEEE, 558–563.
    [62]
    Chuhan Wu, Fangzhao Wu, Tao Qi, Junxin Liu, Yongfeng Huang, and Xing Xie. 2020. Detecting entities of works for Chinese chatbot. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 19, 6 (2020), 1–13.
    [63]
    Fangzhao Wu, Junxin Liu, Chuhan Wu, Yongfeng Huang, and Xing Xie. 2019. Neural chinese named entity recognition via CNN-LSTM-CRF and joint training with word segmentation. In Proceedings of the World Wide Web Conference. 3342–3348.
    [64]
    Han Wu, Kun Zhang, Guangyi Lv, Qi Liu, Runlong Yu, Weihao Zhao, Enhong Chen, and Jianhui Ma. 2019. Deep technology tracing for high-tech companies. In Proceedings of the 2019 IEEE International Conference on Data Mining (ICDM). IEEE, 1396–1401.
    [65]
    Han Xiao. 2018. bert-as-service. Retrieved from https://github.com/hanxiao/bert-as-service.
    [66]
    Yongxiu Xu, Heyan Huang, Chong Feng, and Yue Hu. 2021. A supervised multi-head self-attention network for nested named entity recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 14185–14193.
    [67]
    Xiaobing Xue and W. Bruce Croft. 2009. Automatic query generation for patent search. In Proceedings of the 18th ACM Conference on Information and Knowledge Management. 2037–2040.
    [68]
Xiaobing Xue and W. Bruce Croft. 2009. Transforming patents into prior-art queries. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 808–809.
    [69]
    Xi Yang, Jiang Bian, William R. Hogan, and Yonghui Wu. 2020. Clinical concept extraction using transformers. Journal of the American Medical Informatics Association 27, 12 (2020), 1935–1942.
    [70]
    Yang Yang, Jie Tang, and Juanzi Li. 2018. Learning to infer competitive relationships in heterogeneous networks. ACM Transactions on Knowledge Discovery from Data (TKDD) 12, 1 (2018), 1–23.
    [71]
    Jifan Yu, Chenyu Wang, Gan Luo, Lei Hou, Juanzi Li, Jie Tang, and Zhiyuan Liu. 2019. Course concept expansion in moocs with external knowledge and interactive game. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 4292–4302.
    [72]
    Hongyuan Zha. 2002. Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 113–120.
    [73]
    Longhui Zhang, Lei Li, and Tao Li. 2015. Patent mining: A survey. ACM Sigkdd Explorations Newsletter 16, 2 (2015), 1–19.
    [74]
    Longhui Zhang, Lei Li, Tao Li, and Qi Zhang. 2014. Patentline: analyzing technology evolution on multi-view patent graphs. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval. 1095–1098.
    [75]
    Ziqi Zhang, Jie Gao, and Fabio Ciravegna. 2018. Semre-rank: Improving automatic term extraction by incorporating semantic relatedness with personalised pagerank. ACM Transactions on Knowledge Discovery from Data (TKDD) 12, 5 (2018), 1–41.
    [76]
    Feng Zhao, Xianyu Gui, Yafan Huang, Hai Jin, and Laurence T. Yang. 2020. Dynamic entity-based named entity recognition under unconstrained tagging schemes. IEEE Transactions on Big Data 8, 4 (2020), 1059–1072.
    [77]
    Baohang Zhou, Xiangrui Cai, Ying Zhang, and Xiaojie Yuan. 2021. An end-to-end progressive multi-task learning framework for medical named entity recognition and normalization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 6214–6224.
    [78]
    Decong Li, Sujian Li, Wenjie Li, Wei Wang, and Weiguang Qu. 2010. A semi-supervised key phrase extraction approach: learning from title phrases through a document semantic network. In Proceedings of the ACL 2010 conference short papers. 296–300.

    Published In

ACM Transactions on Knowledge Discovery from Data, Volume 17, Issue 9, November 2023, 373 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/3604532

    Publisher

Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 15 June 2023
    Online AM: 13 May 2023
    Accepted: 03 May 2023
    Revised: 03 September 2022
    Received: 12 January 2022
    Published in TKDD Volume 17, Issue 9


    Author Tags

    1. Technology portrait
    2. technical phrase extraction
    3. patent mining
    4. multi-level
    5. multi-aspect


    Funding Sources

    • National Natural Science Foundation of China
    • The provincial projects on quality engineering for colleges and universities in Anhui Province
