1 Introduction
According to statistics from the World Intellectual Property Organization (WIPO), patent applications, which contain rich innovative ideas, keep growing rapidly worldwide, and their number reached 3.3 million in 2020. Indeed, the explosive growth of patents creates a valuable base of data for revealing the underlying laws of innovation through patent analysis [27, 43, 64, 70, 74], but at the same time, it puts forward higher requirements on patent mining techniques [53, 73].
As a matter of fact, patent mining is often highly reliant on text analysis, i.e., how to process, organize, and analyze the key information of patent documents [22, 55]. An effective step here is to construct a technology portrait for each patent, that is, to identify the technical phrases [61] involved, which aids greatly in summarizing the key information it contains from a technical perspective. Generally speaking, technical phrases refer to phrases that are closely related to specific technologies. For example, a given patent may contain “wireless communication” and “multiplex communication”. These technical phrases indicate that the patent is closely related to the “electric network” domain, and their combination can be seen as a technology portrait tagging this patent (a clearer and more detailed description of technical phrases can be found in Section 3.2). To better understand the distinctiveness of technical phrases, we present a comparison between technical and non-technical phrases in Table 1. From this table, we can observe that technical phrases differ substantially from non-technical phrases: the former contain rich technical information and may represent a certain technology (e.g., wireless communication), while the latter, relatively speaking, tend to have a more general or common meaning (e.g., wire and cable). Accordingly, compared with non-technical phrases, technical phrases can reveal and represent the technologies contained in patent documents, thereby providing a vital basis for patent mining. Therefore, how to automatically extract these technical phrases from massive patent documents to construct technology portraits is a meaningful research issue.
To the best of our knowledge, there have been few works specifically designed for technical phrase extraction, although some relevant lines of work on phrase extraction have been explored. According to the extraction target, these can be divided into three categories: key phrase extraction [58], Named Entity Recognition (NER) [20], and concept extraction [21]. In more detail, key phrase extraction aims to extract phrases that provide a concise summary of a document [58], with preference given to those that are both frequently occurring and close to the main topics. NER [20] focuses on locating and classifying named entities into pre-defined categories and pays more attention to whether the given phrases are real entities. For its part, concept extraction [21] is somewhat similar to technical phrase extraction and aims to find words or phrases describing a concept in massive texts. However, it is worth noting that a concept here is not equivalent to a technical phrase, as some phrases like “user preference” and “reproductive age” actually belong to concepts but are not our focus (i.e., technical phrases). To summarize, although the extraction targets of these works vary, most of them ignore the technical information contained in phrases, which is the key attribute of technical phrases.
Unfortunately, many technical and domain challenges are inherent in designing and implementing an effective technical phrase extraction system in the patent field. First, as the technical meaning in patent documents is difficult to quantify, technical phrases exhibit perplexing and elusive characteristics; two similar phrases may have completely different implications. For example, “support vector machine (SVM)” is a technical phrase indicating a classification algorithm, while “support machine” is not. Second, technical phrases often appear at different levels of one patent (i.e., “Title”, “Abstract”, and “Claim”), and this multi-level structure of patents shows strong connections in describing a common technology target. Therefore, how to combine the information from different levels and effectively utilize their relations are also key challenges for recognizing technical phrases in patents. Third, in the text of each level, there are often phrases that describe various aspects of the content, especially in long texts. For instance, as shown in Figure 1, in a patent on “Computer Network”, there exist phrases describing the “Transmission” aspect (such as “uplink transmission” and “media stream transmission”), as well as phrases concerning the “Computing” aspect (such as “cloud computing” and “parallel computing”). We refer to this as the multi-aspect semantics structure, which reveals the distribution of numerous phrases in the text and thus provides an entry point for technical phrase recognition.
To achieve the primary goal of extracting technical phrases while addressing the first two challenges, in our preliminary work [30], we propose an Unsupervised Multi-level Technical Phrase Extraction (UMTPE) model, which primarily explores both the statistical and semantic characteristics of technical phrases and the multi-level structure of patent data. Specifically: (1) we first analyze the key characteristics of technical phrases in patent documents and provide a clear description of them. Then, we design several measurement indicators for technical phrases from statistical and semantic perspectives, which enable us to recognize technical phrases accurately and comprehensively. (2) Considering the relations between different levels in patents, we design components (i.e., Topic Generation, Topic Relevance) to relate adjacent levels, which can extensively utilize the information implied in the multi-level structure. With the help of these designs, UMTPE can extract technical phrases from numerous patent documents; however, it neglects the multi-aspect semantics structure in patent texts.
In this article, to better mine the semantics structure in patent texts and improve the extraction performance of our proposed model, we further develop an extended version of UMTPE and propose another enhanced model, namely TechPat, in which we refine the analysis of technical phrases and patent data, and further incorporate the multi-aspect semantics structure into our modeling process. To be more specific, in the TechPat model, we propose the multi-aspect graph to characterize the relation of different phrases in the text, which is closer to the real distribution of phrases. Then, we revise the measurement indicators in UMTPE accordingly and design a novel ranking algorithm to help select technical phrases from a pre-generated candidate phrase pool. In this way, TechPat can more accurately recognize technical phrases from massive patent documents.
After the extraction process, it is still a non-trivial task to comprehensively evaluate the results, especially for such special extraction tasks. In this article, to supplement traditional evaluation metrics and improve evaluation confidence, we propose a novel metric called Information Retrieval Efficiency (IRE) to evaluate the extracted technical phrases from the perspective of representation ability. Extensive experiments on real-world patent data demonstrate that our proposed methods can effectively discriminate technical phrases in patents and greatly outperform existing baselines. Finally, we further apply the extracted technical phrases to two practical application tasks, namely patent search and patent classification, the results of which confirm the application prospects of technical phrases.
Although our proposed methods focus on technical phrase extraction in patent documents, the designs are general and can be transferred to other types of technical documents. We describe how to apply our methods to scientific article data, and discuss their generalization ability to more technical documents.
Overview. The remainder of this article is organized as follows. In Section 2, we briefly introduce some related works of our study. In Section 3, we introduce some preliminaries pertaining to both patent data and technical phrases, and further present the problem statement and our solution overview. Section 4 contains the details of our proposed TechPat model. In Section 5, we specify two application tasks of our proposed technical phrases. Section 6 presents the experimental results. After that, we further discuss the generalization ability of our proposed methods to more technical documents in Section 7. Finally, conclusions are given in Section 8.
2 Related Work
To the best of our knowledge, few existing works have been directly designed to extract technical phrases from patents. However, some relevant works can still be identified, including key phrase extraction, NER and concept extraction.
Key Phrase Extraction. Key phrase extraction aims to extract phrases that provide a concise summary of a document. It has been widely studied in data mining tasks, with both supervised [1, 7, 26, 37, 38, 49, 58, 78] and unsupervised methods [2, 5, 23, 24, 31, 46, 47, 50, 59, 72]. On the one hand, supervised methods often aim to train a complex model with the help of labeled data or external knowledge bases. For instance, Meng et al. [37, 38] designed an encoder-decoder framework to generate key phrases from the original text. Ahmad et al. [1] designed a novel transformer-based architecture to mine key phrases in long documents, which can extract and generate key phrases from the text simultaneously. Shang et al. [26, 49] proposed an approach that adaptively recognizes phrase occurrences based on quality estimation, which relies on external knowledge bases (e.g., Wikipedia) to some extent. On the other hand, unsupervised methods focus on mining the inner connections in documents in response to the lack of labeled data. Bellaachia et al. [2] designed an improved ranking algorithm based on PageRank and TextRank to evaluate the importance of words in documents, which they used to formulate key phrases. Liang et al. [24] proposed to utilize pretrained embeddings to represent both the document and candidate phrases, after which a ranking mechanism considering both local and global similarities in the context is conducted to select key phrases.
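The embedding-based ranking idea described above can be sketched as follows: represent the document and each candidate phrase as vectors, then rank candidates by similarity to the document. This is only an illustrative sketch; the toy vectors and the cosine-only scoring are our simplifying assumptions, not the exact method of [24], which in practice uses pretrained encoders and additional global-similarity terms.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(doc_vec, candidate_vecs, top_k=2):
    """Return the top_k candidate phrases most similar to the document."""
    scored = [(cosine(doc_vec, vec), phrase)
              for phrase, vec in candidate_vecs.items()]
    scored.sort(reverse=True)
    return [phrase for _, phrase in scored[:top_k]]

# Hypothetical 3-dimensional embeddings for illustration only.
doc_vec = [0.9, 0.1, 0.3]
candidates = {
    "wireless communication":  [0.8, 0.2, 0.4],
    "wire and cable":          [0.1, 0.9, 0.2],
    "multiplex communication": [0.7, 0.1, 0.5],
}
print(rank_candidates(doc_vec, candidates))
# → ['wireless communication', 'multiplex communication']
```

Here the generic phrase “wire and cable” is ranked out because its embedding lies far from the document vector, mirroring how such unsupervised rankers favor topically central phrases.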
Named Entity Recognition. NER focuses on locating and classifying named entities into pre-defined categories and is often regarded as a sequence labeling problem. In the early stages, researchers applied Conditional Random Field (CRF), SVM, and perceptron models with hand-crafted features [11, 19, 32]. In recent years, as deep learning has rapidly developed, NER has tended to be tackled with Recurrent Neural Networks (RNNs) and attention mechanisms [14, 20, 25, 33, 45, 62, 63, 66, 76, 77]. For example, Chiu et al. [33] proposed a hybrid BiLSTM-CNNs-CRF architecture to locate named entities in the original text. Lin et al. [25] put forward the “entity trigger” to improve the traditional LSTM&CRF framework, which increased the model’s interpretability and substantially reduced the amount of labeled data required. On the basis of the LSTM&CRF framework, Wu et al. [62] proposed an automatic annotation method via quote marks, which can help detect entities from Chinese chatbot conversation logs without supervision. Zhou et al. [66] proposed to treat NER as a multi-class classification problem over word pairs and designed a multi-head self-attention mechanism to mine word-level correlations for each entity type.
Some pretrained models have also been developed for this task [8, 17, 34]. For example, Honnibal et al. [17] released a package called spaCy for NER, noun phrase chunking, and other annotation tasks, which achieves good time efficiency and robustness.
Concept Extraction. Concept extraction aims to find words or phrases describing a concept within massive texts, and has been studied extensively in previous works [12, 21, 41, 56, 60, 69, 71, 75]. Generally speaking, concept extraction can also be formulated as a sequence labeling problem. Traditional methods often adopt a generation-and-selection mechanism, which consists of two steps: (1) extracting candidate concepts via hand-crafted rules or syntactic pattern matching; (2) selecting target concepts with supervised or unsupervised methods. For instance, Li et al. [21] utilized a range of models to generate possible concepts, and then designed a novel architecture to evaluate the fitness of the extracted concepts relative to the original text. Recently, with the help of strong computing power, deep learning methods have achieved remarkable performance. For example, Yang et al. [69] designed a transformer-based model to directly extract concepts from massive texts. Fang et al. [12] proposed a Guided Attention Network, in which three additional supervision signals are introduced to explore the structured information in raw text; it achieves good performance and learning efficiency.
In summary, the above studies focus on their respective target phrases and cannot be directly transferred to technical phrase extraction. First, supervised methods are unsuitable for our task, as there are insufficient labeled technical phrases in massive patent documents. Second, unsupervised approaches are often sensitive to the extraction target, meaning that certain gaps exist between technical phrase extraction and the existing methods. Moreover, the characteristics of patent data are another consideration: the technology relations between different levels in patents (i.e., “Title”, “Abstract”, and “Claim”) provide opportunities for aiding technical phrase recognition from patent documents, but cannot be effectively captured by existing models.
3 Overview
In this section, we first introduce the patent data and analyze the multi-aspect semantics in patent text. Then, based on expert experience and statistics, we provide a clear description of technical phrases in patents. Finally, we present the problem statement of technical phrase extraction and specify our solution overview.
3.1 Patent Data
The patent data we use is provided by the United States Patent and Trademark Office (USPTO), and comprises two domains, i.e., Mechanical Engineering and Electricity. Each patent contains a multi-level structure, i.e., “Title”, “Abstract”, and “Claim”, where “Title” and “Abstract” depict the topic and brief summary of a patent, while “Claim” is a more detailed and lengthy description of the inventor’s rights.
Multi-aspect Semantics. As noted in Section 1, on each level of a patent, there are often phrases that describe various aspects of the content, especially in long texts. To facilitate better analysis and illustration, we provide an example of phrases in “Abstract” and “Claim” in Figure 2. This figure reveals the distribution of all phrases in semantic space; each node represents a phrase, while the color indicates the semantic aspect to which the phrase belongs. We can easily determine that phrases in the same aspect often gather together in a sub-semantic-space and are much more closely related than others. We refer to this phenomenon as the multi-aspect semantics structure in the patent text, which provides a vital basis for the modeling of the technical phrase recognition process.
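The multi-aspect intuition above can be sketched with a toy grouping procedure: phrases whose embeddings are close in semantic space fall into the same aspect. The 2-dimensional vectors and the greedy similarity threshold below are illustrative assumptions, not the actual graph construction used by TechPat.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def group_into_aspects(phrase_vecs, threshold=0.8):
    """Greedily assign each phrase to the first aspect whose representative
    vector is similar enough; otherwise start a new aspect."""
    aspects = []  # each aspect: (representative_vector, [member phrases])
    for phrase, vec in phrase_vecs.items():
        for rep, members in aspects:
            if cosine(rep, vec) >= threshold:
                members.append(phrase)
                break
        else:
            aspects.append((vec, [phrase]))
    return [members for _, members in aspects]

# Hypothetical embeddings echoing the "Transmission" vs. "Computing" example.
phrases = {
    "uplink transmission":       [0.90, 0.10],
    "media stream transmission": [0.85, 0.15],
    "cloud computing":           [0.10, 0.90],
    "parallel computing":        [0.15, 0.85],
}
print(group_into_aspects(phrases))
# → [['uplink transmission', 'media stream transmission'],
#    ['cloud computing', 'parallel computing']]
```

Even this crude sketch separates the “Transmission” phrases from the “Computing” phrases, which is exactly the sub-semantic-space clustering visible in Figure 2.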
3.2 Description of Technical Phrase
In this subsection, we hire four experts to manually extract technical phrases from 100 patents in two domains, i.e., Mechanical Engineering and Electricity, respectively. After examining the technical phrases extracted from patents, we can make several specific observations:
(1) Part of Speech. Although the part-of-speech distribution of technical phrases shows various types, most of them are noun phrases. According to the statistics of extracted phrases, noun phrases account for more than 90%.
(2) Number of Words. As Figure 3 shows, the lengths of technical phrases in different domains are slightly different; however, most of them comprise 2\(\sim\)4 words, sometimes reaching 5.
(3) Semantic Context. In a patent document, there often exist similar technical phrases, such as “image encoding” and “image decoding”. It is easy to understand that technical phrases occurring in the same context will be relatively more similar to each other in semantics. Besides, technical phrases are expected to have a relatively independent technical meaning. While some phrases like “system architecture” also frequently occur in conjunction with technical phrases, these are not our focus as they have no specific technical meaning.
(4) Local Occurrence. On each level of a patent, technical phrases often appear more than once, especially in long texts, which can be seen as local occurrence. For example, among the technical phrases extracted from “Claim”, over 70% appear in the text at least twice.
(5) Global Occurrence. In the same patent document, a common technical phrase tends to appear repeatedly across different levels. That is to say, its global occurrence in the multi-level structure may provide some insights for aiding technical phrase recognition. To verify this point, a focused analysis is conducted in the following.
Figure 4(a) illustrates the average number of technical phrases at different levels. As we can see, the number of technical phrases increases rapidly from “Title” to “Claim” on both datasets. Figure 4(b) shows the average ratio of the number of technical phrases to the number of words at different levels. From “Title” to “Claim”, this ratio drops significantly, indicating that more and more non-technical phrases emerge, which greatly increases the difficulty of recognizing technical phrases. This clearly reveals that although technical phrases become increasingly abundant from top to bottom in the multi-level structure, the extraction difficulty rises rapidly as the interfering factors grow.
Meanwhile, we find that over 35% of “Abstract”s share at least one technical phrase with their “Title”s, and this percentage rises to 80% between “Claim”s and “Abstract”s. In other words, the technical phrases from the current level (e.g., “Title”) may play a guiding role in the technical phrase extraction of the next level (e.g., “Abstract”). We can therefore use the phrases extracted from the current level to help guide extraction in the next level, which formulates a multi-level model architecture and effectively utilizes the information between different levels.
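The cross-level overlap statistic discussed above can be computed with a few lines of code. The sketch below uses hypothetical toy data; `overlap_ratio` and the level names are our own illustrative conventions, not part of the released system.

```python
def overlap_ratio(patents, upper="title", lower="abstract"):
    """Fraction of patents whose `lower`-level phrase set shares at least
    one technical phrase with the `upper`-level phrase set."""
    hits = sum(1 for p in patents if p[upper] & p[lower])
    return hits / len(patents)

# Two toy patents with hand-made phrase sets for illustration.
patents = [
    {"title": {"wireless communication"},
     "abstract": {"wireless communication", "signal processing"}},
    {"title": {"fuel injection"},
     "abstract": {"engine control"}},
]
print(overlap_ratio(patents))  # → 0.5
```

Running this over a labeled corpus yields the 35% (Title–Abstract) and 80% (Abstract–Claim) figures reported above.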
Moreover, existing patent classification systems can be an initial driving force for technical phrase recognition, for example, the Cooperative Patent Classification Group (CPC Group), whose descriptions (Table 2) are highly relevant to technologies. Although both the quality and quantity of CPC Group descriptions are limited, we can still regard them as prior knowledge of technical phrases, which can help guide the extraction process at the first level (“Title”) of the patent.
3.3 Problem Statement
Based on the multi-level structure of patent documents, we attempt to extract technical phrases level by level. The extracted phrases in the current level will be seen as the prior knowledge for guiding the next level, while CPC Group descriptions can be seen as the initial level.
In more detail, for each level of a patent (i.e., “Title”, “Abstract”, and “Claim”), technical phrase extraction is formulated as a generation and selection problem [16]. That is to say, given the word sequence of a patent document \(\boldsymbol {x} = (x_1, x_2, \ldots , x_n)\), we first build a large-scale candidate pool \(\boldsymbol {Y} = \lbrace y_i, i = 1, 2, \ldots \rbrace\), where \(y_i = (x_m, x_{m+1}, \ldots , x_{m+l-1})\) is a possible technical phrase, \(n\) represents the length of the patent document, \(m\) indicates the starting location of the candidate phrase, and \(l\) represents the candidate phrase’s length. Next, from the candidate pool \(\boldsymbol {Y}\), we design a score-and-rank mechanism to select the final technical phrases. For convenience, we refer to the extracted phrase list at a certain level as \(P_{level}\), such as \(P_{title}\). Finally, with the technical phrases extracted from “Title”, “Abstract”, and “Claim”, we obtain the technical phrase set \(P_{all} = \lbrace P_{title}, P_{abstract}, P_{claim}\rbrace\) for each patent.
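The generation step of this formulation can be sketched directly: enumerate contiguous n-grams \(y_i = (x_m, \ldots, x_{m+l-1})\) as candidates. The length range of 2 to 5 words follows the statistics in Section 3.2; the scoring and ranking step that selects final phrases from the pool is deliberately omitted here.

```python
def generate_candidates(words, min_len=2, max_len=5):
    """Build the candidate pool Y of contiguous word n-grams from the
    word sequence x, with lengths between min_len and max_len."""
    pool = set()
    for m in range(len(words)):                  # starting location m
        for l in range(min_len, max_len + 1):    # candidate length l
            if m + l <= len(words):
                pool.add(" ".join(words[m:m + l]))
    return pool

x = "a wireless communication system".split()
cands = sorted(generate_candidates(x))
print(cands)
```

In practice, this exhaustive pool would then be pruned (e.g., by the noun phrase and occurrence criteria of Section 3.2) before the score-and-rank mechanism is applied.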
3.4 Solution Overview
Our solution overview is shown in Figure 5. Specifically, based on the existing CPC Group descriptions and patent documents organized in a multi-level structure (“Title”, “Abstract”, and “Claim”), we propose our TechPat model. TechPat mines the relations between different levels in the patent documents and follows a generation-and-selection process to recognize technical phrases, which will be introduced in detail in Section 4. After the extraction of technical phrases, we further apply them to two practical application tasks, i.e., searching relevant patents in the patent database, and classifying patents into given categories, which can prove the effect and application prospects of technical phrases.
In the following, we will specify the modeling process of our proposed TechPat model.
7 Generalization Ability of TechPat
In this section, we discuss the generalization ability of our proposed technical phrase extraction methods to more types of technical documents. In fact, besides patent documents, technical phrases also appear in other documents that contain rich technical information, such as scientific articles and papers. Moreover, these technical documents often have a multi-level structure similar to that of patents (e.g., “Title”, “Abstract”). With few adaptations, our proposed UMTPE and TechPat can be employed to recognize technical phrases from these documents directly. To verify their generalization ability, we apply these methods to a scientific article dataset and compare their extraction performance.
Specifically, we utilize the KP20k dataset [38], which contains the titles and abstracts of scientific articles in computer science. 100,000 pieces of data are sampled from the original KP20k dataset to conduct experiments, and more statistics of the dataset are presented in Table 15. The experiments follow the same setup as stated in Section 6.1. As for the initial level in UMTPE and TechPat (i.e., the CPC Group descriptions in the patent datasets), we replace it with the key phrases provided with these articles. Besides, we recalculate the statistical relation between the number of technical phrases (\(K\)) and the number of sentences (\(N_{sen}\)) in each scientific article. According to this observation, we set \(K=2N_{sen}\) for “Title” and \(K=N_{sen}\) for “Abstract”, respectively.
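The level-dependent budget stated above is a simple rule; a small helper (the function name is ours, for illustration only) makes it concrete:

```python
def phrase_budget(level, n_sen):
    """Number of technical phrases K to extract for a level, given the
    sentence count N_sen: K = 2 * N_sen for "Title", K = N_sen otherwise."""
    return 2 * n_sen if level == "title" else n_sen

print(phrase_budget("title", 1), phrase_budget("abstract", 5))  # → 2 5
```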
After that, we conduct the extraction experiments on the scientific article dataset, and evaluate the overall performance (i.e., Precision, Recall, and F1-score) on 100 scientific articles labeled with technical phrases. The results on the whole technical phrase set and on two levels (i.e., “Title” and “Abstract”) are listed in Table 16.
From this table, we can see that our proposed UMTPE and TechPat methods achieve the best performance compared with the other baselines, which demonstrates their effectiveness and superiority. Moreover, similar to the performance on the patent datasets, DBpedia achieves excellent performance in Precision but performs poorly in Recall and F1-score. This is because it relies entirely on an external database and can only extract a few phrases. As for the performance at different levels, NE-rank, Rake, Spacy, and JMLGC all perform well on “Title” but poorly on “Abstract”, probably because the extraction difficulty increases as the document gets longer. Finally, when we investigate the difference between UMTPE and TechPat, we find that TechPat achieves obvious improvements with the help of the multi-aspect graph structure and the newly revised measurement indicators, which is consistent with the experimental analysis on the patent datasets presented in Section 6.2.1.
In a nutshell, we apply our proposed technical phrase extraction methods, i.e., UMTPE and TechPat, to a scientific article dataset and achieve the best performance, which proves the strong generalization ability of our proposed methods.
8 Conclusion and Future Work
In this article, we explored a promising direction of technical phrase extraction from patent data. Specifically, we first presented a clear and detailed description of technical phrases in patents based on various prior works, practical experience, and statistical analyses. Then, by analyzing the characteristics of technical phrases and effectively modeling the complex structure of patent documents (such as multi-aspect semantics and multi-level relevance), we developed a novel unsupervised model, namely TechPat, which can recognize technical phrases from massive patent texts without requiring expensive human labeling. Subsequently, we designed a novel metric called IRE to evaluate extracted phrases from the perspective of representation ability, which supplements traditional evaluation metrics like Precision and Recall. Extensive experiments on real-world patent data demonstrated that the TechPat model can effectively discriminate technical phrases in patents and greatly outperform existing methods. We further applied the extracted technical phrases to two practical application tasks, where the experimental results confirmed the effect and application prospects of technical phrases. Finally, we transferred our proposed extraction methods to a scientific article dataset and demonstrated their strong generalization ability to more technical documents.
In future work, we would like to explore more applications of technical phrases in the patent field, such as patent similarity prediction and patent valuation. Moreover, we also plan to extract technical phrases from more types of technical documents, such as technical news.