1 Introduction
According to statistics from the World Intellectual Property Organization (WIPO), patent applications, which contain rich innovative ideas, keep growing rapidly worldwide, and their number reached 3.3 million in 2020. Indeed, the explosive growth of patents creates a valuable base of data for revealing the underlying laws of innovation through patent analysis [27, 43, 64, 70, 74], but at the same time, it puts forward higher requirements on patent mining techniques [53, 73].
As a matter of fact, patent mining is often highly reliant on text analysis, i.e., how to process, organize, and analyze the key information of patent documents [22, 55]. An effective step here is to construct a technology portrait for each patent, that is, to identify the technical phrases [61] involved, which aids greatly in summarizing the key information it contains from a technical perspective. Generally speaking, technical phrases refer to phrases that are closely related to specific technologies. For example, a given patent may contain “wireless communication” and “multiplex communication”. These technical phrases indicate that the patent is closely related to the “electric network” domain, and their combination can be seen as a technology portrait tagging this patent (a clearer and more detailed description of technical phrases can be found in Section 3.2). To better understand the distinctiveness of technical phrases, we present a comparison between technical and non-technical phrases in Table 1. From this table, we can observe that technical phrases differ substantially from non-technical phrases: the former contain rich technical information and may represent a certain technology (e.g., wireless communication), while the latter, relatively speaking, tend to have a more general or common meaning (e.g., wire and cable). Accordingly, compared with non-technical phrases, technical phrases can reveal and represent the technologies contained in patent documents, thereby providing a vital basis for patent mining. Therefore, how to automatically extract these technical phrases from massive patent documents to construct technology portraits is a meaningful research issue.
To the best of our knowledge, there have been few works specifically designed for technical phrase extraction, although some relevant lines of work on phrase extraction have been explored. According to the extraction target, these can be divided into three categories: key phrase extraction [58], Named Entity Recognition (NER) [20], and concept extraction [21]. In more detail, key phrase extraction aims to extract phrases that provide a concise summary of a document [58], with preference given to those that are both frequently occurring and close to the main topics. NER [20] focuses on locating and classifying named entities into pre-defined categories and pays more attention to whether the given phrases are real entities. For its part, concept extraction [21] is somewhat similar to technical phrase extraction and aims to find words or phrases describing a concept in massive texts. However, it is worth noting that a concept here is not equivalent to a technical phrase, as some phrases like “user preference” and “reproductive age” actually belong to concepts but are not our focus (i.e., technical phrases). To summarize, although the extraction targets of these works vary, most of them ignore the technical information contained in phrases, which is the key attribute of technical phrases.
Unfortunately, many technical and domain challenges are inherent in designing and implementing an effective technical phrase extraction system in the patent field. First, as the technical meaning in patent documents is difficult to quantify, technical phrases exhibit perplexing and elusive characteristics; two similar phrases may have completely different implications. For example, “support vector machine (SVM)” is a technical phrase indicating a classification algorithm, while “support machine” is not. Second, technical phrases often appear at different levels of one patent (i.e., “Title”, “Abstract”, and “Claim”), and this multi-level structure of patents shows strong connections in describing a common technology target. Therefore, how to combine the information from different levels and effectively utilize their relations are also key challenges for recognizing technical phrases in patents. Third, in the text of each level, there are often phrases that describe various aspects of the content, especially in long texts. For instance, as shown in Figure 1, in a patent on “Computer Network”, there exist phrases describing the “Transmission” aspect (such as “uplink transmission” and “media stream transmission”), as well as phrases concerning the “Computing” aspect (such as “cloud computing” and “parallel computing”). We refer to this as the multi-aspect semantics structure, which reveals the distribution of numerous phrases in the text and thus provides an entry point for technical phrase recognition.
To achieve the primary goal of extracting technical phrases while addressing the first two challenges, in our preliminary work [30], we propose an Unsupervised Multi-level Technical Phrase Extraction (UMTPE) model, which primarily explores both the statistical and semantic characteristics of technical phrases and the multi-level structure of patent data. Specifically: (1) we first analyze the key characteristics of technical phrases in patent documents and provide a clear description of them. Then, we design several measurement indicators for technical phrases from statistical and semantic perspectives, which enable us to recognize technical phrases accurately and comprehensively. (2) Considering the relations between different levels in patents, we design components (i.e., Topic Generation, Topic Relevance) to relate adjacent levels, which can extensively utilize the information implied in the multi-level structure. With the help of these designs, UMTPE can extract technical phrases from numerous patent documents; however, it neglects the multi-aspect semantics structure in patent texts.
In this article, to better mine the semantics structure in patent texts and improve the extraction performance of our proposed model, we further develop an extended version of UMTPE and propose another enhanced model, namely TechPat, in which we refine the analysis of technical phrases and patent data, and further incorporate the multi-aspect semantics structure into our modeling process. To be more specific, in the TechPat model, we propose the multi-aspect graph to characterize the relation of different phrases in the text, which is closer to the real distribution of phrases. Then, we revise the measurement indicators in UMTPE accordingly and design a novel ranking algorithm to help select technical phrases from a pre-generated candidate phrase pool. In this way, TechPat can more accurately recognize technical phrases from massive patent documents.
After the extraction process, it is still a non-trivial task to comprehensively evaluate the results, especially for such special extraction tasks. In this article, to supplement traditional evaluation metrics and improve evaluation confidence, we propose a novel metric called Information Retrieval Efficiency (IRE) to evaluate the extracted technical phrases from the perspective of representation ability. Extensive experiments on real-world patent data demonstrate that our proposed methods can effectively discriminate technical phrases in patents and greatly outperform existing baselines. Finally, we further apply the extracted technical phrases to two practical application tasks, namely patent search and patent classification, the results of which confirm the application prospects of technical phrases.
Although our proposed methods focus on technical phrase extraction in patent documents, the designs are general and can be transferred to other types of technical documents. We describe how to apply our methods to scientific article data, and discuss their generalization ability to more technical documents.
Overview. The remainder of this article is organized as follows. In Section 2, we briefly introduce some related works of our study. In Section 3, we introduce some preliminaries pertaining to both patent data and technical phrases, and further present the problem statement and our solution overview. Section 4 contains the details of our proposed TechPat model. In Section 5, we specify two application tasks of our proposed technical phrases. Section 6 presents the experimental results. After that, we further discuss the generalization ability of our proposed methods to more technical documents in Section 7. Finally, conclusions are given in Section 8.
2 Related Work
To the best of our knowledge, few existing works have been directly designed to extract technical phrases from patents. However, some relevant works can still be identified, including key phrase extraction, NER and concept extraction.
Key Phrase Extraction. Key phrase extraction aims to extract phrases that provide a concise summary of a document. It has been widely studied in data mining tasks, with both supervised [1, 7, 26, 37, 38, 49, 58, 78] and unsupervised methods [2, 5, 23, 24, 31, 46, 47, 50, 59, 72]. On the one hand, supervised methods often aim to train a complex model with the help of labeled data or external knowledge bases. For instance, Meng et al. [37, 38] designed an encoder-decoder framework to generate key phrases from the original text. Ahmad et al. [1] designed a novel transformer-based architecture to mine key phrases in long documents, which can extract and generate key phrases from the text simultaneously. Shang et al. [26, 49] proposed an approach that adaptively recognizes phrase occurrences based on quality estimation, which relies on external knowledge bases (e.g., Wikipedia) to some extent. On the other hand, unsupervised methods focus on mining the inner connections in documents in response to the lack of labeled data. Bellaachia et al. [2] designed an improved ranking algorithm based on PageRank and TextRank to evaluate the importance of words in documents, which they used to formulate key phrases. Liang et al. [24] proposed to utilize pretrained embeddings to represent both the document and candidate phrases, after which a ranking mechanism considering both local and global similarities in the context is conducted to select key phrases.
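The embedding-based ranking idea described above can be sketched as follows: represent the document and each candidate phrase as vectors, then rank candidates by similarity to the document. This is only an illustrative sketch; the toy vectors and the cosine-only scoring are our simplifying assumptions, not the exact method of [24], which in practice uses pretrained encoders and additional global-similarity terms.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(doc_vec, candidate_vecs, top_k=2):
    """Return the top_k candidate phrases most similar to the document."""
    scored = [(cosine(doc_vec, vec), phrase)
              for phrase, vec in candidate_vecs.items()]
    scored.sort(reverse=True)
    return [phrase for _, phrase in scored[:top_k]]

# Hypothetical 3-dimensional embeddings for illustration only.
doc_vec = [0.9, 0.1, 0.3]
candidates = {
    "wireless communication":  [0.8, 0.2, 0.4],
    "wire and cable":          [0.1, 0.9, 0.2],
    "multiplex communication": [0.7, 0.1, 0.5],
}
print(rank_candidates(doc_vec, candidates))
# → ['wireless communication', 'multiplex communication']
```

Here the generic phrase “wire and cable” is ranked out because its embedding lies far from the document vector, mirroring how such unsupervised rankers favor topically central phrases.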
Named Entity Recognition. NER focuses on locating and classifying named entities into pre-defined categories and is often regarded as a sequence labeling problem. In the early stages, researchers applied Conditional Random Field (CRF), SVM, and perceptron models with hand-crafted features [11, 19, 32]. In recent years, as deep learning has rapidly developed, NER has tended to be tackled with Recurrent Neural Networks (RNNs) and attention mechanisms [14, 20, 25, 33, 45, 62, 63, 66, 76, 77]. For example, Chiu et al. [33] proposed a hybrid BiLSTM-CNNs-CRF architecture to locate named entities in the original text. Lin et al. [25] put forward the “entity trigger” to improve the traditional LSTM&CRF framework, which increased the model’s interpretability and substantially reduced the amount of labeled data required. On the basis of the LSTM&CRF framework, Wu et al. [62] proposed an automatic annotation method via quote marks, which can help detect entities from Chinese chatbot conversation logs without supervision. Zhou et al. [66] proposed to treat NER as a multi-class classification problem over word pairs and designed a multi-head self-attention mechanism to mine word-level correlations for each entity type.
Some pretrained models have also been developed for this task [8, 17, 34]. For example, Honnibal et al. [17] released a package called spaCy for NER, noun phrase chunking, and other annotation tasks, which achieves good time efficiency and robustness.
Concept Extraction. Concept extraction aims to find words or phrases describing a concept within massive texts, and has been studied extensively in previous works [12, 21, 41, 56, 60, 69, 71, 75]. Generally speaking, concept extraction can also be formulated as a sequence labeling problem. Traditional methods often adopt a generation-and-selection mechanism, which consists of two steps: (1) extracting candidate concepts via hand-crafted rules or syntactic pattern matching; (2) selecting target concepts with supervised or unsupervised methods. For instance, Li et al. [21] utilized a range of models to generate possible concepts, and then designed a novel architecture to evaluate the fitness of the extracted concepts relative to the original text. Recently, with the help of strong computing power, deep learning methods have achieved remarkable performance. For example, Yang et al. [69] designed a transformer-based model to directly extract concepts from massive texts. Fang et al. [12] proposed a Guided Attention Network, in which three additional supervision signals are introduced to explore the structured information in raw text; it achieves good performance and learning efficiency.
In summary, the above studies focus on their respective target phrases and cannot be directly transferred to technical phrase extraction. First, supervised methods are unsuitable for our task, as there are insufficient labeled technical phrases in massive patent documents. Second, unsupervised approaches are often sensitive to the extraction target, meaning that certain gaps exist between technical phrase extraction and the existing methods. Moreover, the characteristics of patent data are another consideration: the technology relations between different levels in patents (i.e., “Title”, “Abstract”, and “Claim”) provide opportunities for aiding technical phrase recognition from patent documents, but cannot be effectively captured by existing models.
3 Overview
In this section, we first introduce the patent data and analyze the multi-aspect semantics in patent text. Then, based on expert experience and statistics, we provide a clear description of technical phrases in patents. Finally, we present the problem statement of technical phrase extraction and specify our solution overview.
3.1 Patent Data
The patent data we use is provided by the United States Patent and Trademark Office (USPTO), and comprises two domains, i.e., Mechanical Engineering and Electricity. Each patent contains a multi-level structure, i.e., “Title”, “Abstract”, and “Claim”, where “Title” and “Abstract” depict the topic and brief summary of a patent, while “Claim” is a more detailed and lengthy description of the inventor’s rights.
Multi-aspect Semantics. As noted in Section 1, on each level of a patent, there are often phrases that describe various aspects of the content, especially in long texts. To facilitate better analysis and illustration, we provide an example of phrases in “Abstract” and “Claim” in Figure 2. This figure reveals the distribution of all phrases in semantic space; each node represents a phrase, while the color indicates the semantic aspect to which the phrase belongs. We can easily determine that phrases in the same aspect often gather together in a sub-semantic-space and are much more closely related than others. We refer to this phenomenon as the multi-aspect semantics structure in the patent text, which provides a vital basis for the modeling of the technical phrase recognition process.
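The multi-aspect intuition above can be sketched with a toy grouping procedure: phrases whose embeddings are close in semantic space fall into the same aspect. The 2-dimensional vectors and the greedy similarity threshold below are illustrative assumptions, not the actual graph construction used by TechPat.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def group_into_aspects(phrase_vecs, threshold=0.8):
    """Greedily assign each phrase to the first aspect whose representative
    vector is similar enough; otherwise start a new aspect."""
    aspects = []  # each aspect: (representative_vector, [member phrases])
    for phrase, vec in phrase_vecs.items():
        for rep, members in aspects:
            if cosine(rep, vec) >= threshold:
                members.append(phrase)
                break
        else:
            aspects.append((vec, [phrase]))
    return [members for _, members in aspects]

# Hypothetical embeddings echoing the "Transmission" vs. "Computing" example.
phrases = {
    "uplink transmission":       [0.90, 0.10],
    "media stream transmission": [0.85, 0.15],
    "cloud computing":           [0.10, 0.90],
    "parallel computing":        [0.15, 0.85],
}
print(group_into_aspects(phrases))
# → [['uplink transmission', 'media stream transmission'],
#    ['cloud computing', 'parallel computing']]
```

Even this crude sketch separates the “Transmission” phrases from the “Computing” phrases, which is exactly the sub-semantic-space clustering visible in Figure 2.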
3.2 Description of Technical Phrase
In this subsection, we hire four experts to manually extract technical phrases from 100 patents in two domains, i.e., Mechanical Engineering and Electricity, respectively. After examining the technical phrases extracted from patents, we can make several specific observations:
(1) Part of Speech. Although the part-of-speech distribution of technical phrases shows various types, most of them are noun phrases. According to the statistics of extracted phrases, noun phrases account for more than 90%.
(2) Number of Words. As Figure 3 shows, the lengths of technical phrases in different domains are slightly different; however, most of them comprise 2\(\sim\)4 words, sometimes reaching 5.
(3) Semantic Context. In a patent document, there often exist similar technical phrases, such as “image encoding” and “image decoding”. It is easy to understand that technical phrases occurring in the same context will be relatively more similar to each other in semantics. Besides, technical phrases are expected to have a relatively independent technical meaning. While some phrases like “system architecture” also frequently occur in conjunction with technical phrases, these are not our focus as they have no specific technical meaning.
(4) Local Occurrence. On each level of a patent, technical phrases often appear more than once, especially in long texts, which can be seen as local occurrence. For example, among the technical phrases extracted from “Claim”, over 70% appear in the text at least twice.
(5) Global Occurrence. In the same patent document, a common technical phrase tends to appear repeatedly across different levels. That is to say, its global occurrence in the multi-level structure may provide some insights for aiding technical phrase recognition. To verify this point, a focused analysis is conducted in the following.
Figure 4(a) illustrates the average number of technical phrases at different levels. As we can see, the number of technical phrases increases rapidly from “Title” to “Claim” on both datasets. Figure 4(b) shows the average ratio of the number of technical phrases to the number of words at different levels. From “Title” to “Claim”, this ratio drops significantly, indicating that more and more non-technical phrases emerge, which greatly increases the difficulty of recognizing technical phrases. This clearly reveals that although technical phrases become increasingly abundant from top to bottom in the multi-level structure, the extraction difficulty rises rapidly as the interfering factors grow.
Meanwhile, we find that over 35% of “Abstract”s share at least one technical phrase with their “Title”s, and this percentage rises to 80% between “Claim”s and “Abstract”s. In other words, the technical phrases from the current level (e.g., “Title”) may play a guiding role in the technical phrase extraction of the next level (e.g., “Abstract”). We can therefore use the phrases extracted from the current level to help guide extraction in the next level, which formulates a multi-level model architecture and effectively utilizes the information between different levels.
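The cross-level overlap statistic discussed above can be computed with a few lines of code. The sketch below uses hypothetical toy data; `overlap_ratio` and the level names are our own illustrative conventions, not part of the released system.

```python
def overlap_ratio(patents, upper="title", lower="abstract"):
    """Fraction of patents whose `lower`-level phrase set shares at least
    one technical phrase with the `upper`-level phrase set."""
    hits = sum(1 for p in patents if p[upper] & p[lower])
    return hits / len(patents)

# Two toy patents with hand-made phrase sets for illustration.
patents = [
    {"title": {"wireless communication"},
     "abstract": {"wireless communication", "signal processing"}},
    {"title": {"fuel injection"},
     "abstract": {"engine control"}},
]
print(overlap_ratio(patents))  # → 0.5
```

Running this over a labeled corpus yields the 35% (Title–Abstract) and 80% (Abstract–Claim) figures reported above.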
Moreover, existing patent classification systems can be an initial driving force for technical phrase recognition, for example, the Cooperative Patent Classification Group (CPC Group), whose descriptions (Table 2) are highly relevant to technologies. Although both the quality and quantity of CPC Group descriptions are limited, we can still regard them as prior knowledge of technical phrases, which can help guide the extraction process at the first level (“Title”) of the patent.
3.3 Problem Statement
Based on the multi-level structure of patent documents, we attempt to extract technical phrases level by level. The extracted phrases in the current level will be seen as the prior knowledge for guiding the next level, while CPC Group descriptions can be seen as the initial level.
In more detail, for each level of a patent (i.e., “Title”, “Abstract”, and “Claim”), technical phrase extraction is formulated as a generation and selection problem [16]. That is to say, given the word sequence of a patent document \(\boldsymbol {x} = (x_1, x_2, \ldots , x_n)\), we first build a large-scale candidate pool \(\boldsymbol {Y} = \lbrace y_i, i = 1, 2, \ldots \rbrace\), where \(y_i = (x_m, x_{m+1}, \ldots , x_{m+l-1})\) is a possible technical phrase, \(n\) represents the length of the patent document, \(m\) indicates the starting location of the candidate phrase, and \(l\) represents the candidate phrase’s length. Next, from the candidate pool \(\boldsymbol {Y}\), we design a score-and-rank mechanism to select the final technical phrases. For convenience, we refer to the extracted phrase list at a certain level as \(P_{level}\), such as \(P_{title}\). Finally, with the technical phrases extracted from “Title”, “Abstract”, and “Claim”, we obtain the technical phrase set \(P_{all} = \lbrace P_{title}, P_{abstract}, P_{claim}\rbrace\) for each patent.
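The generation step of this formulation can be sketched directly: enumerate contiguous n-grams \(y_i = (x_m, \ldots, x_{m+l-1})\) as candidates. The length range of 2 to 5 words follows the statistics in Section 3.2; the scoring and ranking step that selects final phrases from the pool is deliberately omitted here.

```python
def generate_candidates(words, min_len=2, max_len=5):
    """Build the candidate pool Y of contiguous word n-grams from the
    word sequence x, with lengths between min_len and max_len."""
    pool = set()
    for m in range(len(words)):                  # starting location m
        for l in range(min_len, max_len + 1):    # candidate length l
            if m + l <= len(words):
                pool.add(" ".join(words[m:m + l]))
    return pool

x = "a wireless communication system".split()
cands = sorted(generate_candidates(x))
print(cands)
```

In practice, this exhaustive pool would then be pruned (e.g., by the noun phrase and occurrence criteria of Section 3.2) before the score-and-rank mechanism is applied.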
3.4 Solution Overview
Our solution overview is shown in Figure 5. Specifically, based on the existing CPC Group descriptions and patent documents organized in a multi-level structure (“Title”, “Abstract”, and “Claim”), we propose our TechPat model. TechPat mines the relations between different levels in the patent documents and follows a generation-and-selection process to recognize technical phrases, which will be introduced in detail in Section 4. After the extraction of technical phrases, we further apply them to two practical application tasks, i.e., searching relevant patents in the patent database, and classifying patents into given categories, which can prove the effect and application prospects of technical phrases.
In the following, we will specify the modeling process of our proposed TechPat model.
7 Generalization Ability of TechPat
In this section, we discuss the generalization ability of our proposed technical phrase extraction methods to more types of technical documents. In fact, besides patent documents, technical phrases also appear in other documents that contain rich technical information, such as scientific articles and papers. Moreover, these technical documents often have a multi-level structure similar to that of patents (e.g., “Title”, “Abstract”). With few adaptations, our proposed UMTPE and TechPat can be employed to recognize technical phrases from these documents directly. To verify their generalization ability, we apply these methods to a scientific article dataset and compare their extraction performance.
Specifically, we utilize the KP20k dataset [38], which contains the titles and abstracts of scientific articles in computer science. 100,000 pieces of data are sampled from the original KP20k dataset to conduct experiments, and more statistics of the dataset are presented in Table 15. The experiments follow the same setup as stated in Section 6.1. As for the initial level in UMTPE and TechPat (i.e., the CPC Group descriptions in the patent datasets), we replace it with the key phrases provided with these articles. Besides, we recalculate the statistical relation between the number of technical phrases (\(K\)) and the number of sentences (\(N_{sen}\)) in each scientific article. According to this observation, we set \(K=2N_{sen}\) for “Title” and \(K=N_{sen}\) for “Abstract”, respectively.
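The level-dependent budget stated above is a simple rule; a small helper (the function name is ours, for illustration only) makes it concrete:

```python
def phrase_budget(level, n_sen):
    """Number of technical phrases K to extract for a level, given the
    sentence count N_sen: K = 2 * N_sen for "Title", K = N_sen otherwise."""
    return 2 * n_sen if level == "title" else n_sen

print(phrase_budget("title", 1), phrase_budget("abstract", 5))  # → 2 5
```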
After that, we conduct the extraction experiments on the scientific article dataset, and evaluate the overall performance (i.e., Precision, Recall, and F1-score) on 100 scientific articles labeled with technical phrases. The results on the whole technical phrase set and on two levels (i.e., “Title” and “Abstract”) are listed in Table 16.
From this table, we can see that our proposed UMTPE and TechPat methods achieve the best performance compared with the other baselines, which demonstrates their effectiveness and superiority. Moreover, similar to the performance on the patent datasets, DBpedia achieves excellent performance in Precision but performs poorly in Recall and F1-score. This is because it relies entirely on an external database and can only extract a few phrases. As for the performance at different levels, NE-rank, Rake, Spacy, and JMLGC all perform well on “Title” but poorly on “Abstract”, probably because the extraction difficulty increases as the document gets longer. Finally, when we investigate the difference between UMTPE and TechPat, we find that TechPat achieves obvious improvements with the help of the multi-aspect graph structure and the newly revised measurement indicators, which is consistent with the experimental analysis on the patent datasets presented in Section 6.2.1.
In a nutshell, we apply our proposed technical phrase extraction methods, i.e., UMTPE and TechPat, to a scientific article dataset and achieve the best performance, which proves the strong generalization ability of our proposed methods.
8 Conclusion and Future Work
In this article, we explored a promising direction of technical phrase extraction from patent data. Specifically, we first presented a clear and detailed description of technical phrases in patents based on various prior works, practical experience, and statistical analyses. Then, by analyzing the characteristics of technical phrases and effectively modeling the complex structure of patent documents (such as multi-aspect semantics and multi-level relevance), we developed a novel unsupervised model, namely TechPat, which can recognize technical phrases from massive patent texts without requiring expensive human labeling. Subsequently, we designed a novel metric called IRE to evaluate extracted phrases from the perspective of representation ability, which supplements traditional evaluation metrics like Precision and Recall. Extensive experiments on real-world patent data demonstrated that the TechPat model can effectively discriminate technical phrases in patents and greatly outperform existing methods. We further applied the extracted technical phrases to two practical application tasks, where the experimental results confirmed the effect and application prospects of technical phrases. Finally, we transferred our proposed extraction methods to a scientific article dataset and demonstrated their strong generalization ability to more technical documents.
In future work, we would like to explore more applications of technical phrases in the patent field, such as patent similarity prediction and patent valuation. Moreover, we also plan to extract technical phrases from more types of technical documents, such as technical news.