
Automatic Keyword Extraction Using Domain Knowledge

2010, … Linguistics and Intelligent …

Anette Hulth¹, Jussi Karlgren², Anna Jonsson³, Henrik Boström¹, and Lars Asker¹

¹ Dept. of Computer and Systems Sciences, Stockholm University, Electrum 230, SE-164 40 Kista, Sweden. [hulth|henke|asker]@dsv.su.se
² Swedish Institute of Computer Science, Box 1263, SE-164 29 Kista, Sweden. jussi@sics.se
³ Department of Information Studies, University of Sheffield, Western Bank, Sheffield, S10 2TN, UK. a.jonsson@sheffield.ac.uk

Abstract. Documents can be assigned keywords by frequency analysis of the terms found in the document text, which is arguably the primary source of knowledge about the document itself. By including a hierarchically organised domain-specific thesaurus as a second knowledge source, the quality of such keywords was improved considerably, as measured by match to previously manually assigned keywords.

1 Introduction

Information retrieval research has long focused on developing and refining methods for full-text indexing, with the aim of improving full-text retrieval. The practice of assigning keywords¹ to documents in order either to describe the content or to facilitate future retrieval, which is what human indexers do, has more or less been ignored by researchers in the various fields of computer science. However, we believe that keyword indexing, apart from being useful on its own, may play a complementary role to full-text indexing. In addition, it is an interesting task for machine learning experiments due to the complexity of the activity. We extend previous work on automatic keyword assignment (see e.g. [1]) to include knowledge from a thesaurus. In this article, we present experiments where, for each document in a collection, we automatically extract a list of potential keywords. This list, we envision, can be given to a human indexer, who in turn can choose the most suitable terms from it.
The experiments were conducted on a set of documents from the Swedish Parliament, all of which had previously been manually indexed by professional indexers. Using machine learning algorithms and morphological pre-processing tools, we combined knowledge from both the documents themselves and a hierarchically organised thesaurus developed to suit the domain, and found that we were able to generate lists of potential keywords that covered the manually extracted examples well.

¹ We will call a small set of terms selected to capture the content of a document keywords. Index terms is an alternative term we also use; the choice mostly depends on what the set of words is used for: describing the document or facilitating its retrieval.

2 Keyword Extraction

2.1 Manual Keyword Assignment

The traditional way of organising documents and books is to sort them physically on shelves according to predetermined categories. This generally works well, but finding the right balance between category generality and category specificity is difficult; the library client has to learn the categorisation scheme; quite often it is difficult to determine what category a document belongs to; and quite often a document may rightly belong to several categories. Some of the drawbacks of categorisation can be remedied by installing an index to the document collection. Documents can be given several pointers using several methods and can thus be reached by any of several routes. Indexing is the practice of establishing correspondences between a set, possibly large and typically finite, of keywords or index terms and individual documents or sections thereof. Keywords are meant to indicate the topic or the content of the text: the set of terms is chosen to reflect the topical structure of the collection, such as it can be determined. Indexing is typically done by indexers: persons who read documents and assign keywords to them.
Manual indexing is often both difficult and dull; it places great demands on consistency from indexing session to indexing session and between different indexers. It is the sort of job which is a prime candidate for automation. Automating human performance is, however, never trivial, even when the task at hand may seem repetitive and non-creative at first glance. Manual indexing is a quite complex task, and difficult to emulate by computer. Manual indexers and abstractors are not consistent, much to the astonishment of documentation researchers [2]. In fact, establishing a general-purpose representation of a text's content is probably an impossible task: anticipating future uses of a document is difficult at best.

2.2 Automatic Indexing

By and large, computerised indexing schemes have moved away from their early goal of emulating human indexing performance to concentrate on what computers do well, namely working over large bodies of data. Where initially the main body of work in information retrieval research was to develop methods to handle the relative poverty of data in reference databases and title-only or abstract-only document bases, the focus has shifted to developing methods to cope with the abundance of data and the dynamic nature of document databases. Typically, manual indexing schemes control the indexing process through careful instructions and an established set of allowed keywords or index terms. This naturally reduces variation, but also limits the flexibility of the resulting searches: the trade-off between predictability and flexibility becomes a key issue. The idea of limiting semantic variation to a discrete and predetermined set of well-defined terms, an idea which crops up regularly in fields such as artificial intelligence or machine translation, is of course a dramatic simplification of human linguistic behaviour. This is where the most noticeable methodological shift of the past forty years can be found.
Systems today typically do not take the set of index terms to be predefined, but use the material they find in the texts themselves as the starting point [3, 4]. This shift is accompanied by a shift from a set-theoretical view of document bases to a probabilistic view of retrieval: modern retrieval systems typically do not view retrieval as operations on a set of documents, with user requests as constraints on set membership, but instead rank documents by likelihood of relevance to the words or terms the reader has offered to the system, based on some probabilistic calculation. The indexes typically generated by present-day systems are geared towards fully automatic retrieval of full texts rather than a traditional print index used for access to bibliographical data or card catalogues. A traditional print index naturally must be small enough to be useful for human users. Under the assumption that no human user will ever actually read the index terms, the number of index terms can be allowed to grow with practically no limit. This makes the task of indexing texts different from the task that earlier efforts worked on.

2.3 Integrating the Approaches

While the past decades have seen rapid development of full-text systems, manual indexing has, in general, not been supplanted by automatic full-text retrieval systems. Manual indexing has been performed continuously all along, and recently renewed attention has been brought to its value, by Yahoo, for example, with its manually produced catalogue index made up of few, well-edited terms. (Experiments on automatically assigning Yahoo categories have been performed by Mladenic [5].) Manual indexing, with its high quality and excellent precision, will continue to have a role to fulfil in information access applications and services, but there is ample room to develop semi-automatic tools to ensure consistency and raise the productivity of human indexers.
Combinations of automatic and manual approaches seem most promising. A digital library can capitalise on the qualitative work done by manual indexing to improve topical clustering of documents. If a simple topical clustering tool is available, clustering hand-categorised documents into a suitable number of topical clusters affords the possibility of using the manually assigned keywords as a reasonably lucid representation of the topical clusters. Thereafter, uncategorised documents can be directed automatically to the cluster nearest to them, with the clusters of higher quality, and better described, thanks to the keywords.

3 Document Representation with Thesaurus Knowledge

A thesaurus or term database which is hierarchically organised holds valuable information for indexing purposes. The richness of the semantic relations between the included terms, which to some extent resembles a human's knowledge of the world, is difficult to derive solely from word occurrence frequencies in documents. We report here on experiments on bills from the 16 committees of the Swedish Parliament and the thesaurus used for manual indexing of these documents.

3.1 Standard Methods: tf.idf

Arguably, the most important knowledge source for finding important descriptors for a document is the document itself. Picking the most central terms in a document can be done using term frequency, or the tf measure: frequent terms, allowing for document length normalisation, can be assumed to be important. A second important knowledge source about the comparative utility of descriptors is their distribution across the collection: a term that is frequent in one document is only a useful descriptor if it is comparatively infrequent in the collection as a whole. This can be estimated using the standard collection frequency or idf measure: calculating the proportion of documents a term appears in.
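The combination of tf and idf described above can be sketched as follows. This is a minimal illustration in Python (the function name and the toy documents are our own, not part of any system described in this article), using the common log-scaled form of idf:

```python
import math
from collections import Counter

def tf_idf(documents):
    """Score each term in each document by tf.idf: term frequency
    normalised by document length, times log-scaled inverse of the
    proportion of documents the term appears in."""
    n_docs = len(documents)
    # document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        length = len(doc)
        scores.append({
            term: (count / length) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# toy collection: three very short "documents"
docs = [
    ["bistand", "handel", "bistand", "riksdag"],
    ["handel", "riksdag"],
    ["bistand", "skola"],
]
scores = tf_idf(docs)
```

In the first toy document, "bistand" outranks "riksdag": both occur in two of the three documents (same idf), but "bistand" occurs twice in the document (higher tf).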
3.2 Thesaurus

The public record of parliamentary activities has a central place in the public perception of Swedish political life, and it is important that the material is accessible. The Swedish Parliament manually indexes a large number of documents in order to give access both to information specialists and to the general public. This indexing effort has been ongoing for a long period of time, during which an extensive, hierarchically organised, domain-specific thesaurus has been developed, assuring a consistent vocabulary. The thesaurus from the Parliament, which follows the ISO 2788 standard, consists of 2,500 terms organised hierarchically by broader term (BT)/narrower term (NT) relations. Figure 1 shows an excerpt from the thesaurus: the word arbetshandikapp (employment disability), its broader term, some narrower terms, some related terms (RT) and a brief description of the concept the term refers to (SN, scope notes).

Arbetshandikapp (employment disability)
  BT Arbetsliv (working life)
  NT Arbetsbiträde (working assistant)
  NT Näringshjälp (grant for resettlement in an independent business)
  NT Skyddat arbete (sheltered employment)
  RT Anställningsfrämjande åtgärder (measures to stimulate employment opportunities)
  RT Handikapp (handicap)
  RT Lönebidrag (salary contribution)
  SN Nedsatt arbetsförmåga pga fysiska, psykiska, förståndsmässiga eller socialmedicinska handikapp, däri inbegripet missbruk av alkohol eller annat berusningsmedel. (Reduced ability to work due to physical, psychological, rational or social-medical disability, including abuse of alcohol or other intoxicant.)

Fig. 1. Excerpt from the thesaurus used at the Swedish Parliament (with English equivalents).

4 Empirical Evaluation

The goal of our experiments was to automatically find all the keywords assigned by the human indexers, as well as to suggest or generate further potential keywords.
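A BT/NT/RT structure of the kind shown in Figure 1 can be represented very compactly, since NT is simply the inverse of BT and siblings are terms sharing a broader term. The sketch below is our own illustration of such a data structure, not the Parliament's actual thesaurus software; the class and method names are hypothetical (Swedish terms are given without diacritics for simplicity):

```python
class Thesaurus:
    """Minimal ISO-2788-style thesaurus: each term may have one broader
    term (BT); narrower terms (NT) and siblings are derived from it,
    and related terms (RT) form a symmetric relation."""

    def __init__(self):
        self.broader = {}   # term -> its BT
        self.related = {}   # term -> set of RT terms

    def add_bt(self, term, bt):
        self.broader[term] = bt

    def add_rt(self, a, b):
        self.related.setdefault(a, set()).add(b)
        self.related.setdefault(b, set()).add(a)

    def narrower(self, term):
        # NT is the inverse of BT
        return [t for t, bt in self.broader.items() if bt == term]

    def siblings(self, term):
        # terms sharing this term's broader term, excluding the term itself
        bt = self.broader.get(term)
        if bt is None:
            return []
        return [t for t in self.narrower(bt) if t != term]

th = Thesaurus()
th.add_bt("arbetshandikapp", "arbetsliv")
th.add_bt("arbetsbitrade", "arbetshandikapp")
th.add_bt("skyddat arbete", "arbetshandikapp")
th.add_rt("arbetshandikapp", "handikapp")
```

With this structure, sibling and mother-term lookups of the kind used in Section 4.1 become one-line queries.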
The decision to identify a term as a potential keyword was made on the basis of a set of features calculated for each content word in the text. We will refer to words from the texts that were actually chosen as keywords by the human indexer as positive examples and to the other terms as negative examples. By an example we mean one term with its feature values. Our purpose was to build a keyword identifier that would emphasise high recall with rather less regard for precision; in other words, we would rather get false positives than miss keywords assigned by the human indexers.

4.1 Experimental Set-up

For our experiments we used 128 electronic documents in Swedish: bills from the 16 different committees from the year 98/99. The style of the texts is quite homogeneous: rather formal and dense in information. The subjects, however, differ widely, including for example social welfare, foreign affairs and housing policy. The length of the bills also varies considerably: in this set from 117 to 11,269 words per document, although only 26 documents have more than 1,000 words. For all documents, the corresponding keywords assigned by the professional indexers were available. The number of keywords per document varies between 1 and 12 in the set used, and longer documents tend to have a larger number of keywords. In order to know how best to make use of the hierarchy of the thesaurus, we first inspected how the training texts had been indexed manually. We found that several texts that contained a number of sibling terms, i.e. terms that shared a common broader term, were indexed with either the broader mother term, or even the still broader grandmother term. We found, unsurprisingly, that the number of sibling terms seemed to influence the presence of the mother or grandmother term. This seemed to be a useful factor to take into consideration, along with the frequency data, when finding potential keywords.
In conclusion, the features we used for our experiment are displayed in Figure 2.

Term features                        Thesaurus features
Term frequency (tf)                  Mother term present or not (ma)
Normalised frequency (nf)            Grandmother term present or not (gran)
Inverse document frequency (idf)     Number of siblings in document, including the term itself (sib(d))
                                     Number of siblings in thesaurus (sib(t))
                                     Number of children in document, including the term itself (kid(d))
                                     Number of children in thesaurus (kid(t))

Fig. 2. The nine chosen features from the two knowledge sources.

The words in the documents were annotated for part of speech and morphologically normalised to base form using a two-level morphological analysis tool developed by Lingsoft Oy of Helsinki. After this process, all words were in lower case, and they were all single-word terms. Form words (function words) were removed, as well as verbs and adjectives, leaving the nouns for further experiments. To limit mismatches in comparisons against the thesaurus, whose items were not always in base form, but occasionally definite or plural, both the surface form and the lemmatised form of the documents' words were kept and matched. For each word, a set of feature values was calculated. An example of the output of this process is shown in Figure 3 for two terms.

Term            tf  nf      idf   ma  gran  sib(d)  sib(t)  kid(d)  kid(t)
bistånd (aid)   6   0.0050  1/5   1   0     2       5       2       7
handel (trade)  3   0.0025  1/11  1   0     3       6       1       5

Fig. 3. Example of data for two terms (with English equivalents) (d = in document; t = in thesaurus).

The whole set of documents was divided in two: one set of 99 texts used for finding the hypothesis, and the remaining 29 documents for testing. Thus, the material used for testing did in no way influence the training. The division into training and test sets was made arbitrarily, taking only the diversity of subjects into consideration.
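As a rough sketch, the nine features of Figure 2 could be computed along the following lines. This is our own reconstruction, not the actual feature-extraction code; in particular, the exact counting conventions (e.g. whether the term itself is counted among its thesaurus siblings) are assumptions, and the idf value is computed as the proportion of documents containing the term, matching the fractional values shown in Figure 3:

```python
from collections import Counter

def term_features(term, doc_terms, all_docs, broader):
    """Compute nine features for one candidate term.

    doc_terms: list of (lemmatised) nouns in the document
    all_docs:  list of such lists, for the idf feature
    broader:   dict mapping each thesaurus term to its broader term (BT)
    """
    counts = Counter(doc_terms)
    present = set(doc_terms)
    tf = counts[term]
    nf = tf / len(doc_terms)                     # length-normalised frequency
    df = sum(1 for d in all_docs if term in d)
    idf = df / len(all_docs)                     # proportion of documents, as in Fig. 3
    bt = broader.get(term)
    ma = int(bt is not None and bt in present)   # mother term in document?
    gran = int(bt is not None and broader.get(bt) in present)
    # siblings share the term's broader term (the term itself included)
    sibs_t = [t for t, b in broader.items() if b == bt] if bt else [term]
    kids_t = [t for t, b in broader.items() if b == term]
    return {
        "tf": tf, "nf": nf, "idf": idf, "ma": ma, "gran": gran,
        "sib(d)": sum(1 for t in sibs_t if t in present),
        "sib(t)": len(sibs_t),
        "kid(d)": sum(1 for t in kids_t if t in present) + 1,  # incl. term itself
        "kid(t)": len(kids_t),
    }

# hypothetical toy data
broader = {"bistand": "utrikespolitik", "handel": "utrikespolitik"}
doc = ["bistand", "handel", "utrikespolitik", "bistand"]
features = term_features("bistand", doc, [doc, ["handel"], ["skola"]], broader)
```

Each positive or negative example in the data set is then simply one term paired with such a feature dictionary.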
Because of this arbitrariness, the proportions of positive and negative examples in the two sets differ slightly. In Table 1 we present some details for the training set, the test set and the whole set. As can be noted, the proportion of negative examples is very large, which is often the case in information access tasks.

Table 1. The data set in detail.

                    Training set  Test set  Total
Positive ex. (no)   185           57        242
Positive ex. (%)    1.38          1.20      1.34
Negative ex. (no)   13 175        4 708     17 883
Negative ex. (%)    98.62         98.80     98.66
Total (no)          13 360        4 765     18 125

4.2 Method

Virtual Predict is a system for induction of rules from pre-classified examples [6]. It is based on recent developments within the field of machine learning, in particular inductive logic programming [7]. The system can be viewed as an upgrade of standard decision tree and rule induction systems, in that it allows more expressive hypotheses to be generated and more expressive background knowledge (i.e., logic programs) to be incorporated in the induction process. The major design goal has been to achieve this upgrade in such a way that it is still possible to emulate the standard techniques, with their lower expressiveness (but also lower computational cost), within the system if desired. As a side effect, this has allowed several recent methods developed for standard machine learning techniques to be incorporated into the more powerful framework of Virtual Predict. Boosting is one of the techniques that have been incorporated. Boosting is an ensemble learning method which maintains a probability distribution over the training examples. This distribution is re-adjusted on each iteration so that the learning algorithm focuses on those examples that have been incorrectly classified on previous iterations. New examples are classified according to a weighted vote of the classifiers produced on each iteration.
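The reweighting loop that boosting performs can be sketched as follows. This is a generic AdaBoost over weak classifiers, our own illustration rather than Virtual Predict's implementation:

```python
import math

def adaboost(examples, labels, stumps, rounds=10):
    """AdaBoost sketch: maintain a distribution over training examples,
    reweighting after each round so misclassified examples gain weight.

    stumps: candidate weak classifiers, each a function x -> +1/-1
    Returns a list of (alpha, stump) pairs; classify by weighted vote.
    """
    n = len(examples)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        # pick the stump with lowest weighted error on the current distribution
        best = min(stumps, key=lambda h: sum(
            wi for wi, x, y in zip(w, examples, labels) if h(x) != y))
        err = sum(wi for wi, x, y in zip(w, examples, labels) if best(x) != y)
        if err == 0:
            ensemble.append((1.0, best))
            break
        if err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, best))
        # up-weight mistakes, down-weight correct examples, renormalise
        w = [wi * math.exp(-alpha * y * best(x))
             for wi, x, y in zip(w, examples, labels)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def classify(ensemble, x):
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# toy usage: threshold stumps on a single numeric feature
xs = [0, 1, 2, 3, 4, 5]
ys = [-1, -1, -1, 1, 1, 1]
stumps = [lambda x, t=t: (1 if x > t else -1) for t in range(6)]
ensemble = adaboost(xs, ys, stumps)
```

Per-class weighting of the kind Virtual Predict offers could be emulated here by initialising w non-uniformly, e.g. giving each positive example many times the weight of a negative one.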
The boosting method used in Virtual Predict is called AdaBoost, with an optional setting (stumps only) to allow for faster induction and more compact hypotheses (see [8] for details). Another feature of Virtual Predict is that it allows instances belonging to different classes to be weighted differently. This turned out to be a very useful feature in the current study, as the data set is very unbalanced. For the training phase we were only interested in the values of the features associated with each word, as described in Section 3; contextual data such as collocation of words in a common document or in similar contexts were not taken into account.

4.3 Experimental Results

The parameter setting with the best performance was 200 iterations with the positive examples given 100 times higher weight than the negative ones. The result can be seen in Figure 4 (for the 29 documents in the test set). In the result calculations, we considered only those manually assigned keywords that were single-word terms actually present in the documents.

No. positive correctly classified                          54
No. positive incorrectly classified                         3
No. negative correctly classified                       4 291
No. negative incorrectly classified (false positives)     417
Recall, positive                                       0.9474

Fig. 4. Test result.

In Figure 4 we can see that the 29 documents have 417 candidate keywords in addition to the ones that were correctly classified. Looking at each document separately, the number of new potential keywords varies between 2 and 47, with a median value of 10. In other words, most documents had a reasonably low number of additional suggestions: for 24 documents in the test set this number is below 18. The number of suggestions (including the correct ones) as a percentage of the total number of words in the documents ranges from 0.453% to 3.91%, the average being 1.90%. Of all the suggested words, just one meaningless word slipped through (n1, the name of a fund).
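The positive recall figure reported above follows directly from the counts in Figure 4. A one-line check:

```python
def recall(true_pos, false_neg):
    """Recall on the positive class: the share of indexer-assigned
    keywords that the system also found."""
    return true_pos / (true_pos + false_neg)

# figures from the test run above: 54 keywords found, 3 missed
r = recall(54, 3)   # 54/57
```

The 54 found plus 3 missed positives account for all 57 positive examples in the test set (Table 1), giving a recall of 54/57 ≈ 0.9474.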
Of the three positive examples that we were unable to find, one was due to a shortcoming of the pre-processing program, which treated the letter p from an abbreviation as a word, combined with a bug in the program assigning the class which, because of the &-character in Pharmacia & Upjohn, also treated this word as the word p. (The term Pharmacia & Upjohn should not have been there at all, as it is a multi-word term.) An example of the output for one document is shown in Figure 5.

Name of document: a12
No. of words in input to Virtual Predict: 192
No. of keywords present in input: 5
No. of correct keywords found: 5
  sjukgymnastik (physiotherapy), läkarvårdsersättning (medical treatment compensation), etableringsfrihet (freedom of establishment), företagshälsovård (occupational health care), arbetsmiljö (work environment)
No. of candidate keywords (false positives): 15
  arv (inheritance), arbetstagare (employee), arbetsgivare (employer), konkurrens (competition), sjukvård (medical care), sjukskrivning (reporting sick), primärvård (primary care), läkare (doctor), etableringsrätt (right of establishment), patienter (patients), landsting (county council), rehabilitering (rehabilitation), arbetsmiljölag (occupational safety and health act), läkarvårdstaxa (rates for medical treatment), finansiering (financing)
No. of missed keywords present in input: 0

Fig. 5. Example of output for one document (with English equivalents).

4.4 Evaluation

The results from these initial experiments have not yet been evaluated by indexers from the Swedish Parliament. To get an accurate judgement of the true quality of the results, this would be highly desirable, since only persons working in the field, with thorough knowledge of the domain, can tell whether specific keywords are likely to be useful. We have, however, ventured an evaluation by reading through 15 of the 29 documents in the test set, in order to compare their actual content to the corresponding derived keywords.
The conclusion drawn from this is that 60 to 80% of the suggested keywords (including the correct ones) are in fact relevant to the content of the documents. We judged the extracted keywords to be appropriate in describing the documents and, in some cases, even to reflect the content better than the manually indexed model keywords. This would seem to indicate the potential utility of the tool for an indexing editor. A convenient feature of inductive logic programming algorithms for the purpose of result evaluation is the ease with which one can study the rules in the generated hypotheses. The rule applied first in the process is the one based on the most discriminating feature, which gives us the possibility of assessing which of the features is the most important in the data set. According to Virtual Predict this feature is the mother term (i.e., a hierarchical feature from the thesaurus).

4.5 A Second Run

In order to establish that a thesaurus is indeed a useful knowledge source, we made a new run with the same parameter setting and the same training set, only removing the six thesaurus features. The result of this run is presented in Figure 6. As can be noted, this result supports our view that hierarchically organised domain knowledge is crucial to a tool with the current aim.

No. positive correctly classified                          37
No. positive incorrectly classified                        20
No. negative correctly classified                       3 935
No. negative incorrectly classified (false positives)     773
Recall, positive                                       0.6491

Fig. 6. Test result without the thesaurus features.

5 Concluding Remarks

The initial results are, as mentioned earlier, very encouraging. At this stage of algorithm development, we consider recall to be the most important evaluation criterion, as it reflects the potential of the current approach: it is crucial to know that none of the terms chosen by an indexer have been missed by the system. There are, however, additional issues that need to be looked into in more depth.
The first, and most trivial, would be to remove meaningless words, e.g. single characters, before the data is processed by Virtual Predict. In addition, we need to take abbreviations into account, as some words are represented in the text only by their short form, e.g. IT rather than the full form informationsteknik (information technology). For applications to real-life indexing tasks, some form of utility likelihood measure should be applied to the results (as in e.g. [1]). As stated earlier, we have so far only looked at single-word terms. This means that a certain number of potential index terms, as well as of terms selected by the human indexers, have been ignored. However, as Swedish is rich in compounding, this is much less of a problem than it would have been for English (all words in Figure 5, for example, are single-word terms in Swedish). One could possibly start by matching terms in the thesaurus with phrases in the documents. We also need to investigate further how to handle the cases where the thesaurus form of a word is not its base form. Alternatives include allowing more forms of the word in the system or normalising the thesaurus. Another improvement to the extraction of potential keywords would be to recognise proper nouns, as they sometimes play an important role in this type of document. Adding a module for name recognition seems a straightforward way to approach this issue. Sometimes an indexer will choose a word that is not present in the document itself, and suggesting keywords absent from the text is a challenging matter. A thesaurus will, however, most likely provide some clues to this, such as when broader terms tend to be preferred to certain specific terms used in the text. The potential keywords in our experiments are not necessarily terms chosen from the thesaurus. Human indexers very rarely go beyond the thesaurus limits; this mainly happens in the case of proper names. We did not feel the need to limit the suggestions to the thesaurus.
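The first of the two alternatives mentioned above, allowing more forms of a word when matching against the thesaurus, can be sketched as follows. The helper name and examples are hypothetical, and the words shown here are toy data, not drawn from the actual thesaurus:

```python
def match_thesaurus(surface_forms, lemmas, thesaurus_terms):
    """Match a document against the thesaurus using both the surface
    (possibly definite or plural) forms and the lemmatised forms, so
    that thesaurus entries that are not in base form can still be found."""
    forms = set(surface_forms) | set(lemmas)
    return sorted(t for t in thesaurus_terms if t in forms)

# "patienter" is matched via its plural surface form, which the
# thesaurus happens to list; "landstingen" is matched via its lemma
hits = match_thesaurus(["patienter", "landstingen"],
                       ["patient", "landsting"],
                       {"patienter", "landsting", "skola"})
```

Normalising the thesaurus itself would be the mirror-image approach: lemmatise the thesaurus entries once, then match on lemmas only.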
We want to keep track of new potential words and propose their timely inclusion in the thesaurus, as well as point out thesaurus terms that no longer seem to be used. Word usage reflects the constantly changing nature of our society, and as phenomena in society vary over time, so does the use of words. Keeping a thesaurus up to date is a difficult task, and is in itself a complex research issue. However, we believe that this sort of tool set can lend itself to thesaurus management tools as well as document labelling.

References

1. Turney, P.D. (2000). Learning Algorithms for Keyphrase Extraction. Information Retrieval, 2(4):303–336. Kluwer Academic Publishers.
2. Earl, L.L. (1970). Information Storage & Retrieval, volume 6, pp. 313–334. Pergamon Press.
3. Luhn, H.P. (1957). A Statistical Approach to Mechanical Encoding and Searching of Literary Information. IBM Journal of Research and Development, 1:309–317.
4. Luhn, H.P. (1959). Auto-Encoding of Documents for Information Retrieval Systems. In: Boaz, M. (ed.) Modern Trends in Documentation, pp. 45–58. Pergamon Press, London.
5. Mladenic, D. (1998). Turning Yahoo into an Automatic Web-Page Classifier. In: Prade, H. (ed.) 13th European Conference on Artificial Intelligence ECAI 98, pp. 473–474.
6. Boström, H. (2000). Manual for Virtual Predict 0.8. Virtual Genetics Inc.
7. Nienhuys-Cheng, S.-H., and de Wolf, R. (1997). Foundations of Inductive Logic Programming. LNAI 1228. Springer.
8. Freund, Y., and Schapire, R.E. (1996). Experiments with a New Boosting Algorithm. In: Machine Learning: Proceedings of the Thirteenth International Conference, pp. 148–156.