Journal of Machine Learning Research 4 (2003) 493-525    Submitted 5/01; Published 8/03

Learning Semantic Lexicons from a Part-of-Speech and Semantically Tagged Corpus Using Inductive Logic Programming

Vincent Claveau (vincent.claveau@irisa.fr)
Pascale Sébillot (pascale.sebillot@irisa.fr)
IRISA, Campus de Beaulieu, 35042 Rennes cedex, France

Cécile Fabre (cecile.fabre@univ-tlse2.fr)
ERSS, University of Toulouse II, 5 allées A. Machado, 31058 Toulouse cedex, France

Pierrette Bouillon (pierrette.bouillon@issco.unige.ch)
TIM/ISSCO - ETI, University of Geneva, 40 Bvd du Pont-d'Arve, CH-1205 Geneva, Switzerland

Editors: James Cussens and Alan M. Frisch

Abstract

This paper describes an inductive logic programming learning method designed to acquire from a corpus specific Noun-Verb (N-V) pairs (relevant in information retrieval applications to perform index expansion) in order to build up semantic lexicons based on Pustejovsky's generative lexicon (GL) principles (Pustejovsky, 1995). In one of the components of this lexical model, called the qualia structure, words are described in terms of semantic roles. For example, the telic role indicates the purpose or function of an item (cut for knife), the agentive role its creation mode (build for house), etc. The qualia structure of a noun is mainly made up of verbal associations, encoding relational information. The learning method enables us to automatically extract, from a morphosyntactically and semantically tagged corpus, N-V pairs whose elements are linked by one of the semantic relations defined in the qualia structure in GL. It also infers rules explaining what in the surrounding context distinguishes such pairs from others that are also found in sentences of the corpus but are not relevant. Stress is put here on the learning efficiency that is required to be able to deal with all the available contextual information and to produce linguistically meaningful rules.

Keywords: corpus-based acquisition, lexicon learning, generative lexicon, inductive logic programming, subsumption under object identity, private properties

1. Introduction

The aim of information retrieval (IR) is to develop systems able to provide a user who questions a document database with the most relevant texts. In order to achieve this goal, a representation of the contents of the documents and/or the query is needed, and one commonly used technique is to associate those elements with a collection of some of the words that they contain, called index terms. For example, the most frequent (simple or compound) common nouns (N), verbs (V) and/or adjectives (A) can be chosen as indexing terms; see Salton (1989), Spärck Jones (1999) and Strzalkowski (1995) for other possibilities. The solutions proposed to the user are the texts whose indexes best match the query index. The quality of IR systems therefore depends highly on the indexing language that has been chosen. Their performance can be improved by offering more extended possibilities of matching between indexes. This can be achieved through index expansion, that is, the extension of index words with other words that are close to them, in order to increase the chances of a match. Morpho-syntactic expansion is quite usual: for example, the same index words in plural and singular forms can be matched.
Systems with linguistic knowledge databases at their disposal can also deal with one type of semantic similarity, usually limited to specific intra-category reformulations (especially N-to-N ones), following synonymy or hyperonymy links: for example, the index word car can be expanded into vehicle. Here we deal with a new kind of expansion that has been proven to be particularly useful (Grefenstette, 1997; Fabre and Sébillot, 1999) for document database questioning. It concerns N-V links and aims at allowing matching between nominal and verbal formulations that are semantically close. Our objective is to permit a matching, for example, between a query index disk store and the text formulation to sell disks, related by the semantic affinity between an entity (store) and its typical function (sell). N-V index expansion however has to be controlled in order to ensure that the same concept is involved in the two formulations. We have chosen Pustejovsky’s generative lexicon (GL) framework (Pustejovsky, 1995; Bouillon and Busa, 2001) to define what a relevant N-V link is, that is, an N-V pair in which the N and the V are related by a semantic link that is prominent enough to be used to expand index terms. In the GL formalism, lexical entries consist of structured sets of predicates that define a word. In one of the components of this lexical model, called the qualia structure, words are described in terms of semantic roles. The telic role indicates the purpose or function of an item (for example, cut for knife ), the agentive role its creation mode (build for house ), the constitutive role its constitutive parts (handle for handcup ) and the formal role its semantic category (contain (information) for book ). The qualia structure of a noun is mainly made up of verbal associations, encoding relational information. Such N-V links are especially relevant for index expansion in IR systems (Fabre and Sébillot, 1999; Bouillon et al., 2000b). In what follows, we will thus consider as a relevant N-V pair a pair composed of an N and a V related by one of the four semantic relations defined in the qualia structure in GL. However, GL is currently no more than a formalism; no generative lexicons exist that are precise enough for every domain and application (for example IR), and the manual construction cost of a lexicon based on GL principles is prohibitive. Moreover, the real N-V links that are the keypoint of the GL formalism vary from one corpus to another and cannot therefore be defined a priori. A way of building such lexicons—that is, such N-V pairs in which V plays one of the qualia roles of N—is required. The aim of this paper is to present a machine learning method, developed in the inductive logic programming framework, that enables us to automatically extract from a corpus N-V pairs whose elements are linked by one of the semantic relations defined in the GL qualia structure (called qualia pairs hereafter), and to distinguish them, in terms of surrounding categorial (Part-of-Speech, POS) and semantic context, from N-V pairs also found in sentences of the corpus but not relevant. Our method must respect two kinds of properties: firstly it must be robust, that is, it must infer rules explaining the concept of qualia pair that can be used on a corpus to automatically acquire GL semantic lexicons. Secondly it has to be efficient in producing generalizations from a large amount 494 L EARNING S EMANTIC L EXICONS U SING ILP of possible contextual information found in very large corpora. 
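To make the target of the acquisition task concrete, a lexicon of qualia pairs can be pictured as a flat set of noun-verb associations labelled with the role that links them. The Prolog-style facts below are only an illustrative sketch built from the examples quoted above; the predicate name qualia/3 and this flat encoding are ours, not part of the GL formalism or of the method described in this paper.

  qualia(knife, cut,     telic).      % the purpose of a knife is to cut
  qualia(house, build,   agentive).   % a house comes into existence by being built
  qualia(book,  contain, formal).     % a book is something that contains information

The learning task addressed below is precisely to decide, for a candidate N-V pair observed in a corpus, whether such an association is warranted by one of the four qualia roles.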
This work has also a linguistic motivation: linguists do not currently know all the patterns that are likely to convey qualia relations in texts and cannot therefore verbalize rules that describe them; the generalizations inferred by our learning method have thus a linguistic interest. The paper will be divided into four parts. Section 2 briefly presents a little more information about GL and motivates using N-V index expansion based on this formalism in information retrieval applications. Section 3 describes the corpus that we have used in order to build and test our learning method, and the POS and semantic tagging that we have associated with its words to be able to characterize the context of N-V qualia pairs. Section 4 explains the machine learning method that we have developed and in particular raises questions of expressiveness and efficiency. Section 5 is dedicated to its theoretical and empirical validation, when applied to our technical corpus, and ends with a discussion about the linguistic relevance of the generalized clauses that we have learnt in order to explain the concept of qualia pairs. 2. The Generative Lexicon and Information Retrieval In this section, we first describe the structure of a lexical entry in the GL formalism. We then argue for the use of N-V index expansion based on GL qualia structure in information retrieval. 2.1 Lexical Entries in the Generative Lexicon As mentioned above, lexical entries in GL consist of structured sets of typed predicates that define a word. Lexical representations can thus be considered as reserves of types on which different interpretative strategies operate; these representations are responsible for word meaning in context. This generative theory of the lexicon includes three levels of representation for a lexical entry: the argument structure (argstr), the event structure (eventstr), and the qualia structure (qs) as illustrated in Figure 1 for word W. W ARGSTR = EVENTSTR = QS = ARG1 = ... D-ARG1 = ... E1 = ... E2 = ... RESTR = temporal relation between events HEAD = relation of prominence W-lcp FORMAL = ... CONST = ... TELIC = ... AGENTIVE = ... Figure 1: Lexical entry in GL All the syntactic categories receive the same levels of description. Argument and event structures contain the arguments and the events that occur in the definitions of the words. These elements can be either necessarily expressed syntactically or not—in this last case, they are called default arguments (D - ARG) or default events (D - E). The qualia structure links these arguments and events and defines the way they take part in the semantics of the word. 495 C LAVEAU , S ÉBILLOT, FABRE AND B OUILLON In the qualia structure, the four semantic roles correspond to interpreted features that form the basic vocabulary for the lexical description of a word, and determine the structure of the information associated with it (that is, its lexical conceptual paradigm (lcp )). Their meanings have already been given in Section 1. Figure 2 presents the lexical representation of book as mentioned by Pustejovsky (1995), in which the item both appears as a physical object, and as an object that contains information (denoted by info.physobj-lcp ). book ARGSTR = EVENTSTR = QS = ARG1 = y : info ARG2 = x : physobj D-E1 = e1 D-E2 = e2 info.physobj-lcp FORMAL = contain(x, y) CONST = part-of(x.y, z : cover, pages, ...) 
TELIC = read(e1, w, x.y) AGENTIVE = write(e2 , v, x.y) Figure 2: Lexical representation of book This representation can be interpreted as: λx.y[book(x : physobj.y : info) ∧ contain(x, y) ∧ λwλe1 [read(e1 , w, x.y)] ∧ ∃e2 ∃v[write(e2 , v, x.y)]]. A network of relations is thus defined for each noun, for example book-write, book-read, bookcontain for book. These relations are not empirically defined but are linguistically motivated: they are the relations that are necessary to explain the semantic behaviour of the noun. These are the kinds of relations we want to use in information retrieval (IR) applications to expand index terms and deal with intercategorial semantic paraphrases of users’ requests. 2.2 N-V Qualia Relations for Information Retrieval Arguments for using GL theory to define N-V pairs relevant for index reformulation have already been reported by Bouillon et al. (2000b). We only point out here the main reasons for that option. Many authors agree on the fact that index reformulation must not be limited to N-N relations. For example, Grefenstette (1997) suggests the importance of syntagmatic N-V links to explicit and disambiguate nouns contained in short requests in an IR application. One way to semantically characterize research is to extract verbs that co-occur with it to know what research can do (research shows, research reveals, etc.), or what is done for research (do research, support research, etc.). Our work within GL framework is a way to systemize such a proposition. From the theoretical point of view, GL is a theory of words in context: it defines under-specified lexical representations that acquire their specifications within corpora. For example (see Figure 2), book in a given corpus can receive the agentive predicate publish, the telic predicate teach, etc. Those representations can be considered as a way to structure information in a corpus and, in that sense, the relations that are defined in GL are privileged information for IR. Moreover, in this perspective, GL has been preferred to existing lexical resources such as WordNet (Fellbaum, 1998) for two main reasons: the lexical relations we want to exhibit—namely N-V links—are unavailable 496 L EARNING S EMANTIC L EXICONS U SING ILP in WordNet, which focuses on paradigmatic lexical relations; WordNet is a domain-independent, static resource, which, as such, cannot be used to describe lexical associations in specific texts, considering the great variability of semantic associations from one domain to another (Voorhees, 1994; Smeaton, 1999). Concerning practical issues, the validity of using GL theory to define N-V couples relevant for reformulation has already been partly tested. First, Fabre (1996) has shown that N-V qualia pairs can be used to calculate the semantic representations of binominal sequences (NN compounds in English and N preposition N sequences in French), and thus offer extended possibilities for reformulations of compound index terms. Fabre and Sébillot (1999) have then used those relations in an experiment conducted on a French telematic service system. They have shown that the context of binominal sequences can be used to disambiguate nouns, provided that syntagmatic links exist or are developed within the thesaurus of the retrieval system, and that these syntagmatic relations can be used to discover semantic paraphrase links between a user’s question and the texts of an indexed database. 
A second test has also been carried out in the documentation service of a Belgian bank (Vandenbroucke, 2000). Its documentalists traditionally use boolean questions with nominal terms. They were asked to evaluate the relevance of proposed qualia verbs associated with nouns of their questions to specify their requests or to access documents they had not thought of. Those first results were quite promising. However, in order to be able to make the most of N-V qualia pairs and deeply evaluate their relevance for information retrieval applications, a method to automatically acquire these pairs from a corpus is necessary. Our goal is thus to learn GL-based semantic lexicons from corpora (more precisely N-V qualia pairs). Before describing the learning method we have developed to achieve this goal, we first present the corpus we have used, and the information we have associated with its words to be able to characterize the context of N-V qualia pairs. 3. The MATRA-CCR Corpus and its Tagging In this section, the technical corpus that we have used to learn semantic lexicons based on GL principles is described. This corpus has first undergone part-of-speech (POS) tagging which aims at providing each word of the text with an unambiguous categorial tag (singular common noun, infinitive verb, etc.); categorial tagging is presented in Section 3.2. Secondly, in order to permit learning of what distinguishes qualia pairs from non-qualia ones that appear in exactly the same syntactic structures, semantic tags have been added. For example in structures like Verbinf det N1 prep N2, the pair N2 Verbinf is sometimes non-qualia (for example (corrosion, vérifier ) (corrosion, check) in vérifier l’absence de corrosion (check the absence of corrosion)) but sometimes qualia (for example (réservoir, vider ) (tank, empty) in vider le fond du réservoir (empty the bottom of the tank)) when N1 indicates for example a part of an object. A simple POS-tagging of those two sentences does not display any difference between them. Section 3.3 is dedicated to the description of the semantic tagging of the corpus, that is to the addition of tags unambiguously describing the semantic class of each of its words. 3.1 The MATRA-CCR Corpus The French corpus used in this project is a 700 KBytes handbook of helicopter maintenance, provided by MATRA - CCR Aérospatiale, which contains more than 104,000 word occurrences. The MATRA - CCR corpus has some special characteristics that are especially well suited for our task: it 497 C LAVEAU , S ÉBILLOT, FABRE AND B OUILLON is coherent, that is, its vocabulary and syntactic structures are homogeneous; it contains many concrete terms (screw, door, etc.) that are frequently used in sentences together with verbs indicating their telic (screws must be tightened, etc.) or agentive roles (execute a setting, etc.). 3.2 Part-of-Speech Tagging This corpus has been POS-tagged with the help of annotation tools developed in the MULTEXT project (Ide and Véronis, 1994; Armstrong, 1996); sentences and words are first segmented with MT S EG ; words are analyzed and lemmatized with MMORPH (Petitpierre and Russell, 1998; Bouillon et al., 1998), and finally disambiguated by the TATOO tool, a hidden Markov model tagger (Armstrong et al., 1995). Each word therefore only receives one POS tag which indicates its morphosyntactic category (and its gender, number, etc.) with high precision: less than 2% of errors have been detected when compared to a manually tagged 4,000-word test-sample of the corpus. 
Those POS tags are one of the elements used by our learning method to characterize the context in which qualia pairs can be found.

3.3 Semantic Tagging

The semantic tagging is done on the already POS-tagged MATRA - CCR corpus; we therefore benefit from the disambiguation of polyfunctional words (that is, words that have different syntactic categories, such as règle in French, which can be an indicative form of the verb to regulate and the common noun rule) (Wilks and Stevenson, 1996). We have first built the semantic classification that we used as the tagset for the semantic tagging. This tagging process is then carried out with the help of the same probabilistic tagger as for POS-tagging and, as shown below, the majority of the semantic ambiguities are solved. More precisely, a lexicon containing every word (the lexicon entries) of the MATRA - CCR corpus is created; it associates with each word all its possible semantic tags. The most relevant tagset for each category must be chosen. We only describe here the semantic classification of the main POS categories of the MATRA - CCR corpus, and we give the results of its semantic tagging using the hidden Markov model tagger. A more detailed presentation can be found in (Bouillon et al., 2001).

WordNet's (Fellbaum, 1998) most generic classes have initially been selected to systematically classify the nouns. However, classes that are irrelevant for our corpus have been withdrawn and, for large classes, a more precise granularity has been chosen (for example, the class artefact has been split into more precise categories). This has led to 33 classes, hierarchically organized as shown in Figure 3.

[Figure 3: Hierarchy of classes for the semantic tagging of common nouns. The 33 semantic tags are: form (frm), attribute (atr), property (pty), time unit (tme), unit (unt), measure (mea), definite quantity (qud), relation (rel), communication (com), natural event (hap), human activity (acy), act (act), phenomenon (phm), group (grp), process (pro), social group (grs), psychological feature (psy), state (sta), body part (prt), entity (ent), causal agent (agt), human (hum), artefact (art), object (pho), instrument (ins), part (por), substance (sub), chemical compound (chm), stuff (stu), location (loc), point (pnt), position (pos) and container (cnt). WordNet classes that are not used for tagging (such as abstraction, event or social relation) only appear as internal nodes of the hierarchy.]

Only 8.7% of the entries of the common noun lexicon are ambiguous. Most of those ambiguities correspond to complementary polysemy (for example, enfoncement can indicate both a process (pushing in) and its result (hollow); it is therefore classified as both pro and sta). Concerning verbs, the WordNet classification was judged too specific, and a minimal partition into 7 classes has been selected; only 7 verbs (among about 570) are ambiguous. Adjectives, prepositions, etc. have also been classified and have led to the creation of lexicons in which very few entries are ambiguous.

Those various lexicons are then used to carry out the semantic tagging of the POS-tagged MATRA - CCR corpus by projecting the semantic tags on the corresponding words. Ambiguities are solved with the help of the probabilistic tagger, following principles described in (Bouillon et al., 2000a). A 6,000-word sample of the corpus has been chosen to evaluate the semantic tagging precision. It contains 7.78% of ambiguous words; 85% of them have been correctly disambiguated (1.18% of semantic tagging errors).
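As a minimal sketch of what these semantic lexicons may look like (one entry per word with the list of its candidate tags), the encoding below is our own assumption; the predicate name semantic_tags/2 is hypothetical, and only the enfoncement example and the tag values come from the text above.

  semantic_tags(enfoncement, [pro, sta]).   % complementary polysemy: a process or its resulting state
  semantic_tags(vis,         [art]).        % hypothetical unambiguous entry: vis (screw) as an artefact
  % Projecting these tags on the POS-tagged corpus leaves the HMM tagger to pick
  % one tag per occurrence whenever the list contains more than one element.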
All those POS and semantic tags in the MATRA - CCR corpus are the contextual key information used by the learning method that we have developed in order to automatically extract N-V qualia pairs. The next section explains its realization. 4. The Machine Learning Method We aim at learning a special kind of semantic relations from our POS and semantically tagged MATRA - CCR corpus, that is, verbs playing a specific role in the semantic representation of common nouns, as defined in the qualia structure in GL formalism. Trying to infer lexical semantic information from corpora is not new: a lot of work has already been conducted on this subject, especially in the statistical learning domain. See Grefenstette (1994b), for example, or Habert et al. (1997) and Pichon and Sébillot (1997) for surveys of this field. Following Harris’s framework (Harris et al., 1989), such research tries to extract both syntagmatic and paradigmatic information, respectively studying the words that appear in the same window-based or syntactic contexts as a considered lexical unit (first order word affinities Grefenstette, 1994a), or the words that generate the same contexts as the key word (second order word affinities). For example, Briscoe and Carroll (1997) and Faure and Nédellec (1999) try to automatically learn verbal argument structures and selectional restrictions; Agarwal (1995) and Bouaud et al. (1997) build semantic classes; Hearst (1992) and Morin (1999) focus on particular lexical relations, like hyperonymy. Some of this research is concerned with automatically obtaining more complete lexical semantic representations (Grefenstette, 1994b; Pichon and Sébillot, 2000). Among these studies, mention must be made of the research described by Pustejovsky et al. (1993) which gives some principles for acquiring GL qualia structures from a corpus; this work is however quite different from ours because it is based upon the assumption that the extraction of the qualia structure of a noun can be performed by spotting a set of syntactic structures related to qualia roles; we propose to go one step further as we have no a priori assumptions concerning the structures that are likely to convey these roles in a given corpus. In order to automatically acquire N-V pairs whose elements are linked by one of the semantic relations defined in the qualia structure in GL, we have decided to use a symbolic machine learning method. Moreover, symbolic learning has led to several studies on the automatic acquisition of semantic lexical elements from corpora (Wermter et al., 1996) during the last few years. This section is devoted to the explanation of our choice and to the description of the method that we have developed. Our selection of a learning method is guided by the fact that this method must not only provide a predictor (this N-V pair is relevant, this one is not) but also infer general rules able to explain the examples, that is, bring linguistically interpretable elements about the predicted qualia relations. This essential explanatory characteristic has motivated our choice of the inductive logic programming (ILP) framework (Muggleton and De Raedt, 1994) in which programs, that are inferred from a set of facts and a background knowledge, are logic programs, that is, sets of Horn clauses. Contrary to some statistical methods, it does not just give raw results but explains the concept that is learnt, that is, here, what characterizes a qualia pair (versus a non-qualia one). 
This choice is also especially justified by the fact that, up to now, linguists do not know what all the textual patterns that express qualia relations are; they cannot thus verbalize rules describing them. Therefore, ILP seems to be an appropriate option since its relational nature can provide a powerful expressiveness 500 L EARNING S EMANTIC L EXICONS U SING ILP for these linguistic patterns. Moreover, as linguistic theories provide no clues concerning elements that indicate qualia relations, ILP’s adaptable framework is particularly suitable for us. Lastly, the errors inherent in the automatic POS and semantic tagging process previously described make the choice of an error-tolerant learning method essential. The possibility of handling data noise in ILP guarantees this robustness. For our experiments, we provide a set of N-V pairs related by one of the qualia relations (positive example set, E + ) within a POS and semantic context (elements from sentences containing those NV pairs in the corpus), and a set of N-V pairs that are not semantically linked (negative example set, E − ). Generalizing rules from semantic and POS information about words that occur in the context of N-V qualia pairs in the corpus and from distances between N and V in the sentences from which examples are built is a particularly hard task. The difficulty is mainly due to the amount of information that has to be handled by the ILP algorithm. We must therefore focus on the efficiency of this learning step to be certain to obtain linguistically meaningful clauses in a relatively small amount of time. Most ILP systems provide a way to deal more or less with the problem of the form of the rules but only some of them enable a total control of this form and of the rule search efficiency. Moreover, the particular structure of our POS and semantic information makes it essential to use a system capable of processing relational background knowledge. For our project, we have thus chosen ALEPH1 , Srinivasan’s ILP implementation that has already been proven well suited to deal with a large amount of data in multiple domains (mutagenesis, drug structure. . . ) and permits complete and precise customization of all the settings of the learning task. For research use, ALEPH is also very attractive since it is entirely written in Prolog and thus allows the user to easily have a comprehensive view of the learning process, and in particular to write his/her own refinement operator to adequately perform rule search. However, this is certainly not the fastest choice: other ILP programs could be used that would perform in shorter time, but it would be to the detriment of a complete user control on the learning task. A few experiments have indeed been carried out with Quinlan’s FOIL; the computing time was better (about half of the ALEPH time, see Section 5.1), but some of the produced rules did not match the linguistically motivated form requirements we defined in Section 4.2. These results are certainly due to the greedy search strategy used by FOIL. In this section we first explain the construction of E + and E − for ALEPH. We then define the space in which the rules that we want to learn are searched for (that is, what the rules we learn are and how they are related to each other). We finally describe how we improve the efficiency of the search by pruning some irrelevant hypotheses. The clauses that are obtained and their evaluation are detailed in Section 5. 
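For concreteness, a run of this kind is typically set up in ALEPH with a handful of directives like the sketch below. The file stem qualia and all parameter values are illustrative (the paper does not report its exact settings); only the existence of read_all/1, set/2 and induce/0 comes from ALEPH's documentation.

  :- read_all(qualia).        % loads qualia.b (background), qualia.f (E+) and qualia.n (E-)
  :- set(clauselength, 8).    % bound on the number of literals in a hypothesis (see Section 4.3)
  :- set(minpos, 5).          % minimal number of positive examples a clause must cover
  :- set(noise, 10).          % number of covered negative examples tolerated per clause
  :- induce.                  % runs the covering procedure described below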
4.1 Example Construction Our first task consists in building up E + and E − for ALEPH, in order for it to infer generalized clauses that explain what, in the POS and semantic contexts of N-V pairs, distinguishes relevant pairs from non-relevant ones. Here is our methodology for their construction. First, every common noun in the MATRA - CCR corpus is considered. More precisely, we only deal with a 81,314 word occurrence subcorpus of the MATRA - CCR corpus, which is formed by all the sentences that contain at least one N and one V. This subcorpus contains 1,489 different N (29,633 noun occurrences) and 567 different V (9,522 verb occurrences). For each N, the 10 most strongly associated V, in terms of χ2 (a statistical correlation measure based upon the relative frequencies of 1. http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/aleph/aleph toc.html 501 C LAVEAU , S ÉBILLOT, FABRE AND B OUILLON words), are selected. This first step produces at the same time pairs whose components are correctly bound by one qualia role ((roue, gonfler) (wheel, inflate)) and pairs that are fully irrelevant ((roue, prescrire) (wheel, prescribe)). Each pair is manually annotated as relevant or irrelevant according to Pustejovsky’s qualia structure principles. A Perl program is then used to find the occurrences of these N-V pairs in the sentences of the corpus. For each occurrence of each pair that is supposed to be used to build one positive example (that is, pairs that have been globally annotated as relevant), a manual control has to be done to ensure that the N and the V really are in the expected relation within the studied sentence. After this control, a second Perl program automatically produces the positive example by adding a clause of the form is qualia(noun identifier,verb identifier). to the set E + . Information is also added to the background knowledge that describes each word of the sentence and the position in the sentence of the N-V pair. For example, for a five word long sentence whose word identifiers are w 1 ... w 5, and the N-V pair w 4-w 2, the following clauses are added: tags(w 1,POS-tag,semantic-tag). tags(w 2,POS-tag,semantic-tag). pred(w 2,w 1). tags(w 3,POS-tag,semantic-tag). pred(w 3,w 2). tags(w 4,POS-tag,semantic-tag). pred(w 4,w 3). tags(w 5,POS-tag,semantic-tag). pred(w 5,w 4). distances(w 4,w 2,distance in words,distance in verbs). where pred(x,y) indicates that word y occurs just before word x in the sentence, the predicate tags/3 gives the POS and semantic tags of a word, and distances/4 specifies the number of words and the number of verbs between N and V in the sentence (a negative distance indicates that N occurs before V, a positive one indicates that V occurs before N in the studied sentence; distances are shifted by one in order to distinguish a positive null distance from a negative null one). For example, the N-V qualia pair in boldface in the sentence “ L’installation se compose : de deux atterrisseurs protégés par des carénages, fixés et articulés. . . ” (the system is composed: of two landing devices protected by streamlined bodies, fixed and articulated. . . ) is transformed into is qualia(m11124 52,m11124 35). and tags(m11123 3 deb,tc vide,ts vide). tags(m11123 3,tc noun sg,ts pro). pred(m11123 3,m11123 3 deb). tags(m11123 16,tc pron,ts ppers). pred(m11123 16,m11123 3). tags(m11123 19,tc verb sg,ts posv). pred(m11123 19,m11123 16). tags(m11123 27,tc wpunct pf,ts ponct). pred(m11123 27,m11123 19). tags(m11124 1,tc prep,ts rde). pred(m11124 1,m11123 27). 
tags(m11124 4,tc num,ts quant). pred(m11124 4,m11124 1). tags(m11124 9,tc noun pl,ts art). pred(m11124 9,m11124 4). tags(m11124 35,tc verb adj,ts acp). 502 L EARNING S EMANTIC L EXICONS U SING ILP pred(m11124 35,m11124 9). tags(m11124 44,tc prep,ts rman). pred(m11124 44,m11124 35). tags(m11124 52,tc noun pl,ts art). pred(m11124 52,m11124 44). tags(m11124 62,tc wpunct,ts virg). pred(m11124 62,m11124 52). tags(m11125 1,tc verb adj,ts acp). pred(m11125 1,m11124 62). tags(m11125 7,tc conj coord,ts rconj). pred(m11125 7,m11125 1). tags(m11125 10,tc verb adj,ts acp). pred(m11125 10,m11125 7). ... distances(m11124 52,m11125 35,2,1). where the special tags tc vide and ts vide describe the empty word which is used to indicate the beginning and the end of the sentence. The negative examples are elaborated in the same way as the positive ones, with the same Perl program. They are automatically built from the above mentioned highly correlated N-V pairs that have been manually annotated as irrelevant, and from the occurrences in the corpus of potential relevant N-V pairs rejected during E + construction (see above). For example, the non-qualia pair in boldface in the following sentence: “Au montage : gonfler la roue à la pression prescrite, . . . ” (When assembling: inflate the wheel to the prescribed pressure, . . . ) is added to the set E − as is qualia(m7978 15,m7978 31). and the following clauses are stored into the background knowledge: tags(m7977 1 deb,tc vide,ts vide). tags(m7977 1,tc prep,ts ra). pred(m7977 1,m7977 1 deb). tags(m7977 3,tc noun sg,ts acy). pred(m7977 3,m7977 1). tags(m7977 11,tc wpunct pf,ts ponct). pred(m7977 11,m7977 3). tags(m7978 7,tc verb inf,ts acp). pred(m7978 7,m7977 11). tags(m7978 15,tc noun sg,ts ins). pred(m7978 15,m7978 7). tags(m7978 20,tc prep,ts ra). pred(m7978 20,m7978 9). tags(m7978 22,tc noun sg,ts phm). pred(m7978 22,m7978 20). tags(m7978 31,tc verb adj,ts acc). pred(m7978 31,m7978 22). tags(m7978 41,tc wpunc,ts virg). pred(m7978 41,m7978 31). ... distances(m7978 15,m7978 31,-3,-1). During this step, as shown in the encoding of the previous positive and negative examples, some categories of words are not taken into account: the determiners, and some adjectives, which are not considered as relevant to bring up information about context of qualia or non-qualia pairs. 503 C LAVEAU , S ÉBILLOT, FABRE AND B OUILLON 3,099 positive examples and about 3,176 negative ones are automatically produced this way from the MATRA - CCR corpus. A LEPH’s background knowledge is also provided with other information, that describes special relationships among POS and semantic tags. Those relationships encode, for example, the fact that a tag tc verb pl indicates a conjugated verb in the plural (conjugated plural), that can be considered as a conjugated verb (conjugated) or simply a verb (verb). Here is an example of those literals describing the words from a linguistic point of view: verb( W ) :- conjugated( W ). verb( W ) :- infinitive( W ). ... conjugated( W ) :- conjugated plural( W ). conjugated( W ) :- conjugated singular( W ). conjugated plural( W ) :- tagcat( W, tc verb pl ). ... The background knowledge file describing all these relations and all the predicate definitions is given in appendix A. All the datasets (examples and background knowledge) used for the experiments are available from the authors on demand. 
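The rules above leave tagcat/2 (and its semantic counterpart tagsem/2, both discussed further below) undefined. A plausible minimal link to the tags/3 facts of the examples is sketched here; the clause bodies are our assumption, and we write atoms with underscores (for example tc_verb_inf) where the extracted listings above show spaces.

  tagcat(W, T) :- tags(W, T, _).            % POS tag of word W
  tagsem(W, S) :- tags(W, _, S).            % semantic tag of word W
  % The leaves of the two hierarchies then bottom out on concrete tags, e.g.:
  infinitive(W) :- tagcat(W, tc_verb_inf).
  artefact(W)   :- tagsem(W, ts_art).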
Let us define some terms we use later in this paper: – “most general literals” are literals describing words that do not appear in the body of a clause in the background knowledge (for example, common noun/1, verb/1). Note that every word can be described by one and only one “most general literal”. – “POS literals” (resp. “semantic literals”) are literals describing the morpho-syntactic (semantic) aspect of a word (the most general literals are not considered as semantic or POS literals), – “most general POS literals” (resp. “most general semantic literals”) are POS (semantic) literals that appear in the body of most general literals (for example, infinitive/1, entity/1). – for two literals l1 and l2 such that a rule l1:-l2. exists in the background knowledge, l1 is called the immediate generalization of l2 and l2 is the immediate specialization of l1. The immediate generalizations of literals are unique with respect to the background knowledge. For any word W of our corpus, our background knowledge is such that all the literals describing W can be ordered in a tree whose particular structure is used in the learning process. Indeed, the root of the tree is the most general literal describing W and it has two branches, one for the POS literals and the other for the semantic literals. Any node (literal) of these two branches has only one upper node (its immediate generalization) and at most one lower (its immediate specialization if it exists). Other useful predicates are also stored in the background knowledge, for example tagcat/2 and tagsem/2, that are used as an interface between the examples and the POS and semantic literals, and the predicate suc/2 defined as suc(X,Y) :- pred(Y,X).; suc/2 is only used for reading convenience and is considered, especially in the hypothesis construction, as the equivalent of pred/2 (that is, is qualia(A,B) :- suc(A,B). and is qualia(A,B) :- pred(B,A). are considered as one unique hypothesis). Given E + , E − and the background knowledge B, ALEPH tries to deal with that large amount of information and discover rules that explain (most of) the positive examples and reject (most of) the negative ones. To infer those rules, it uses examples to generate and test various hypotheses, and keeps those that seem relevant regarding what we want to learn. To sum up, ALEPH algorithm follows a very simple procedure that can be described in 4 steps, as stated in ALEPH’s manual: 504 L EARNING S EMANTIC L EXICONS U SING ILP 1 select one example to be generalized. If none exist, stop; 2 build ⊥, that is, the most specific clause that explains the example; 3 search the space of solutions bounded below by ⊥ for the hypothesis that maximizes a score function. This is done with the help of a refinement operator ; 4 remove examples that are “covered” (“explained”) by the hypothesis that has been found. Return to step 1. The search of hypotheses (step 3) is the most complex task of this algorithm, and also the longest one. To improve the efficiency of the learning and control the expressiveness of the solutions, this search space must be characterized. 4.2 Hypothesis Search Lattice Many machine learning tasks can be considered as a search problem. In ILP, the hypothesis H that has to be learnt must satisfy: ∀e+ ∈ E + : B ∪ H |= e+ (completeness) ∀e− ∈ E − : B ∪ H 6|= e− (correctness) Such a hypothesis is searched for through the space of all Horn clauses to find the one that is complete and correct. 
Unfortunately, the tests required on the training data are costly and preclude an exhaustive search throughout the entire hypothesis space. Several kinds of biases are therefore used to limit that search space (see Nédellec et al., 1996). One of the most natural ones is the hypothesis language bias which defines syntactic constraints on the hypotheses to be found. This restriction on the search space considerably limits the number of potential solutions, prevents overfitting and ensures that only well-formed ones are obtained. For us, a well-formed hypothesis is defined as a clause that gives (semantic and/or POS) information about words (N, V or words occurring in their context) and/or information about respective positions of N and V in the sentence. For example is qualia(A,B) :- artefact(A), pred(B,C), suc(A,C), auxiliary(C).—which means that a N-V pair is qualia if N is an artefact, V is preceded by an auxiliary verb and N is followed by the same verb—is a well-formed hypothesis. We have therefore to indicate in ALEPH’s settings that the predicates artefact/1, pred/2, suc/2, auxiliary/1. . . can be used to construct a hypothesis. Another constraint on the hypothesis language is that there can be at most one item of POS information and one item of semantic information about a given word. This means that the hypothesis is qualia(A,B) :- pred(B,C), participle(C), past participle(C). is not considered legal since there are two items of POS information about the word represented by C. Conversely, the hypotheses is qualia(A,B) :- pred(B,C), participle(C), action verb(C). or is qualia(A,B) :- pred(B,C), past participle(C), physical action verb(C). or even is qualia(A,B) :- pred(B,C), suc(A,C). are well-formed with respect to our task. Redundant information on one word is indeed superfluous and useless since all our POS and semantic information is hierarchically organized: one of the literals is thus more specific than the others and describes the word in a precisely enough way; the other literals are therefore useless. In our example, there is no need to say that C is a participle (participle(C)) if it is known to be a past participle (past participle(C)). This superfluousness issue is managed by our refinement operator. Several other predicates, in particular those dealing with the distances between N and V and their relative positions, are used in the hypothesis language. More than 100 different predicates can thus occur in a hypothesis. 505 C LAVEAU , S ÉBILLOT, FABRE AND B OUILLON Even with this language bias, our learning search space remains huge. Fortunately, the hypotheses can be organized by a generality relation (with the help of a quasi-order on hypotheses) which permits the algorithm to run intelligently across the space of solutions. Several quasi-orderings have been studied in the ILP framework. Logical implication would ideally be the preferred generality relation, but undecidability results lead to its rejection (Nienhuys-Cheng and de Wolf, 1996). Another order, commonly used by ILP systems, is θ-subsumption (Plotkin, 1970), defined below. Definition 1 A clause C1 θ-subsumes a clause C2 (C1 θ C2 ) if and only if (iff) there is a substitution θ such that C1 θ ⊆ C2 (considering the clauses as sets of literals). This order is weaker than implication (C1 θ C2 ⇒ C1 |= C2 but reverse is not true) but allows an easier handling of the clauses. θ-subsumption remains however too strong for our application. 
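Before turning to why this order is still too permissive for our task, here is a small illustration of Definition 1 (ours, not taken from the paper), using predicates that appear later in this article:

  C1 = is qualia(A,B) :- verb(B).
  C2 = is qualia(A,B) :- verb(B), artefact(A).

Viewing clauses as sets of literals, C1θ ⊆ C2 for θ the identity substitution, so C1 θ-subsumes C2 (and hence C1 |= C2); conversely, C2 does not θ-subsume C1, since no substitution can map the literal artefact(A) of C2 into C1.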
Indeed, let us consider H1 ≡ is qualia(X1,Z1 ) :- suc(X1 ,Y1 ), pred(Z1,W1 ), verb(Y1 ), verb(W1 ). and H2 ≡ is qualia(X2,Z2 ) :- suc(X2 ,Y2 ), pred(Z2 ,Y2 ), verb(Y2 ).. Then, we have H1 θ H2 with θ = [X1 /X2 ,Y1 /Y2 , Z1 /Z2 ,W1 /Y2 ] and since in our application, variables represent words, this means that θ-subsumption allows to consider one word as two different ones in a clause, as this is the case with the word Y1 /W1 in H1 . This property is not considered as relevant for our learning task; we thus focus our attention on a coercive form of θ-subsumption: θ-subsumption under object identity (henceforth θOI -subsumption) (Esposito et al., 1996) defined below. Definition 2 (after Badea and Stanciu (1999)) A clause C1 θOI -subsumes a clause C2 (C1 OI C2 ) iff there is a substitution θ such that C1 θ ⊆ C2 and θ is injective (that is, θ does not unify variables of C1 ). θOI -subsumption is obviously weaker than θ-subsumption (C1 OI C2 ⇒ C1 θ C2 but reverse is false) but preserves the expected property H1 6OI H2 (with H1 and H2 as defined above). This is handled in ALEPH by generating hypotheses with sets of inequalities stating that variables with two different names cannot be unified. For example, H1 is internally represented in ALEPH by is qualia(X,Z):-suc(X,Y),pred(Z,W),verb(Y),verb(W),X6=Z,X6=Y,Z6=Y,X6=W,Y6=W,Z6=W. For reading convenience, in the remaining of this paper we do not write these sets of inequalities and we assume that two differently named variables are distinct. The notion of generality (we call it θNV -subsumption) that we use is derived from the θOI subsumption and adapted to fit the needs of our application. Indeed, θOI -subsumption, as defined above, does not totally capture the generality notion we want to use in our hypothesis space. First, we wish to take into account the hierarchical organization of our POS and semantic information, that is, we want our generality notion to make the most of the domain theory described in the background knowledge, following ideas developed in the generalized subsumption framework (Buntine, 1988). For example, we want the hypothesis is qualia(A,B) :- object(A). to be considered as more general than is qualia(A,B) :- artefact(A). which must itself be considered as more general than is qualia(A,B) :instrument(A). (see Figure 3). Moreover, we want to avoid clauses with no constraint set on a variable. For example, the hypothesis is qualia(A,B) :- infinitive(B), pred(A,C). could simply be expressed by is qualia(A,B) :- infinitive(B). since pred(A,C) does not bring any linguistically interesting information. However, is qualia(A,B) :suc(A,C), suc(C,D), object(D). is considered as well-formed since there is a semantic constraint on the 506 L EARNING S EMANTIC L EXICONS U SING ILP word D, and C is coerced by the two suc/2. This condition is very similar to the well-known linkedness: according to Helft (1987), a clause is said to be linked if all its variables are linked; a variable V is linked in a clause C if and only if V occurs in the head of C, or there is a literal l in C that contains the variables V and W (V 6= W ) and W is linked in C. It also corresponds to the connection constraint (Quinlan, 1990), i1-determinate clauses in the ij-determinacy framework (Muggleton and Feng, 1990) or chain-clause concept (Rieger, 1996), but in our case, every variable must not only be connected to head variables by a path of variables (with the help of pred/2 and suc/2), but besides, it must be “used” elsewhere in the hypothesis body. 
A hypothesis meeting all these conditions is said to be well-formed with respect to our learning task. Therefore, we say that, with respect to the background knowledge B, C ⪰NV D if there exist an injective substitution θ and a function fD such that fD(C)θ ⊆ D, where fD({l1, l2, ..., lm}) stands for {fD(l1), fD(l2), ..., fD(lm)} and fD is such that ∀l ∈ C, B, fD(l) |= l. Intuitively, this means that a clause D can be more specific than C if

1 – D has literals in addition to the literals of C;
2 – D contains literals on the same variables as C that are more specific with respect to the POS and semantic information hierarchy.

As for θ-subsumption and θOI-subsumption, θNV-subsumption induces a quasi-ordering upon the space of hypotheses with respect to our particular background knowledge and our definition of well-formed hypotheses, as stated by the three following results:

– C ⪰NV C (reflexivity)
– C1 ⪰NV C2 and C2 ⪰NV C1 ⇒ C1 and C2 are equivalent (written C1 ∼NV C2); in our case (as well as for θOI-subsumption) C1 ∼NV C2 means C1 = C2 up to variable renaming (antisymmetry)
– C1 ⪰NV C2 and C2 ⪰NV C3 ⇒ C1 ⪰NV C3 (transitivity)

Proof
1 - Reflexivity: trivial.
2 - Antisymmetry: C1 ⪰NV C2 and C2 ⪰NV C1, thus there exist f1, f2, θ1 and θ2 such that f1(C1)θ1 ⊆ C2 and f2(C2)θ2 ⊆ C1, with ∀l ∈ C1, B, f1(l) |= l and ∀l ∈ C2, B, f2(l) |= l. Therefore, ∀l ∈ C1, B, f2(f1(l)) |= f1(l) and thus ∀l ∈ C1, B, f2(f1(l)) |= l, with f2(f1(l)) ∈ C1. Since C1 is well-formed and given the structure of our background knowledge, we have ∀l ∈ C1, f2(f1(l)) = l and f1(l) = l; similarly, ∀l ∈ C2, f2(l) = l. This means that C1θ1 ⊆ C2 and C2θ2 ⊆ C1 and, since θ1 and θ2 are injective, C1 and C2 are only alphabetic variants.
3 - Transitivity: C1 ⪰NV C2 and C2 ⪰NV C3, thus there exist f1, f2, θ1 and θ2 such that f1(C1)θ1 ⊆ C2 and f2(C2)θ2 ⊆ C3. We have f2(f1(C1))θ1θ2 ⊆ C3; f2 ∘ f1 (the composition of f1 and f2) plays the role of fD and θ1θ2 is injective, therefore C1 ⪰NV C3.

Thanks to our example representation and the background knowledge used, all the literals that can occur in hypotheses are deterministic; such hypotheses are said to be determinate clauses. With these linked determinate clauses, the θNV-subsumption quasi-ordering implies that the hypothesis space is structured as a lattice (a detailed proof is given in Appendix B for θOI-subsumption and θNV-subsumption). At the top of this lattice, we find the most general clause (⊤) and, below, a most specific clause (called MSC or bottom, and henceforth written ⊥). In our case, ⊤ is the clause is qualia(A,B)., stating that all N-V pairs are qualia pairs, and ⊥ is a constant-free clause containing all the literals that can be found to describe the example to be generalized (see Muggleton, 1995, for details about ⊥ construction), minus superfluous literals (literals giving more general information about a word than other literals in ⊥).

[Figure 4: Hypothesis lattice for θNV-subsumption. The excerpt shown ranges from the top clause is qualia(A,B). through intermediate hypotheses such as is qualia(A,B) :- common noun(A)., is qualia(A,B) :- entity(A)., is qualia(A,B) :- singular common noun(A)., is qualia(A,B) :- object(A)., is qualia(A,B) :- pred(B,C), preposition(C)., is qualia(A,B) :- pred(B,C), goal preposition(C)., is qualia(A,B) :- object(A), singular common noun(A). and is qualia(A,B) :- artefact(A), singular common noun(A)., down to is qualia(A,B) :- object(A), singular common noun(A), pred(B,C), goal preposition(C)., with each edge labelled 1 or 2 according to the specialization condition it corresponds to.]

Figure 4 shows a simple example of our lattice; numbers on the edges refer to the first or the second condition of the definition of θNV-subsumption given above. The way the search is performed in this lattice is really important to find the best hypothesis (with respect to the chosen score function) in the shortest possible time. As our background knowledge has the structure of a forest (a set of trees) and the relation introducing variables (the sequence relation pred/suc) is determinate, it is quite easy to build a perfect refinement operator (Badea and Stanciu, 1999) allowing an effective traversal of this hypothesis space ordered by θNV-subsumption, using the methods described there. However, in order to save computation time, we avoid exploring parts of the hypothesis space (that is, refinements of hypotheses) if we know that there cannot be any good solution in those parts.

4.3 Pruning and Private Properties

Pruning the search is a delicate task and must be controlled so as not to "miss" a potential solution. The problem is that if a hypothesis violates some property P, one of its refinements can perhaps be correct with respect to P. Let us see how we manage pruning in our lattice with the guarantee of not leaving a valid solution out. Some properties, called private properties (Torre and Rouveirol, 1997a,b,c), allow safe pruning with respect to a given refinement operator. They enable us to avoid refining a given hypothesis that does not satisfy the expected properties without taking the risk of missing a solution, since no descendant of the hypothesis will satisfy those properties.

Definition 3 (from Torre and Rouveirol 1997c) A property P is said to be private with respect to the refinement operator ρ in the search space S iff:

∀H, H′ ∈ S : ∀(H′ ∈ ρ*(H) ∧ ¬P(H) ⇒ ¬P(H′))

where ¬X denotes the negation of X and ∀F, with F a formula, denotes the universal closure of F, which is the closed formula obtained by adding a universal quantifier for every variable having a free occurrence in F.

Let us examine a very simple and well-known private property (used as an example by Torre and Rouveirol, 1997c) that allows us to prune the search safely: the length of a clause. Formally, the property that bounds the length of a clause to k literals can be expressed as |H| ≤ k (|C| denotes the number of literals in clause C). This property is private with respect to the operator ρnv in the search space S iff ∀H, H′ ∈ S : ∀k ∈ ℕ : (H′ ∈ ρ*nv(H) ∧ |H| > k ⇒ |H′| > k). Our operator basically consists in adding literals (then H′ ∈ ρnv(H) ∧ |H′| > |H|) or in replacing a literal by a more specific one (then H′ ∈ ρnv(H) ∧ |H′| = |H|). The clause-length property is thus private with respect to ρnv and allows safe pruning as soon as a hypothesis has too many literals.

Several other private properties are used to prune the search in a safe way. We use, for example, the minimal number of positive examples to be covered: if a clause does not explain at least a given number of positive examples, this hypothesis is not considered as relevant. That property is obviously private with respect to ρnv since the numbers of covered positive and negative examples decrease through specialization.

In ILP systems, properties of the score function are often used to prune the search. This function permits us to decide which hypothesis is the best one for the learning task.
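Before turning to the score function itself, note that ALEPH makes this kind of safe pruning directly available to the user: a hypothesis for which the user-defined predicate prune/1 succeeds is, according to ALEPH's documentation, discarded from the search together with its refinements, which is exactly what a private property licenses. The sketch below enforces the clause-length property discussed above; max_body_literals/1 and body_literals/2 are helper names we introduce, and in practice the same bound is also available through ALEPH's clauselength parameter.

  max_body_literals(6).
  % Reject any hypothesis whose body has more than the allowed number of literals;
  % this is safe because our refinement operator never shortens a clause.
  prune((_Head :- Body)) :-
      max_body_literals(K),
      body_literals(Body, L),
      L > K.
  body_literals((_, Rest), N) :- !, body_literals(Rest, M), N is M + 1.
  body_literals(_, 1).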
The score function we have chosen is s(H) = (P − N, |H|), where P is the number of positive examples and N the number of negative examples covered by hypothesis H. H1 is said to be a better hypothesis than H2 (with s(H1) = (P1 − N1, |H1|) and s(H2) = (P2 − N2, |H2|)) iff P1 − N1 > P2 − N2, or P1 − N1 = P2 − N2 and |H1| < |H2|.

Unfortunately, since P − N is not monotonic, we cannot say anything in general about the score of the refinements of a given hypothesis H that satisfies a score criterion such as s(H) < k, where k can be the best score found so far in the search. This property would permit an optimal pruning, but since it is not private in our case, we cannot use it. The private property about this score function that we exploit to prune the search is weaker: sopt(H) ≥ Sbest, where Sbest is the greatest difference P − N found during the search, sopt(H) = Pcurrent − N⊥, Pcurrent is the number of positive examples covered by the current hypothesis, and N⊥ is the number of negative examples covered by ⊥ (evaluated at its construction time). This property is private since ∀H, H′ ∈ S : ∀Sbest ∈ ℕ : (H′ ∈ ρ*(H) ∧ sopt(H) < Sbest ⇒ sopt(H′) < Sbest), because P decreases through the search and N⊥ is constant.

All those (safe) prunings ensure finding the best solution in a minimal amount of time. Two kinds of output are produced by this learning process: some clauses that have not been generalized (that is, some of the positive examples), and a set of generalized clauses, called G hereafter, on which we shall focus our attention. Before using the clauses in G on the MATRA - CCR corpus to acquire N-V qualia pairs and automatically produce a GL-based semantic lexicon, we must validate our learning process in different ways and examine what kind of rules have been learnt.

5. Learning Validation and Results

This section is dedicated to three aspects of the validation of the machine learning method we have described. First we focus on the theoretical results of the learning, that is, we take an interest in the quality of G with respect to the training data (E+ and E−). The second step of the validation concerns its empirical aspect: we have applied the generalized clauses that have been inferred to the MATRA - CCR corpus and have evaluated the relevance of the decisions made on the classification of N-V pairs as qualia or not. The last step concerns the evaluation of the linguistic relevance of the learnt rules, that is, from a linguistic point of view, what information do we learn about the semantic and syntactic context in which qualia pairs appear?

5.1 Theoretical Validation

This first point concerns the determination of a learning quality measure with the chosen parameter setting. We are particularly interested in the proportion of positive examples that are covered by the generalized clauses and, if we accept some noise in the ALEPH parameter adjustment to allow more generalizations, in the proportion of negative examples that are rejected by those generalized clauses. The recall and precision rates of the learning method can be summed up using the Pearson coefficient, which is used to compare the results of different experiments:

Pearson = (TP ∗ TN − FP ∗ FN) / √(PrP ∗ PrN ∗ AP ∗ AN)

where A = actual, Pr = predicted, P = positive, N = negative, T = true and F = false; a value close to 1 indicates good learning.
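As a quick sanity check of this formula (our own arithmetic), one can plug in the confusion matrix reported later in Table 2 (Section 5.2), with TP = 62, FP = 40, FN = 4 and TN = 180, hence PrP = 102, PrN = 184, AP = 66 and AN = 220:

  Pearson = (62 ∗ 180 − 40 ∗ 4) / √(102 ∗ 184 ∗ 66 ∗ 220)
          = 11000 / √272511360
          ≈ 0.666

which matches the value given there.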
In order to obtain good approximations of the main characteristic numbers of the learning method, we perform a 10-fold cross-validation (Kohavi, 1995) on the initial sets of 3,099 positive examples and 3,176 negative ones. The set of examples (E+ and E−) is divided into ten subsets, each of which is in turn used as a testing set while the nine others are used as training set; ten learning processes are then performed with these training sets and evaluated on the corresponding testing sets. Table 1 summarizes the time, precision, recall and Pearson coefficient averages and standard deviations obtained through this 10-fold cross-validation (experiments were conducted on a 966 MHz PC running Linux).

                       Time (seconds)   Precision   Recall   Pearson
  Average                       10285       0.813    0.890     0.693
  Standard deviation             1440       0.028    0.024     0.047

Table 1: Cross-validation results

The entire set of examples is then used as training set by ALEPH; 9 generalized clauses (see Section 5.3) are found in less than 3 hours. We now try to estimate the performance of these rules by comparing their results on an unknown dataset with those obtained by 4 experts.

5.2 Empirical Validation

In order to evaluate the empirical validity of our learning method, we have applied the 9 generalized clauses to the MATRA - CCR corpus and have studied the appropriateness of their decisions concerning the classification of each pair as relevant or not. Since it is impossible to test all the N-V combinations found in the corpus, our evaluation has focused on 7 significant common nouns of the domain that were not used as examples: vis, écrou, porte, voyant, prise, capot, bouchon (screw, nut, door, indicator signal, plug, cowl, cap). The evaluation has been carried out in two steps. First, a Perl program retrieves all N-V pairs that appear in the same sentence in a part of the corpus and include one of the studied common nouns, and forwards them to 4 GL experts. The experts manually tag each pair as relevant or not; divergences are discussed until complete agreement is reached. In a second stage, this reference corpus is compared with the answers obtained for the N-V pairs of the same part of the corpus by the application of the clauses learnt with ALEPH. A pair is considered as tagged "relevant" by the clauses if at least one of them covers it. The results obtained for the seven selected common nouns are presented in Table 2.

  qualia pairs detected qualia                 62
  non-qualia pairs detected qualia             40
  qualia pairs detected non-qualia              4
  non-qualia pairs detected non-qualia        180
  Pearson                                   0.666

Table 2: Empirical validation on the MATRA - CCR corpus

These results are quite promising, especially if we compare them with those obtained by the χ2 correlation (see Table 3), which was the first step of our selection of N-V couples in the corpus (see Section 4.1).

  qualia pairs detected qualia                 33
  non-qualia pairs detected qualia             35
  qualia pairs detected non-qualia             33
  non-qualia pairs detected non-qualia        185
  Pearson                                   0.337

Table 3: χ2 results on the MATRA - CCR corpus

On one side, our ILP method detects most of the qualia N-V couples, like porte-ouvrir (door-open) or voyant-signaler (warning light-warn).
The four non-detected pairs appear in very rare constructions in our corpus, like prise-relier (plug-connect) in la citerne est reliée à l'appareil par des prises (the tank is connected to the machine by plugs), where a prepositional phrase (PP), à l'appareil (to the machine), is inserted between the verb and the par-PP (by-PP). On the other hand, only 8 of the 40 non-qualia pairs detected as qualia by our learning method cannot be linked syntactically. This means that the ILP algorithm can already reliably distinguish between pairs that are syntactically linked and pairs that are not. In comparison, 25 of the 35 non-qualia pairs detected as qualia by the χ2 are not even syntactically related.

The main problem for the ILP algorithm is therefore to correctly identify N-V pairs related by a telic or agentive relation—the most common qualia links in our corpus—among the pairs that could be syntactically related. But here we should carefully distinguish two types of errors. The first ones are caused by constructions that are ambiguous, in which the N and the V may or may not be syntactically related, as enlever-prises (remove-plugs) in enlever les shunts sur les prises (remove the shunts from the plugs). They cannot be disambiguated by superficial clues about the context in which the V and the N occur, and they show the limitation of using a tagged corpus for the learning process. However, they are very rare in our corpus (8 pairs). By contrast, the remaining errors seem more related to the parameterization of the learning method. For example, taking into consideration the number of nouns between the V and the N could prevent many wrong pairs like poser-capot (put up-cover) in poser les obturateurs capots (put up cover stopcocks) or assurer-voyant (make sure-warning light) in s'assurer de l'allumage du voyant (make sure that the warning light is switched on).

The empirical validation can therefore be considered positive, and we can now focus on the last step of the evaluation, which consists in assessing the linguistic validity of the generalized clauses.

5.3 Linguistic Validation

For the linguist, the issue is not only to find good examples of qualia relations but also to identify in texts the linguistic patterns that are used to express them. Consequently, the question is: what do these clauses tell us about the linguistic structures that are likely to convey qualia relations between a noun and a verb? We know from previous research (Morin, 1999) dealing with other types of semantic relations that a given relation can be instantiated by a wide variety of linguistic patterns, and that this set of structures may vary greatly from one corpus to another. Such research generally focuses on hyperonymy (is-a) and meronymy (part-of) relations, which provide the basic structure of ontologies. Our aim is thus similar, with the additional difficulty that some of the relations we focus on—such as the telic or agentive ones—have never been extensively studied on corpora, and are more difficult to identify than more conventional semantic relations. Previous research concerned with the acquisition of elements of GL (Pustejovsky et al., 1993) has looked at some solutions for identifying words linked by prespecified syntactic relations in texts, such as object relations between verbs and nouns, or certain types of N-N compounds.
This research has not been deeply evaluated and is, in any case, quite different from ours: contrary to this approach, we make no a priori assumptions about the kind of structures in which telic, agentive or formal N-V pairs may be found. We are thus faced with a set of nine clauses that we now try to interpret in terms of linguistic rules:

(1) is_qualia(A,B) :- precedes(B,A), near_verb(A,B), infinitive(B), action_verb(B).
(2) is_qualia(A,B) :- contiguous(A,B).
(3) is_qualia(A,B) :- precedes(B,A), near_word(A,B), near_verb(A,B), suc(B,C), preposition(C).
(4) is_qualia(A,B) :- near_word(A,B), pred(A,C), void(C).
(5) is_qualia(A,B) :- precedes(B,A), suc(B,C), pred(A,D), punctuation(D), singular_common_noun(A), colon(C).
(6) is_qualia(A,B) :- near_word(A,B), suc(B,C), suc(C,D), action_verb(D).
(7) is_qualia(A,B) :- precedes(A,B), near_word(A,B), pred(A,C), punctuation(C).
(8) is_qualia(A,B) :- near_verb(A,B), pred(B,C), pred(C,D), pred(D,E), preposition(E), pred(A,F), void(F).
(9) is_qualia(A,B) :- precedes(A,B), near_verb(A,B), pred(A,C), subordinating_conjunction(C).

where near_word(X,Y) means that X and Y are separated by at least one word and at most two words, and near_verb(X,Y) that there is no verb between X and Y.

What is most striking is the fact that, at this level of generalization, few linguistic features are retained. Previous learning on the same corpus, using PROGOL, no semantic tagging and poorer contextual information (Sébillot et al., 2000), had led to less generalized rules containing more linguistic elements; these rules were, however, less relevant for acquiring correct qualia pairs. The 9 clauses learnt here seem to provide very general indications and tell us very little about the verb types (action verb is the only information we get), nouns (common noun) or prepositions that are likely to fit into such structures. But the clauses contain other information, related to several aspects of linguistic description:

- proximity: this is a major criterion. Most clauses indicate that the noun and the verb must be either contiguous (Clause 2) or separated by at most one element (Clauses 3, 4, 6 and 7), and that no verb must appear between N and V (Clauses 1, 3, 8 and 9).
- position: Clauses 4 and 7 indicate that one of the two elements is found at the beginning of a sentence or right after a punctuation mark, whereas the relative position of N and V (precedes/2) is given in Clauses 1, 3, 5, 7 and 9.
- punctuation: punctuation marks, more specifically colons, are mentioned in Clauses 5 and 7.
- morpho-syntactic categorization: the first clause detects a very important structure in the text, corresponding to action verbs in the infinitive form.

These features shed light on linguistic patterns that are very specific to the corpus, a text falling within the instructional genre. We find in this text many examples in which a verb in the infinitive form occurs at the beginning of a clause and is followed by a noun phrase. Such lists of instructions are very typical of the corpus:

- débrancher la prise (disconnect the plug)
- enclencher le disjoncteur (engage the circuit breaker)
- déposer les obturateurs (remove the stopcocks)

To further evaluate these findings, we have compared the results of the learning process with linguistic observations obtained manually on the same corpus (Galy, 2000).
Galy has listed a set of canonical verbal structures that convey telic information:

- infinitive verb + det + noun (visser le bouchon) (to tighten the cap)
- verb + det + noun (ferment le circuit) (close the circuit)
- noun + past participle (bouchon maintenu) (held cap)
- noun + be + past participle (circuits sont raccordés) (circuits are connected)
- noun + verb (un bouchon obture) (a cap blocks up)
- be + past participle + par + det + noun (sont obturées par les bouchons) (are blocked up by caps)

The two types of results show some overlap: both experiments demonstrate the significance of infinitive structures and highlight patterns in which the verb and the noun are very close to each other. Yet the results are quite different, since the learning method proposes a generalization of the structures discovered by Galy. In particular, the opposition between passive and active constructions is merged in Clause 2 by the indication of mere contiguity (V can occur before or after N). Conversely, some clues, like punctuation marks and position in the sentence, had not been observed by the manual analysis because they are related to levels of linguistic information that are usually neglected by linguistic observation, even though they are known to be good pattern markers (Jones, 1994). Consequently, when we look at the results of the learning process from a linguistic point of view, it appears that the clauses give very general surface clues about the structures that are favored in the corpus for the expression of qualia relations. Yet these clues are sufficient to give access to some corpus-specific patterns, which is a very interesting result.

6. Conclusions and Future Work

The acquisition method for N-V qualia pairs—as defined in Pustejovsky's generative lexicon formalism—that we have developed leads to very promising results. Concerning the ILP learning system itself, we have defined and made the most of a well-suited generality notion extending object identity subsumption, which has led to obtaining only well-formed hypotheses that can be linguistically interpreted. The speed of the learning step is improved by safely pruning the search for the best rules under certain conditions expressed as private properties. The rules that are learnt lead to very good results for the N-V qualia pair acquisition task: 94% of all relevant pairs are detected for seven significant common nouns; these results have to be compared with the 50% obtained with the χ2 filter. Moreover, from a practical point of view, the linguistic validation of the inferred rules confirms the ability of our method to help a linguist detect linguistic patterns dedicated to the expression of qualia roles.

One next step of our research will consist in repeating the experiment on new textual data, in order to see what types of specific structures are detected in a less technical corpus; we will also focus on N-N pairs, which very frequently exhibit telic relations in texts (as in bouchon de protection, protective cap). Another potential avenue is to learn each qualia semantic relation (telic, agentive, formal) separately instead of all together, as has been done up to now. Even if such a distinction may not be useful for an information retrieval application, it could result in linguistically interesting rules. Other future studies should also be undertaken to improve the portability of the full method.
In particular, the semantic tagging of a corpus needs an expert's supervision to build the semantic classification of all the words. Even if the determination of the relevant classes for one domain can be partly automated (see Agarwal, 1995; Grefenstette, 1994b, for example), it still remains too costly to be carried out on any new corpus. The last phase of the project will deal with the real use of the N-V (and possibly N-N) pairs obtained with the machine learning method within an information retrieval system (such as a textual search engine), and with the evaluation of the resulting improvement of its performance, both from a theoretical (recall and precision rates) and an empirical (with the help of real human users) point of view.

Acknowledgments

The authors wish to thank Céline Rouveirol for helpful discussions and for insightful comments on an earlier version of this paper. They would also like to thank James Cussens and the anonymous reviewers for their excellent advice.

Appendix A. Background Knowledge

Here is the listing of the part of the background knowledge describing the linguistic relations, as used in the experiments described in Section 5.

%%%%%%%%%%%%%%%%%%%%%%%%%
% background knowledge %

% common noun %%%%%%%%%%%%%
common_noun(W) :- plural_common_noun(W).
common_noun(W) :- singular_common_noun(W).
common_noun(W) :- abstraction(W).
common_noun(W) :- event(W).
common_noun(W) :- group(W).
common_noun(W) :- psychological_feature(W).
common_noun(W) :- state(W).
common_noun(W) :- entity(W).
common_noun(W) :- location(W).
plural_common_noun(W) :- tagcat(W, tc_noun_pl).
singular_common_noun(W) :- tagcat(W, tc_noun_sg).
abstraction(W) :- attribute(W).
abstraction(W) :- measure(W).
abstraction(W) :- relation(W).
event(W) :- natural_event(W).
event(W) :- act(W).
event(W) :- phenomenon(W).
natural_event(W) :- tagsem(W, ts_hap).
phenomenon(W) :- tagsem(W, ts_phm).
phenomenon(W) :- process(W).
process(W) :- tagsem(W, ts_pro).
act(W) :- tagsem(W, ts_act).
act(W) :- human_activity(W).
human_activity(W) :- tagsem(W, ts_acy).
group(W) :- tagsem(W, ts_grp).
group(W) :- social_group(W).
social_group(W) :- tagsem(W, ts_grs).
psychological_feature(W) :- tagsem(W, ts_psy).
state(W) :- tagsem(W, ts_sta).
entity(W) :- tagsem(W, ts_ent).
entity(W) :- body_part(W).
entity(W) :- causal_agent(W).
entity(W) :- object(W).
body_part(W) :- tagsem(W, ts_prt).
object(W) :- tagsem(W, ts_pho).
object(W) :- artefact(W).
object(W) :- part(W).
object(W) :- substance(W).
part(W) :- tagsem(W, ts_por).
location(W) :- tagsem(W, ts_loc).
location(W) :- point(W).
point(W) :- tagsem(W, ts_pnt).
point(W) :- position(W).
position(W) :- tagsem(W, ts_pos).
attribute(W) :- tagsem(W, ts_atr).
attribute(W) :- form(W).
attribute(W) :- property(W).
form(W) :- tagsem(W, ts_frm).
property(W) :- tagsem(W, ts_pty).
measure(W) :- tagsem(W, ts_mea).
measure(W) :- definite_quantity(W).
measure(W) :- unit(W).
time_unit(W) :- tagsem(W, ts_tme).
definite_quantity(W) :- tagsem(W, ts_qud).
unit(W) :- tagsem(W, ts_unt).
unit(W) :- time_unit(W).
relation(W) :- tagsem(W, ts_rel).
relation(W) :- communication(W).
communication(W) :- tagsem(W, ts_com).
causal_agent(W) :- tagsem(W, ts_agt).
causal_agent(W) :- human(W).
human(W) :- tagsem(W, ts_hum).
artefact(W) :- tagsem(W, ts_art).
artefact(W) :- instrument(W).
instrument(W) :- tagsem(W, ts_ins).
instrument(W) :- container(W).
container(W) :- tagsem(W, ts_cnt).
substance(W) :- tagsem(W, ts_sub).
substance(W) :- chemical_compound(W).
substance(W) :- stuff(W).
chemical_compound(W) :- tagsem(W, ts_chm).
stuff(W) :- tagsem(W, ts_stu).

% verb %%%%%%%%%%%%%%%%%
verb(W) :- infinitive(W).
verb(W) :- participle(W).
verb(W) :- conjugated(W).
verb(W) :- action_verb(W).
verb(W) :- state_verb(W).
verb(W) :- modal_verb(W).
verb(W) :- temporality_verb(W).
verb(W) :- possesion_verb(W).
verb(W) :- auxiliary(W).
infinitive(W) :- tagcat(W, tc_verb_inf).
participle(W) :- present_participle(W).
participle(W) :- past_participle(W).
present_participle(W) :- tagcat(W, tc_verb_prp).
past_participle(W) :- tagcat(W, tc_verb_pap).
conjugated(W) :- conjugated_plural(W).
conjugated(W) :- conjugated_singular(W).
conjugated_plural(W) :- tagcat(W, tc_verb_pl).
conjugated_singular(W) :- tagcat(W, tc_verb_sg).
action_verb(W) :- cognitive_action_verb(W).
action_verb(W) :- physical_action_verb(W).
cognitive_action_verb(W) :- tagsem(W, ts_acc).
physical_action_verb(W) :- tagsem(W, ts_acp).
state_verb(W) :- tagsem(W, ts_eta).
modal_verb(W) :- tagsem(W, ts_mod).
temporality_verb(W) :- tagsem(W, ts_tem).
possesion_verb(W) :- tagsem(W, ts_posv).
auxiliary(W) :- tagsem(W, ts_aux).

% preposition %%%%%%%%
preposition(W) :- tagcat(W, tc_prep).
preposition(W) :- spat_preposition(W).
preposition(W) :- goal_preposition(W).
preposition(W) :- temp_preposition(W).
preposition(W) :- manner_preposition(W).
preposition(W) :- rel_preposition(W).
preposition(W) :- caus_preposition(W).
preposition(W) :- neg_preposition(W).
preposition(W) :- en_preposition(W).
preposition(W) :- sous_preposition(W).
preposition(W) :- a_preposition(W).
preposition(W) :- de_preposition(W).
spat_preposition(W) :- tagsem(W, ts_rspat).
goal_preposition(W) :- tagsem(W, ts_rpour).
temp_preposition(W) :- tagsem(W, ts_rtemp).
manner_preposition(W) :- tagsem(W, ts_rman).
rel_preposition(W) :- tagsem(W, ts_rrel).
caus_preposition(W) :- tagsem(W, ts_rcaus).
neg_preposition(W) :- tagsem(W, ts_rneg).
en_preposition(W) :- tagsem(W, ts_ren).
sous_preposition(W) :- tagsem(W, ts_rsous).
a_preposition(W) :- tagsem(W, ts_ra).
de_preposition(W) :- tagsem(W, ts_rde).

% adjective %%%%%%%%%%%
adjective(W) :- singular_adjective(W).
adjective(W) :- plural_adjective(W).
adjective(W) :- verbal_adjective(W).
adjective(W) :- comparison_adjective(W).
adjective(W) :- concrete_prop_adjective(W).
adjective(W) :- abstract_prop_adjective(W).
adjective(W) :- nominal_adjective(W).
singular_adjective(W) :- tagcat(W, tc_adj_sg).
plural_adjective(W) :- tagcat(W, tc_adj_pl).
verbal_adjective(W) :- tagcat(W, tc_verb_adj).
comparison_adjective(W) :- tagsem(W, ts_acomp).
concrete_prop_adjective(W) :- tagsem(W, ts_apty).
abstract_prop_adjective(W) :- tagsem(W, ts_apa).
nominal_adjective(W) :- tagsem(W, ts_anom).

% pronoun %%%%%%%%%%%%%
pronoun(W) :- rel_pronoun(W).
pronoun(W) :- non_rel_pronoun(W).
pronoun(W) :- tagsem(W, ts_pron).
rel_pronoun(W) :- tagcat(W, tc_pron_rel).
non_rel_pronoun(W) :- tagcat(W, tc_pron).

% others %%%%%%%%%%%%%
proper_noun(W) :- tagsem(W, ts_nompropre).
proper_noun(W) :- tagsem(W, ts_numero).
coordinating_conjunction(W) :- tagsem(W, ts_rconj).
subordinating_conjunction(W) :- tagsem(W, ts_subconj).
bracket(W) :- tagsem(W, ts_paro).
bracket(W) :- tagsem(W, ts_parf).
punctuation(W) :- comma(W).
punctuation(W) :- colon(W).
punctuation(W) :- dot(W).
punctuation(W) :- tagcat(W, tc_wpunct).
comma(W) :- tagsem(W, ts_virg).
colon(W) :- tagsem(W, ts_ponct).
dot(W) :- tagsem(W, ts_punct).
void(W) :- tagcat(W, tc_vide).
figures(W) :- tagsem(W, ts_quant).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% order %
precedes(V,N) :- distances(N,V,X,_), 0 < X.
precedes(N,V) :- distances(N,V,X,_), 0 > X.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% distances in verbs %
near_verb(N,V) :- distances(N,V,_,1).
near_verb(N,V) :- distances(N,V,_,-1).
far_verb(N,V) :- distances(N,V,_,X), -1 > X, -3 < X.
far_verb(N,V) :- distances(N,V,_,X), 1 < X, X < 3.
very_far_verb(N,V) :- distances(N,V,_,X), -2 > X.
very_far_verb(N,V) :- distances(N,V,_,X), X > 2.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% distances in words %
contiguous(N,V) :- distances(N,V,1,_).
contiguous(N,V) :- distances(N,V,-1,_).
near_word(N,V) :- distances(N,V,X,_), -1 > X, -4 < X.
near_word(N,V) :- distances(N,V,X,_), 1 < X, X < 4.
far_word(N,V) :- distances(N,V,X,_), -3 > X, -8 < X.
far_word(N,V) :- distances(N,V,X,_), X > 3, X < 8.
very_far_word(N,V) :- distances(N,V,X,_), -7 > X.
very_far_word(N,V) :- distances(N,V,X,_), X > 7.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% other predicates
suc(X,Y) :- pred(Y,X).
tagcat(Word, POStag) :- tags(Word, POStag, _).
tagsem(Word, Semtag) :- tags(Word, _, Semtag).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% information about examples
tags(m15278_1_deb, tc_vide, ts_vide).
tags(m15278_1, tc_verb_inf, ts_tem).
pred(m15278_1, m15278_1_deb).
...

Appendix B. Hypothesis Search Space

A clause space ordered by θOI-subsumption (see Definition 2, page 506) is in general not a lattice, whereas this is the case under θ-subsumption (Semeraro et al., 1994). However, we show in the first section of this appendix that such a clause space can be a lattice when particular assumptions are made concerning the clauses that it contains. A similar proof for our application framework, that is, for the hypothesis search space presented in Section 4.2 with the θNV-subsumption quasi-ordering, is proposed in the second section.

B.1 Hypothesis Lattice under θOI-subsumption

A quasi-ordered set under θOI-subsumption is in general not a lattice, since the infimum and supremum are generally not unique in such sets. However, let us consider determinate linked clauses (see Section 4.2) and a space bounded below by a bottom clause (⊥). All these conditions ensure that the infimum and supremum of two clauses in our hypothesis space are unique. In this first section, C ⪰ D (respectively C ∼ D) means that C is more general than (respectively equivalent to) D with respect to the θOI-subsumption order.

Proposition 4 For any C and D in the space of linked determinate clauses ordered by θOI-subsumption, if C ⪰ D then the injective substitution θ such that Cθ ⊆ D is unique.

Proof Reductio ad absurdum. Let us consider that there exist two different injective substitutions θ1 and θ2 such that Cθ1 ⊆ D and Cθ2 ⊆ D. Since θ1 and θ2 are injective, Cθ1 and Cθ2 only differ in variable naming. C and D are linked clauses; this means that there exists a literal l ∈ C such that lθ1 ∈ D, lθ2 ∈ D and lθ1 ≠ lθ2, where the input variables of l are identical in lθ1 and lθ2 and the output variables are different.
This contradicts the fact that all literals are determinate.

Proposition 5 In the space of linked determinate clauses ordered by θOI-subsumption and bounded below by a bottom clause ⊥, the supremum of any two clauses is unique.

Proof Reductio ad absurdum. Let us consider A1 and A2 as two different suprema of C1 and C2. A1, A2, C1 and C2 are more general than ⊥, so there exists a unique θ_{A1}^⊥ such that A1 θ_{A1}^⊥ ⊆ ⊥ (Proposition 4). In the same way, we have unique θ_{A2}^⊥, θ_{C1}^⊥ and θ_{C2}^⊥ such that A2 θ_{A2}^⊥ ⊆ ⊥, C1 θ_{C1}^⊥ ⊆ ⊥ and C2 θ_{C2}^⊥ ⊆ ⊥. A1 is a supremum of C1, so A1 ⪰ C1 θ_{C1}^⊥ since C1 ∼ C1 θ_{C1}^⊥. Thus, there exists θ1 such that A1 θ1 ⊆ C1 θ_{C1}^⊥. Now, C1 θ_{C1}^⊥ ⊆ ⊥, therefore A1 θ1 ⊆ ⊥, which means that θ1 = θ_{A1}^⊥ (Proposition 4). Therefore, we have A1 θ_{A1}^⊥ ⊆ C1 θ_{C1}^⊥ and, in a similar way, A1 θ_{A1}^⊥ ⊆ C2 θ_{C2}^⊥, A2 θ_{A2}^⊥ ⊆ C1 θ_{C1}^⊥ and A2 θ_{A2}^⊥ ⊆ C2 θ_{C2}^⊥. Let us note S = A1 θ_{A1}^⊥ ∪ A2 θ_{A2}^⊥. Thus, S ⊆ C1 θ_{C1}^⊥ and S ⊆ C2 θ_{C2}^⊥. This means that S ⪰ C1 and S ⪰ C2, since C1 θ_{C1}^⊥ ∼ C1 and C2 θ_{C2}^⊥ ∼ C2. Besides, A1 ⪰ S, A2 ⪰ S, and S ≁ A1, S ≁ A2 because A1 ≁ A2. This contradicts the fact that A1 and A2 are suprema of C1 and C2.

Proposition 6 In the space of linked determinate clauses ordered by θOI-subsumption and bounded below by a bottom clause ⊥, the infimum of any two clauses is unique.

Proof The proof is the same as for the supremum, with C1 and C2 two infima of A1 and A2; then consider I = C1 θ_{C1}^⊥ ∩ C2 θ_{C2}^⊥.

From Propositions 5 and 6, we can conclude that the space of linked determinate clauses ordered by θOI-subsumption and bounded below by a bottom clause ⊥ is a lattice.

B.2 Hypothesis Lattice under θNV-subsumption

As for θOI-subsumption, we show that, in our application framework, the hypothesis search space ordered by θNV-subsumption is a lattice. In the remainder of this appendix, B represents the background knowledge used for our learning task, and ⪰ and ∼ denote the θNV-subsumption order as defined in Section 4.2.

Proposition 7 In the space of well-formed clauses ordered by θNV-subsumption, for any clauses C and D, if C ⪰ D then the injective substitution θ such that f(C)θ ⊆ D (with f such that ∀l ∈ C, B, f(l) |= l) is unique.

Proof The proof is the same as for Proposition 4, by considering C^chain—the subset of C containing the head literal and all the pred/2 and suc/2 literals of C—and by noting that C^chain contains all the variables of C and that, with respect to our particular background knowledge, for any f such that f(C)θ ⊆ D with f such that ∀l ∈ C, B, f(l) |= l, necessarily ∀l ∈ C^chain, f(l) = l.

Proposition 8 In the space of well-formed clauses ordered by θNV-subsumption, the supremum of any two clauses is unique.

Proof Reductio ad absurdum. Let us consider A1 and A2 as two different suprema of C1 and C2. A1 is more general than ⊥, so there exist an injective θ_{A1}^⊥ and an f_⊥^{A1} such that f_⊥^{A1}(A1) θ_{A1}^⊥ ⊆ ⊥, and θ_{A1}^⊥ is unique (Proposition 7). In the same way, we have unique θ_{A2}^⊥, θ_{C1}^⊥ and θ_{C2}^⊥. A1 is a supremum of C1, so there exist θ1 and f1 such that f1(A1) θ1 ⊆ C1 θ_{C1}^⊥. Now, with A1^chain as defined above, A1^chain θ1 ⊆ C1^chain θ_{C1}^⊥ since f(A^chain) = A^chain; in the same way, C1^chain θ_{C1}^⊥ ⊆ ⊥. Therefore, we have A1^chain θ1 ⊆ C1^chain θ_{C1}^⊥ ⊆ ⊥ and then, from Proposition 7, θ1 = θ_{A1}^⊥. Finally, we have f1(A1) θ_{A1}^⊥ ⊆ C1 θ_{C1}^⊥ and, in a similar way, there exist f2, f3 and f4 such that f2(A1) θ_{A1}^⊥ ⊆ C2 θ_{C2}^⊥, f3(A2) θ_{A2}^⊥ ⊆ C1 θ_{C1}^⊥ and f4(A2) θ_{A2}^⊥ ⊆ C2 θ_{C2}^⊥.
Let us note S = A1 θ_{A1}^⊥ ∪ A2 θ_{A2}^⊥ \ {l1 | l1, l2 ∈ (A1 θ_{A1}^⊥ ∪ A2 θ_{A2}^⊥), l1 ≠ l2, and B, l2 |= l1}. S is a well-formed hypothesis and, by construction, A1 θ_{A1}^⊥ ⪰ S and A2 θ_{A2}^⊥ ⪰ S; since A1 θ_{A1}^⊥ ∼ A1 and A2 θ_{A2}^⊥ ∼ A2, then A1 ⪰ S and A2 ⪰ S. We define f5 such that ∀l_i^S ∈ S, f5(l_i^S) = f1(l_i^S) if l_i^S ∈ A1 θ_{A1}^⊥ and f5(l_i^S) = f3(l_i^S) otherwise. Similarly, we define f6 such that ∀l_i^S ∈ S, f6(l_i^S) = f2(l_i^S) if l_i^S ∈ A1 θ_{A1}^⊥ and f6(l_i^S) = f4(l_i^S) otherwise. Thus, f5(S) ⊆ C1 θ_{C1}^⊥ and f6(S) ⊆ C2 θ_{C2}^⊥, which means that S ⪰ C1 θ_{C1}^⊥ and S ⪰ C2 θ_{C2}^⊥. Therefore, S ⪰ C1 and S ⪰ C2. This contradicts the fact that A1 and A2 are suprema of C1 and C2.

Proposition 9 In the space of well-formed clauses ordered by θNV-subsumption and with respect to our background knowledge, the infimum of any two clauses is unique.

Proof The proof is the same as for the supremum, with C1 and C2 two infima of A1 and A2; then consider I = (C1 θ_{C1}^⊥ ∩ C2 θ_{C2}^⊥) ∪ {l1 | l1, l2 ∈ C1 θ_{C1}^⊥ ∪ C2 θ_{C2}^⊥, l1 ≠ l2 and B, l2 |= l1}.

From Propositions 8 and 9, we can conclude that our hypothesis space ordered by θNV-subsumption is a lattice.

References

Rajeev Agarwal. Semantic Feature Extraction from Technical Texts with Limited Human Intervention. PhD thesis, Mississippi State University, USA, 1995.

Susan Armstrong. MULTEXT: Multilingual text tools and corpora. In H. Feldweg and W. Hinrichs, editors, Lexikon und Text, pages 107–119. Max Niemeyer Verlag, Tübingen, Germany, 1996.

Susan Armstrong, Pierrette Bouillon, and Gilbert Robert. Tagger overview. Technical report, ISSCO, University of Geneva, Switzerland, 1995. URL http://issco-www.unige.ch/staff/robert/tatoo/tagger.html.

Liviu Badea and Monica Stanciu. Refinement operators can be (weakly) perfect. In Sašo Džeroski and Peter Flach, editors, Proceedings of the 9th International Conference on Inductive Logic Programming, ILP-99, volume 1634 of LNAI, pages 21–32, Bled, Slovenia, 1999. Springer-Verlag.

Jacques Bouaud, Benoît Habert, Adeline Nazarenko, and Pierre Zweigenbaum. Regroupements issus de dépendances syntaxiques en corpus: Catégorisation et confrontation avec deux modélisations conceptuelles. In Manuel Zacklad, editor, Proceedings of Ingénierie des Connaissances, pages 207–223, Roscoff, France, 1997. AFIA - Éditions INRIA Rennes.

Pierrette Bouillon, Robert H. Baud, Gilbert Robert, and Patrick Ruch. Indexing by statistical tagging. In Martin Rajman and Jean-Cédric Chappelier, editors, Proceedings of Journées d'Analyse statistique des Données Textuelles, JADT'2000, pages 35–42, Lausanne, Switzerland, 2000a.

Pierrette Bouillon and Federica Busa. Generativity in the Lexicon. Cambridge University Press, Cambridge, UK, 2001.

Pierrette Bouillon, Vincent Claveau, Cécile Fabre, and Pascale Sébillot. Using part-of-speech and semantic tagging for the corpus-based learning of qualia structure elements. In Pierrette Bouillon and Kyoko Kanzaki, editors, Proceedings of the First International Workshop on Generative Approaches to the Lexicon, GL'2001, Geneva, Switzerland, 2001. Geneva University Press.

Pierrette Bouillon, Cécile Fabre, Pascale Sébillot, and Laurence Jacqmin. Apprentissage de ressources lexicales pour l'extension de requêtes. Traitement Automatique des Langues, special issue: Traitement automatique des langues pour la recherche d'information, 41(2):367–393, 2000b.

Pierrette Bouillon, Sabine Lehmann, Sandra Manzi, and Dominique Petitpierre. Développement de lexiques à grande échelle.
In André Clas, Salah Mejri, and Taïeb Baccouche, editors, Proceedings of Colloque de Tunis 1997 "La mémoire des mots", pages 71–80, Tunis, Tunisia, 1998. Serviced.

Ted Briscoe and John Carroll. Automatic extraction of subcategorisation from corpora. In Paul Jacobs, editor, Proceedings of the 5th ACL Conference on Applied Natural Language Processing, pages 356–363, Washington, USA, 1997. Morgan Kaufmann.

Wray Lindsay Buntine. Generalized subsumption and its application to induction and redundancy. Artificial Intelligence, 36(2):149–176, 1988.

Floriana Esposito, Angela Laterza, Donato Malerba, and Giovanni Semeraro. Refinement of Datalog programs. In B. Pfahringer and J. Fürnkranz, editors, Proceedings of the MLnet Familiarization Workshop on Data Mining with Inductive Logic Programming, pages 73–94, Bari, Italy, 1996.

Cécile Fabre and Pascale Sébillot. Semantic interpretation of binominal sequences and information retrieval. In Proceedings of the International ICSC Congress on Computational Intelligence: Methods and Applications, CIMA'99, Symposium on Advances in Intelligent Data Analysis AIDA'99, Rochester, N.Y., USA, 1999.

Cécile Fabre. Interprétation automatique des séquences binominales en anglais et en français. Application à la recherche d'informations. PhD thesis, University of Rennes 1, France, 1996.

David Faure and Claire Nédellec. Knowledge acquisition of predicate argument structures from technical texts using machine learning: The system ASIUM. In Dieter Fensel and Rudi Studer, editors, Proceedings of the 11th European Workshop EKAW'99, pages 329–334, Dagstuhl, Germany, 1999. Springer-Verlag.

Christiane Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, USA, 1998.

Edith Galy. Repérer en corpus les associations sémantiques privilégiées entre le nom et le verbe: Le cas de la fonction dénotée par le nom. Master's thesis, Université de Toulouse - Le Mirail, France, 2000.

Gregory Grefenstette. Corpus-derived first, second and third-order word affinities. In W. Martin, W. Meijs, M. Moerland, E. ten Pas, P. van Sterkenburg, and P. Vossen, editors, Proceedings of EURALEX'94, Amsterdam, The Netherlands, 1994a.

Gregory Grefenstette. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers, Dordrecht, 1994b.

Gregory Grefenstette. SQLET: Short query linguistic expansion techniques, palliating one-word queries by providing intermediate structure to text. In Luc Devroye and Claude Chrisment, editors, Proceedings of Recherche d'Informations Assistée par Ordinateur, RIAO'97, pages 500–509, Montréal, Québec, Canada, 1997.

Benoît Habert, Adeline Nazarenko, and André Salem. Les linguistiques de corpus. Armand Collin/Masson, Paris, 1997.

Zelig Harris, Michael Gottfried, Thomas Ryckman, Paul Mattick (Jr), Anne Daladier, Tzvee N. Harris, and Suzanna Harris. The Form of Information in Science, Analysis of Immunology Sublanguage. Kluwer Academic Publisher, Dordrecht, 1989.

Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Christian Boitet, editor, Proceedings of the 14th International Conference on Computational Linguistics, COLING-92, pages 539–545, Nantes, France, 1992.

Nicolas Helft. Inductive Generalization: A logical framework. In Ivan Bratko and Nada Lavrac, editors, Proceedings of the 2nd European Working Session on Learning, EWSL, pages 149–157, Bled, Yugoslavia, 1987. Sigma Press.

Nancy Ide and Jean Véronis. MULTEXT (multilingual tools and corpora).
In Proceedings of the 15th International Conference on Computational Linguistics, COLING-94, pages 90–96, Kyoto, Japan, 1994. Morgan Kaufmann.

Bernard Jones. Can punctuation help parsing? Technical Report 29, Centre for Cognitive Science, University of Edinburgh, UK, 1994.

Ron Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Chris S. Mellish, editor, Proceedings of the 14th International Joint Conference on Artificial Intelligence, IJCAI 95, pages 1137–1145, Montréal, Québec, Canada, 1995. Morgan Kaufmann.

Emmanuel Morin. Extraction de liens sémantiques entre termes à partir de corpus de textes techniques. PhD thesis, Université de Nantes, France, 1999.

Stephen Muggleton. Inverse entailment and Progol. New Generation Computing, 13(3-4):245–286, 1995.

Stephen Muggleton and Luc De Raedt. Inductive logic programming: Theory and methods. Journal of Logic Programming, 19-20:629–679, 1994.

Stephen Muggleton and Cao Feng. Efficient induction of logic programs. In Setsuo Arikawa, S. Goto, Setsuo Ohsuga, and Takashi Yokomori, editors, Proceedings of the 1st Conference on Algorithmic Learning Theory, pages 368–381, Tokyo, Japan, 1990. Springer-Verlag - Ohmsha.

Claire Nédellec, Céline Rouveirol, Hilde Adé, Francesco Bergadano, and Birgit Tausend. Declarative bias in inductive logic programming. In Luc De Raedt, editor, Advances in Inductive Logic Programming, pages 82–103. IOS Press, Amsterdam, 1996.

Shan-Hwei Nienhuys-Cheng and Ronald de Wolf. Least generalizations and greatest specializations of sets of clauses. Journal of Artificial Intelligence Research, 4:341–363, 1996.

Dominique Petitpierre and Graham Russell. MMORPH - the MULTEXT morphology program. Technical report, ISSCO, University of Geneva, Switzerland, 1998.

Ronan Pichon and Pascale Sébillot. Acquisition automatique d'informations lexicales à partir de corpus: Un bilan. Research report 3321, INRIA, Rennes, France, 1997.

Ronan Pichon and Pascale Sébillot. From corpus to lexicon: From contexts to semantic features. In Barbara Lewandowska-Tomaszczyk and Patrick James Melia, editors, Proceedings of Practical Applications in Language Corpora, PALC'99, volume 1 of Lodz Studies in Language, pages 375–389. Peter Lang, 2000.

Gordon D. Plotkin. A note on inductive generalization. In B. Meltzer and D. Michie, editors, Machine Intelligence 5, pages 153–163, Edinburgh, 1970. Edinburgh University Press.

James Pustejovsky. The Generative Lexicon. MIT Press, Cambridge, MA, USA, 1995.

James Pustejovsky, Peter Anick, and Sabine Bergler. Lexical semantic techniques for corpus analysis. Computational Linguistics, 19(2):331–358, 1993.

John Ross Quinlan. Learning logical definitions from relations. Machine Learning, 5(3):239–266, 1990.

Anke Rieger. Optimizing chain Datalog programs and their inference procedures. LS-8 Report 20, University of Dortmund, Lehrstuhl Informatik VIII, Dortmund, Germany, 1996.

Gerard Salton. Automatic Text Processing. Addison-Wesley, 1989.

Pascale Sébillot, Pierrette Bouillon, and Cécile Fabre. Inductive logic programming for corpus-based acquisition of semantic lexicons. In Claire Cardie, Walter Daelemans, Claire Nédellec, and Erik Tjong Kim Sang, editors, Proceedings of the Fourth Conference on Computational Natural Language Learning (CoNLL-2000) and of the Second Learning Language in Logic Workshop (LLL-2000), pages 199–208, Lisbon, Portugal, September 2000.
Giovanni Semeraro, Floriana Esposito, Donato Malerba, Clifford Brunk, and Michael J. Pazzani. Avoiding non-termination when learning logic programs: A case study with FOIL and FOCL. In L. Fribourg and F. Turini, editors, Proceedings of Logic Program Synthesis and Transformation - Meta-Programming in Logic, LOPSTR 1994, volume 883 of LNCS, pages 183–198. Springer-Verlag, 1994.

Alan F. Smeaton. Using NLP or NLP resources for information retrieval tasks. In Tomek Strzalkowski, editor, Natural Language Information Retrieval, pages 99–111. Kluwer Academic Publishers, Dordrecht, 1999.

Karen Spärck Jones. What is the role of NLP in text retrieval? In Tomek Strzalkowski, editor, Natural Language Information Retrieval, pages 1–24. Kluwer Academic Publishers, Dordrecht, 1999.

Tomek Strzalkowski. Natural language information retrieval. Information Processing and Management, 31(3):397–417, 1995.

Fabien Torre and Céline Rouveirol. Natural ideal operators in inductive logic programming. In M. van Someren and G. Widmer, editors, Proceedings of the 9th European Conference on Machine Learning (ECML'97), volume 1224 of LNAI, pages 274–289, Prague, Czech Republic, April 1997a. Springer-Verlag.

Fabien Torre and Céline Rouveirol. Opérateurs naturels en programmation logique inductive. In Henri Soldano, editor, 12èmes Journées Françaises d'Apprentissage (JFA'97), pages 53–64, Roscoff, France, 1997b. AFIA - Éditions INRIA Rennes.

Fabien Torre and Céline Rouveirol. Private properties and natural relations in inductive logic programming. Technical Report 1118, Laboratoire de Recherche en Informatique d'Orsay (LRI), France, July 1997c.

Laurence Vandenbroucke. Indexation automatique par couples nom-verbe pertinents. DES information and documentation report, Faculté de Philosophie et Lettres, Université Libre de Bruxelles, Belgium, 2000.

Ellen M. Voorhees. Query expansion using lexical-semantic relations. In W. Bruce Croft and C. J. van Rijsbergen, editors, Proceedings of ACM SIGIR'94, Dublin, Ireland, 1994. ACM - Springer-Verlag.

Stefan Wermter, Ellen Riloff, and Gabriele Scheler, editors. Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing, volume 1040 of LNCS. Springer-Verlag, 1996.

Yorick Wilks and Mark Stevenson. The grammar of sense: Is word-sense tagging much more than part-of-speech tagging? Technical report, University of Sheffield, UK, 1996.