
Acquisition of Turkish meronym based on classification of patterns

Pattern Analysis and Applications, 2015
SHORT PAPER

Tuǧba Yıldız¹ · Banu Diri² · Savaş Yıldırım¹

Received: 26 December 2014 / Accepted: 14 August 2015
© Springer-Verlag London 2015

Abstract  The identification of semantic relations from raw text is an important problem in Natural Language Processing. This paper presents semi-automatic pattern-based extraction of part–whole relations. We utilized and adapted lexico-syntactic patterns to disclose the meronymy relation from a Turkish corpus. We applied two different approaches to prepare the patterns: the first is based on pre-defined patterns taken from the literature; the second automatically produces patterns by means of a bootstrapping method. While the pre-defined patterns are applied directly to the corpus, the bootstrapped patterns must first be discovered from manually prepared unambiguous seeds. Word pairs are then extracted by their occurrence in those patterns. In addition, we applied statistical selection to global data obtained from the results of all patterns: a whole-by-part matrix to which several association metrics, such as information gain and T-score, are applied. We examined how all these approaches improve system accuracy, especially within a corpus-based approach using the distributional features of words. Finally, we conducted a variety of experiments with a comparative analysis and showed the advantages and disadvantages of the approaches, with promising results.

Keywords  Corpus-based method · Lexico-syntactic pattern · Meronym · Part–whole

1 Introduction

Semantic relation refers to the relation between words, phrases, sentences, and documents. One of the important semantic relations is meronymy, which represents the relationship between a part and its corresponding whole. The meronym is also mentioned in the literature under other names such as part–whole, mereological parthood relation, or partonomy [1–3].
Meronymic relationship has been a subject of several disciplines, such as cognitive linguistics [1, 4], logic [5], psycholinguistics [6–8], and linguistics [9, 10]. Its many aspects make meronymy quite difficult to handle and a complex relation, because it is hard to differentiate the meronym relation from other semantic relations. Besides that, there is no agreement on how to distinguish the various kinds of meronymic relations. For example, the concept "part of" is used in many studies to denote a family of meronymic relations, because "part of" does not always refer to a specific meronymy; it represents a variety of meronym relations captured by test frames such as "X is a part of Y". Studies in the literature have therefore often provided insights about the several different types of meronymic relations [2, 6–9, 11].

✉ Tuǧba Yıldız, tdalyan@bilgi.edu.tr; Banu Diri, banu@ce.yildiz.edu.tr; Savaş Yıldırım, savas.yildirim@bilgi.edu.tr
1 Department of Computer Engineering, Faculty of Engineering, İstanbul Bilgi University, Santral Campus, Eski Silahtaraǧa Elektrik Santralı, Kazım Karabekir Cad. No: 2/13, 34060 Eyüp, İstanbul, Turkey
2 Department of Computer Engineering, Faculty of Electric and Electronic, Yıldız Technical University, D Blok Davutpaşa Mah., Davutpaşa Caddesi, 34220 Esenler, İstanbul, Turkey
Pattern Anal Applic, DOI 10.1007/s10044-015-0516-9

In this study, we present a model for semi-automatically extracting part–whole relations from Turkish raw text. For this purpose, three different clusters of patterns
were analyzed in a Turkish corpus: General, Dictionary-based, and Bootstrapped patterns. The first cluster is based on the general patterns most widely used in the literature. These patterns were collected from some pioneer studies [2, 8, 12] and analyzed for Turkish; 240K cases were obtained from the general patterns. The second cluster is based on dictionary patterns extracted from TDK¹ and Wiktionary²; the number of cases is 509K for the dictionary-based patterns. We adapted both types of patterns to extract sentences that include part–whole relations from a Turkish corpus. Patterns that are not suitable and applicable for the Turkish language were eliminated. The most frequent wholes were selected for each lexico-syntactic pattern (LSP). Each whole and its potential parts were ranked according to their frequencies. The third cluster is based on bootstrapping from unambiguous seeds. Manually prepared seeds were used to induce and score LSPs. Six reliable patterns were extracted; some were eliminated according to the experiments. We compared the strength of several association measures with respect to their precision. A variety of statistical methods were applied to the global data obtained from the entire set of patterns to improve system performance, especially recall. For the evaluation, we selected the first 10, 20, and 30 candidates ranked by association measures such as Dice, T-score, etc. The proposed parts of a given whole were manually evaluated by examining their semantic roles.

The rest of this paper is organized as follows: Sect. 2 presents related work in computational linguistics. The methodology is presented in Sect. 3. The statistical measurements used in this study are described in Sect. 4. Challenges are listed in Sect. 5. The evaluation of the study is explained in Sect. 6. Production capacity and success rate are given in Sect. 7.
2 Meronym studies in computational linguistics

In computational linguistics, a comprehensive body of work has addressed automatically discovering semantic relations, driven by a variety of needs such as enriching ontologies or building lexical databases. Although manually built lexical resources such as WordNet and FrameNet are very valuable for Natural Language Processing (NLP) problems, they have limited capacity and might not keep up with evolving language, e.g., social media language or so-called texting language. Recent studies have emphasized the importance of automatically constructing such lexical databases, which is especially important for domain-specific text and open-vocabulary systems. For example, the study [13] designed an architecture to capture synonym, is-a, and meronym relations for a gene ontology. Other studies on meronym extraction have been reported for the domains of college biology [14], biomedical text [15], and product development and customer services [16].

Recently, various studies have employed hand-crafted LSPs, a useful technique especially for semantic relation extraction. Although manually crafting patterns is the most preferred method due to its simplicity and success, it can be time consuming. To cope with that drawback, a bootstrapping approach using seeds was proposed to construct patterns [17]. In addition, machine learning techniques using contextual information, as well as hybrid methods, have also been offered as alternatives for meronym extraction [12, 18–20].

The most precise and well-known method relying on LSPs was applied by [21]. Hand-crafted patterns were identified and suggested for hyponym (is-a) relations in raw text. Although the same technique was applied to extract meronym relations in [21], the results were reported to show no great success. In [22], a statistical method was proposed to find parts in a very large corpus.
Using Hearst's method, five lexical patterns and six seeds (book, building, car, hospital, plant, and school) were identified for wholes. Part–whole relations extracted using the patterns were ranked according to statistical criteria, with an accuracy of 70 % for the top 20 words and 55 % for the top 50 words.

A semi-automatic method was presented in [23] for learning semantic constraints to detect part–whole relations. The method picked up pairs from WordNet and searched for them in text collections: SemCor and the LA Times from TREC-9. Sentences containing the pairs were extracted and manually inspected to obtain a list of LSPs. A training corpus was generated by manually annotating positive and negative examples. A decision tree algorithm was used as the learning procedure; the model's accuracy was 83 %. An extended version of this study was proposed in [12], where an average precision of 80.95 % was obtained.

The study [24] developed a method to discover part–whole relations from vocabularies and text. The method followed two main phases: learning part–whole patterns and learning wholes by applying the patterns. An average precision of 74 % was achieved. As another similar study, Espresso [17] used patterns to find several semantic relations, including meronymic relations, by a bootstrapping algorithm. The method started by applying seed pairs to automatically detect generic patterns. Espresso ranked and filtered patterns/instances with reliability scoring. System performance for part-of relations on TREC was 80 % precision.

1 Türk Dil Kurumu (The Turkish Language Association).
2 Vikisözlük: Özgür Sözlük.

A similar approach to Espresso was proposed in [16, 25]. A set of seeds for each type of part–whole
relations was defined. Espresso was successfully used to retrieve part–whole relations from the corpus. For English corpora, precision was 80 % for general seeds and 82 % for structural "part of" seeds.
Another attempt at automatic extraction of part–whole relations targeted a Chinese corpus [26]. Sentences containing part–whole relations were manually picked and then annotated to obtain LSPs. The patterns were employed on a training corpus to find pairs of concepts, and a set of heuristic rules was proposed to confirm part–whole relations. The model was evaluated with a precision of 86 %. Another important study for Chinese was done by [27], focusing on Named Entity components and their relations; the overall average precision was 63.92 %.

In Turkish, recent studies on harvesting meronym relations were based on dictionary definitions (TDK) and Wiktionary. The studies [28, 29] manually extracted various patterns from dictionary definitions (TDK) using several features (morphological structures, noun clauses, clue words, and the order of the words in the sentence) to develop a semantic network for Turkish. In the first step, they defined phrasal patterns observed in dictionary definitions that represent specific semantic relations. Second, the reliable patterns were applied to the dictionary to find the relations. They accounted for seven different semantic relations in [29]: hyponym, synonym, antonym, amount-of, group-of, member-of, and has-a. Only the last four can be subsumed by meronym relations; their accuracies were 81, 87, 96, and 82 %, respectively. In another study for Turkish [30], a similar pattern-based approach was applied to TDK and Wiktionary. They listed ten different semantic relations, five of which can be counted as meronymic: made-of, part–whole, created-by, location-of, and purpose. Accuracy rates were 48, 55, 36, 34, and 55 % for these relations, respectively. All these studies are based on dictionary definitions (TDK) and Wiktionary in Turkish.

The first major attempt [31] modeled semi-automatic extraction of part–whole relations from a Turkish corpus.
The model takes a list of manually prepared seeds to induce syntactic patterns and estimates their reliabilities. It then captures variations of part–whole candidates from the corpus with 67 % precision. Another study based on LSPs [32] extracted meronym relations; it exploited and compared a list of patterns and examined how these patterns improve system performance, especially for a Turkish corpus.

The main objective of all these studies is to elaborate language resources. It is obvious that less-studied languages lack various resources. Turkish is among the less-studied languages, and it greatly needs such works and resources; it is worth elaborating such language resources by means of automatic architectures. Our corpus-driven meronym extraction architecture is considered the first comprehensive study and major attempt for Turkish meronymy.

3 Methodology

The methodology here is to acquire part–whole pairs from a Turkish corpus of 500M tokens. A morphological parser based on two-level morphology, with an accuracy of 98 %, was used [33]. The web corpus, containing four sub-corpora, was used as raw text. Three of them come from major Turkish news portals, and the fourth is a general sampling of web pages in the Turkish language. In this study, the meronym relation was considered a noun-to-noun relation, rather than one involving other POS tags.

We evaluate three different clusters of patterns in different aspects: General Patterns (GP), Dictionary-based Patterns (TDK-P), and Bootstrapped Patterns (BP). While general patterns are widely used and well known, especially on huge corpora, the dictionary-based patterns are suitable and applicable to dictionary-like resources (TDK, Wikipedia, etc.). Although the latter are suited to dictionaries, we discuss whether they have the productive capacity to disclose semantic relations from a corpus. The last approach is to bootstrap patterns using a set of part–whole seeds.
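As a toy illustration of how an LSP is applied to raw text, the sketch below uses a simplified English stand-in pattern and invented sentences; the actual system applies Turkish, morpheme-aware regular expressions rather than this surface form.

```python
import re

# A simplified English stand-in for one general LSP, "NPx part of NPy";
# articles are skipped with a non-capturing group. Pattern and corpus
# are illustrative only, not the paper's actual regular expressions.
PATTERN = re.compile(r"(\w+) is a part of (?:the |a )?(\w+)")

def extract_pairs(text):
    """Return (part, whole) pairs matched by the pattern."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]

corpus = "The engine is a part of the car. A door is a part of a house."
print(extract_pairs(corpus))  # [('engine', 'car'), ('door', 'house')]
```

Each match instantiates one candidate (part, whole) pair, which is later filtered by frequency and association scores.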
In addition, we conducted experiments with different statistical measures of association, such as Information Gain (IG), χ², etc., to evaluate their performance. They are compared in terms of precision and recall scores within a variety of experiments.

3.1 General patterns (GP)

The most precise acquisition methodology, applied earlier by [21], relies on LSPs. We start with the same idea of using the most widely used patterns to acquire part–whole relations. GP have been widely used and well-known patterns in several studies [2, 8, 12]. One of these studies [8] used frames such as part of, partly, and made of for six different types of meronymic relations. Another study [12] showed that some patterns always refer to a part–whole relation in English text, while most of them are ambiguous. The study [2] developed a formal taxonomy, distinguishing transitive mereological (1) part–whole relations from intransitive meronymic (2) ones. All general patterns used in these three studies are listed in Table 1. In the tables, NPx represents the "part" and NPy the "whole". Various other studies have also used these patterns; most of them are subsumed by the following patterns.

Table 1 Patterns used in three different studies; (1) marks transitive mereological and (2) intransitive meronymic relations

  Winston et al. [8]:  NPx part of NPy; NPx member of NPy; NPx partly NPy; NPy made of NPx
  Girju et al. [12]:   Parts of NPy include NPx; NPy consist of NPx; NPy made of NPx; One of NPy constituents NPx
  Keet and Artale [2]: NPx member of NPy (1); NPx constituted of NPy (1); NPx subquantity of NPy (1); NPx participates in NPy (1); NPx involved in NPy (2); NPx located in NPy (2); NPx contained in NPy (2); NPx structural part of NPy (2)

All the patterns were manually adapted into Turkish equivalents, where syntactic and morphological difficulties were handled by suitable LSPs with regular expressions. The patterns are equivalent to the English patterns in terms of translation and meaning. The process was carried out by accessing and utilizing each morpheme to extract the sentences bearing a part–whole relation. As expected, patterns that were not suitable and applicable for the Turkish language were eliminated; the remaining patterns were evaluated in terms of capacity and reliability. A summary of the general patterns is given in Table 2. Turkish equivalents of these patterns were constructed as regular expressions and are listed in Table 12 (Sect. 9).

Table 2 A summary for general patterns

  GP                      #ofCases  #ofWholes  Most frequent wholes
  NPx part of NPy         19K       2.5K       Life, Culture, Turkey
  NPx member of NPy       23K       2K         Commission, Turkey, Group
  NPy constituted of NPx  598       293        System, Program, Project
  NPy made of NPx         6.3K      1.7K       Questionnaire, Public opinion
  NPy consist of NPx      9.2K      2K         Report, Material, Product
  NPy has/have NPx        120K      8.2K       Turkey, Person, Job
  NPy with NPx            68.8K     8.7K       Person, Government, Turkey

To evaluate the approach, we picked the most frequent wholes for each LSP. For each whole, its potential parts are ranked according to their frequencies. To capture distinctiveness, we normalized the frequency by dividing the number of times a part occurs with a given whole by the number of times that part is retrieved by all patterns. We selected the first 30 candidates ranked by their scores for evaluation.
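The normalized-frequency ranking just described can be sketched as follows; the function name and toy counts are ours, not the paper's.

```python
from collections import defaultdict

def rank_parts(pair_counts, whole, top_n=30):
    """Rank candidate parts of one whole by normalized frequency.

    pair_counts maps (whole, part) -> co-occurrence count aggregated
    over all patterns. The score divides how often a part occurs with
    the given whole by how often that part is retrieved in total, so
    promiscuous heads (e.g., "side") are down-weighted.
    """
    part_totals = defaultdict(int)          # total retrievals per part
    for (_, part), c in pair_counts.items():
        part_totals[part] += c

    scores = {
        part: c / part_totals[part]
        for (w, part), c in pair_counts.items()
        if w == whole
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# invented counts for illustration
counts = {("car", "engine"): 8, ("car", "side"): 5,
          ("house", "side"): 20, ("house", "door"): 10}
print(rank_parts(counts, "car"))  # [('engine', 1.0), ('side', 0.2)]
```

Here "engine" outranks "side" for "car" even though the raw counts are close, because "side" also occurs with many other wholes.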
The proposed parts were manually evaluated by examining their semantic roles.

3.2 Dictionary-based patterns (TDK-P)

An efficient and reliable way of applying LSPs is to extract information from Machine-Readable Dictionaries (MRDs). The language used in a dictionary is generally simple and informative, and it typically includes a set of syntactic patterns. Thus, many studies have recently exploited dictionary definitions. For Turkish, the studies [28–30] exploited dictionary definitions (TDK) and Wiktionary, applying structural patterns in regular expressions to harvest semantic relations. We examined all meronym-related patterns from these studies and carried them over to our study. A summary of the dictionary-based patterns is given in Table 3. Member-of, made-of, consist-of, and has/have can be confused with their counterparts among the general patterns, but the pattern specifications differ (see Table 13 in Sect. 9). All patterns were applied to the Turkish corpus as with the general patterns, and a similar process was carried out. Even though these patterns are useful, especially for dictionaries, they need to be checked for redundant or incorrect results.

3.3 Bootstrapped patterns (BP)

Bootstrapped patterns are obtained quite differently from the two clusters described above. The approach is implemented in two phases: pattern identification and part–whole pair detection. For pattern identification, we begin by manually preparing a set of unambiguous seed pairs that definitely convey a part–whole relation; for instance, the pair (engine, car) would be a member of that set. The seed set is further divided into two subsets: an extraction set and an assessment set. Each pair in the extraction set is used as a query for retrieving sentences containing that pair.
Then, we generalize many LSPs by replacing the part and whole tokens with a wild-card or any meta-character. The second set, the assessment set, is then used to compute usefulness or reliability scores for all captured patterns. Patterns with low reliability were eliminated; the remaining patterns were kept along with their reliability scores. A classic way to estimate the reliability of an extraction pattern is to measure how correctly it identifies the parts of a given whole: the success rate is obtained by dividing the number of correctly extracted pairs by the number of all extracted pairs. The outcome of this phase is a list of reliable LSPs along with their reliability scores.

Table 3 A summary for TDK-P

  TDK-P                                       #ofC   #ofW  Most frequent wholes
  Group-of (whole|group|all|set|flock|union)  22.7K  3.6K  Game, Human, Woman
  Member-of (class|member|team)               20K    3.8K  Turkey, Team, Newspaper
  Member-of (from the family of NPy)          184    47    Legumes, Rosacea, Citrus fruit
  Amount-of (amount|measure|unit)             3.4K   1.4K  Bank, Dollar, Euro
  Has/Have (NPy has the suffix -l(H))         445K   3.7K  Game, Human, Woman
  Consist-of                                  12.4K  2.7K  Group, Committee, Team
  Made-of                                     4.9K   1.4K  Payment, Interruption, Import

  #ofC number of cases, #ofW number of wholes, H high vowel (ı, i, u, ü)

The remaining phase is the same as for the previous pattern methods. The instantiated instances (part–whole pairs) are assessed and ranked according to their reliability scores. There are several ways to compute a reliability score for both patterns and instances. We experiment with three different measures of association (Pmi, Dice, T-score) to evaluate their performance in the scoring function. We also utilized idf to cover more specific parts; the motivation for using idf is to differentiate distinctive features from common ones. Differences between distinctive and general parts are emphasized in Sect. 6.3.
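The three association measures named above can be sketched with their standard textbook definitions; the paper's exact scoring variants (and the idf-weighted versions) may differ, and the counts below are invented.

```python
import math

def pmi(fwp, fw, fp, n):
    """Pointwise mutual information: log2 P(w,p) / (P(w) P(p))."""
    return math.log2((fwp / n) / ((fw / n) * (fp / n)))

def dice(fwp, fw, fp):
    """Dice coefficient: 2 * joint / (sum of marginals)."""
    return 2 * fwp / (fw + fp)

def t_score(fwp, fw, fp, n):
    """T-score: (observed - expected) / sqrt(observed)."""
    return (fwp - fw * fp / n) / math.sqrt(fwp)

# hypothetical corpus statistics
n = 10_000                   # total observations
fwp, fw, fp = 50, 200, 400   # f(whole, part), f(whole), f(part)

print(round(pmi(fwp, fw, fp, n), 3))      # ~2.644
print(round(dice(fwp, fw, fp), 3))        # ~0.167
print(round(t_score(fwp, fw, fp, n), 3))  # ~5.94
```

Candidate parts are scored with one of these measures, sorted, and the first K are kept, as described in the text.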
All findings have already been reported in our previous study [31]. Based on reliability scores, we filtered out some generated patterns and finally obtained six significant patterns. The list of the patterns and examples can be found in Table 4. The quality of each pattern is checked against a given assessment set: initially setting the instance reliability of all pairs in the assessment set to 1, the reliability score of each pattern is computed. P1 was found to be the most reliable pattern in all respects. P1 is based on the genitive case, which many studies have utilized for this problem. We roughly order the patterns as P1, P2, P3, P6, P4, and P5 by their normalized average scores.

To calculate the reliability of instances, the following association measures are used: Pmi, Pmi-idf, Dice, Dice-idf, T-score, and T-score-idf. For a particular whole noun, all possible parts instantiated by the patterns are selected as a candidate set. For each association measure, the reliability scores of both patterns and instances were calculated and sorted. The first K candidate parts were checked against the expected parts. For the evaluation phase, we manually and randomly selected five whole words: book, computer, ship, gun, and building. For a better evaluation, we selected the first 10, 20, and 30 candidates ranked by the association measures mentioned above.

Table 4 Bootstrapped patterns and examples

  P1. NPy + Gen  NPx + Pos           Evin kapısı (door of the house)
  P2. NPy + Nom  NPx + Pos           Ev kapısı (house door)
  P3. NPy + Gen (N-ADJ) + NPx + Pos  Evin arka bahçe kapısı (back garden gate of the house)
  P4. NPy of one-of NPxs             Evlerden birinin kapısı (the door of one of the houses)
  P5. NPx whose NPy                  Kapısı kilitli olan ev (the house whose door is locked)
  P6.
NPxs with NPy The house with garden and pool Bahçeli ve havuzlu ev 123 Pattern Anal Applic 4 Statistical selection So far, we have selected first N most frequent parts for a given whole by running a specific pattern in GP or TDK-P. We have taken each pattern individually and evaluated the results. Instead, in this part, we retrieved all candidates part–whole pairs obtained from all patterns (in GP and TDK-P) and built a big whole-by-part matrix, namely global matrix, whose cellij represents how many times wholei and partj co-occur together, no matter which patterns produce them. In order to compare the clusters GP and TDK-P, we also used two separate bunches, and a big integrated one as well. The global whole-by-pair matrix gives a chance to apply some statistical metrics such as IG, v2 test, etc. If a part particularly occurs with a specific whole, it indicates that there is a meaningful link between them. Or, if a common part mostly appears with many wholes, its global importance is lower than others as formulated in idf. By applying the formulas such as v2 value or IG, raw counts of the global matrix can be converted in that each cell can represent a weighted value. times a given whole and a given part appear together. With this, we can retrieve most frequent N parts for a given whole. This score adds up all frequency number from all patterns, hence, gives a baseline with which we can compare the models. 5 Challenges We have faced many problems so far. Here, we have discussed those problems that mostly encountered in this kind of studies along with some solutions. • 4.1 Baseline algorithm Each approach must have its own baseline algorithm because their circumference might have particular advantage or disadvantage due to many factors. 
To designate a baseline algorithm for bootstrapped patterns, for a given whole, its possible parts are retrieved from a list ranked by association measure between whole and part that are instantiated by a reliable pattern as formulated in Eq. (1). • • j whole; pattern; part j assoc (whole, part) ¼ j; pattern, part jj whole, pattern ; j ð1Þ We intuitively designated a baseline algorithm to compare the results and the expectation. A proposed model should outperform the baseline algorithm. The baseline function is based on most reliable and productive pattern, the genitive pattern. The capacity of pattern is about 2M part–whole pairs. For a given whole, all candidate parts of it in the genitive pattern are extracted. Taking co-occurrence frequency between the whole and part could be misleading due to some nouns frequently placed in part/head position such as side, front, behind, outside. To overcome the problem of the co-occurrence, the individual distributions of both whole and part must be taken into account as shown in Eq. (1). These final scores are ranked and their first K parts are selected as the output of baseline algorithm. For the evaluation of GP and TDK-P, we applied different baseline algorithm. The matrix shows how many 123 • Almost all studies suffer from the very basic problem of NLP: ‘‘ambiguity of sense’’?. For a given whole, proposed parts could be incorrect due to polysemous words. The study [12] represented that some of the patterns always refer to part–whole relation in English text, while most of them are ambiguous. ‘‘part of’’ pattern, genitive construction, the verb -to have, noun compounds, and prepositional construction are classified as ambiguous meronymic expressions. For Turkish domain, we could not easily do such classification and find even one unambiguous pattern to extract part– whole relation. Additional methods are needed to cope with the problem and to find more accurate results. 
• Adopting the general patterns of other studies to the Turkish domain is difficult because Turkish is a free-word-order language: noun phrases can easily change their position in a sentence without changing its meaning. Determining a window for the potential parts is therefore crucial. A smaller window size can miss real parts, whereas a larger window extracts many irrelevant NPs from the wider context and deteriorates system performance. We observed that a window size of 15 allows us to capture the more reliable parts and sentences.
• The patterns can also encode other semantic relations, such as hyponymy or relatedness. Although the genitive case is very popular for detecting part–whole relations, it is ambiguous. The morphological feature of the genitive is a good indicator of a semantic relation between a head and its modifier, and we found that it has good indicative capacity, but it can encode various semantic interpretations. Taking the example "Ali's team", the first interpretation could be that the team belongs to Ali; the second is that it is Ali's favorite team, the team he supports. The genitive also covers relations such as "Ali's pencil" (Possession), "Ali's father" (Kinship), and "Ali's handsomeness" (Attribute). The same difficulties hold for other patterns. To overcome this problem, statistical evidence has been utilized so far.
• Even the best patterns are not safe all the time. The sentence "the door is a part of the car" strongly represents a part–whole relation, whereas "he is part of the game" gives only an ambiguous relation. The phrase "part of" has nine different meanings in TDK, which makes the relation correspondingly harder to disclose.
• Some patterns tend to disclose particular relations such as Possession, Kinship, Ownership, Attribute, Attachment, and Property, which are considered part–whole relations in this study. Some can retrieve other types of semantic relations, such as hyponymy, synonymy, relatedness, etc.
• The model mostly needs background knowledge, especially for domain-specific problems. For instance, when running the models on the football domain, the model needs an ontology covering facts such as "Manchester United is a football team".
• Some expressions are more informal than written language or grammar allows. Indeed, in any language, different kinds of expression are appropriate in different situations. From formal to informal, from written to spoken, from jargon to slang, all types of expression are part of a corpus. This variety is another bottleneck for applying regular expressions or patterns.
• Some words are not suitable for meronymy relations. Even in WordNet, many synsets have no meronym relation. For example, how many parts can the words "result" or "point" have? Abstract words in particular are harder to evaluate than concrete ones; evaluation must therefore take word characteristics into account.
• The rich morphology of the Turkish language is a barrier for computational syntax. For instance, an English phrase of more than ten words can be translated into a single Turkish word by means of morphological suffixes.
• Some patterns have very limited capacity. For example, içeren parçaları (parts of NPy include NPx) and kısmen (partly) have very poor results. Both are excluded because of the number of returned cases: the first pattern returns only 2 cases and the latter only 10.
• Some wholes have limited parts, for example, ithalat (import), baklagiller (legume family), ödeme (payment), başvuru (application), dosya (file), etc.
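To make the baseline of Sect. 4.1 concrete, the association score of Eq. (1) can be sketched in a few lines of Python. The data layout and names here are illustrative assumptions, not the original implementation:

```python
from collections import Counter

def assoc(whole, part, matches):
    """Association score in the spirit of Eq. (1): the joint count of
    (whole, part) under one pattern, normalized by how often the part and
    the whole each fill that pattern with any partner."""
    joint = Counter()      # (whole, part) -> co-occurrence count
    any_whole = Counter()  # part -> count with any whole
    any_part = Counter()   # whole -> count with any part
    for w, p in matches:   # matches: (whole, part) pairs instantiated by the pattern
        joint[(w, p)] += 1
        any_whole[p] += 1
        any_part[w] += 1
    if joint[(whole, part)] == 0:
        return 0.0
    return joint[(whole, part)] / (any_whole[part] * any_part[whole])

# Toy matches of the genitive pattern (e.g., "evin kapısı" = door of the house).
matches = [("ev", "kapı"), ("ev", "kapı"), ("ev", "bahçe"), ("araba", "kapı")]
print(assoc("ev", "kapı", matches))  # 2 / (3 * 3) ≈ 0.222
```

The normalization penalizes promiscuous fillers: a part like "kapı" that pairs with many wholes, or a whole that matches many parts, contributes a larger denominator and hence a lower score.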
6 Evaluation

Three types of patterns have been taken into consideration. In the evaluation phase, GP and TDK-P were compared to each other due to their similar approach, and the bootstrapped method was analyzed individually. Furthermore, the results pooled from all patterns were evaluated by means of statistical measurements such as the χ² and IG metrics.

6.1 Analysis of general patterns (GP) vs. dictionary-based patterns (TDK-P)

For each category, we selected the top 30 words from the ranked list and randomly presented them to a user for evaluation. Each category was judged by three people, each rating a word 0/1 for the part–whole relation. Results in Table 5 show the precision scores of the patterns for the first 10, 20, and 30 selections. They indicate that GP are slightly more successful and robust than TDK-P on average: GP have 64.2, 61.8, and 56.6 % precision, while TDK-P have 67.8, 48.9, and 40.7 % for the first 10, 20, and 30 parts, respectively. Moreover, GP are considered more productive than TDK-P; the results in Table 9 show a production capacity of 12.5 for GP and 11.9 for TDK-P on average. We will discuss production capacity in Sect. 7.

Table 5 The performance of GP and TDK-P for the first N selection

GP             N:10   N:20   N:30    TDK-P          N:10   N:20   N:30
Part-of        52     52     54      Group-of       42     44     41.3
Member-of      57.5   53.8   53.3    Member-of      80     73     62.7
Constitute-of  50     46.3   21.7    Amount-of      60     52.5   41.5
Consist-of     83.3   80     74.1    Family-of      38.2   0.0    0.0
Made-of        50     52.5   50      Made-of        77.2   0.0    0.0
Has/have       70     67     62      Consist-of     97.1   91.4   61.6
With           86.7   80.8   81.1    Has/have       80     81.7   77.8
AVG-GP         64.2   61.8   56.6    AVG-TDK-P      67.8   48.9   40.7

At first glance, the most successful results seem to be produced by with (from GP) and by has/have and consist-of (from TDK-P), as shown in Table 11. However, an evaluation based only on precision could be misleading in some cases. Although we are not able to measure a recall value, we propose that recall can be estimated via production capacity, i.e., the number of cases per whole, denoted #ofCpW>1 (freq[whole]>1). The most productive pattern is has/have (TDK-P), with a production ratio of 42.6; this pattern also has a good precision score of 77.8 %. The has/have pattern (GP) has a production ratio of 22.7 and a precision of 62 %, and the pattern member-of (GP) has a production ratio of 21.1 and a precision of 53.3 %. Table 9 suggests that the has/have pattern (TDK-P) gives a promising result in terms of both precision and recall (production capacity), and hence F-measure. The highest precision of 81.1 % is achieved by the pattern with (GP); however, its capacity of 11.7 is relatively lower than that of the patterns discussed above. The worst patterns are made-of (GP), made-of (TDK-P), constitute-of (GP), and family-of (TDK-P), with production capacities of 6.5, 6.6, 4.1, and 5.6 and precision rates of 28.3, 25.7, 21.7, and 13 %, respectively. They showed very poor performance in many aspects.

6.2 Analysis of bootstrapped patterns

Table 6 shows an evaluation of patterns based on three metrics, their idf-weighted counterparts, and the baseline. The most successful formulas are Dice-idf and Dice; Dice-idf has precision values of 72, 67, and 64 % for selections 10, 20, and 30, respectively. While the baseline algorithm achieves the same performance as the Dice metric for selection 10, Dice outperforms it for the further selections. Pmi is the second most successful metric, surpassing the baseline algorithm at selections 20 and 30.

Table 6 The results of three metrics

#ofP  Pmi  Pmi-idf  Dice  Dice-idf  T-score  T-score-idf  Baseline  Avg
10    64   50       68    72        54       52           68        61.1
20    63   50       67    67        49       47           57        57.1
30    56   44       64    64        44       47.3         50.6      52.9

#ofP number of parts

Comparing the data in Table 11, we observe that bootstrapped patterns are comparable with the pre-defined pattern clusters, even though the bootstrapping system does not take any patterns as input but only a list of correct pairs. This characteristic has two important aspects. One is domain independence: the system can be applied to any corpus or domain (a different language, or a particular field such as medicine or biology). The second advantage of bootstrapping is that the model can be run for any arbitrary whole; the wholes are not selected by analyzing the potential output, as they are for pre-defined patterns. For GP and TDK-P, on the contrary, the wholes must be selected first and then evaluated depending on the capacity of the patterns.

6.3 Distinctive parts vs. general parts

In this study, parts are categorized and evaluated in two groups: distinctive and general. If a part is inheritable from hypernyms of its whole, it may be defined as "general" or "inheritable"; otherwise, it is defined as a "specific" or "distinctive" part. A distinctive part is thus a part of a whole that is not hierarchically inherited. For example, a desk has a Has-Part relationship with drawer and segment, as in WordNet. While drawer is a distinctive part of desk, segment is a general part inherited from its hypernym artifact. Some parts are general, like point, side, and segment, and can be inherited from an upper physical entity; others are distinctive, like the kitchen of a house. To distinguish such parts, we utilized idf. We observed that the most frequent part instances, for example top, inside, segment, side, front, and head, all resemble one another.

We evaluate the distinction problem through bootstrapped patterns because of their production capacity, simplicity, and quick evaluation; similar results can be obtained through the other, pre-defined patterns as well. Table 7 shows the performance of Pmi, Dice, T-score, their idf-weighted counterparts, and the baseline in terms of distinctiveness. There are two clear observations here: (1) idf-weighted metrics are better than the others, as expected. By definition, idf can discriminate particular parts, because low-frequency terms have higher idf values and can thus represent distinctive parts. (2) Dice-based formulas outperform the other two metrics, Pmi and T-score. Additionally, Table 7 indicates that the only metrics that surpass the baseline algorithm are Dice and its idf counterpart.

Table 7 The results for distinctive parts

#ofP  Pmi   Pmi-idf  Dice  Dice-idf  T-score  T-score-idf  Baseline  Avg
10    50    50       58    64        34       44           60        51.4
20    48    50       48    53        34       40           51        46.3
30    40.7  42.7     47.3  49.3      31.3     38.7         40.7      41.5

#ofP number of parts

6.4 Statistical measurements

Table 8 shows the performance of a list of statistical metrics on the whole-by-part global data. The table has three bunches: the first gives the results for data obtained from GP, the second for TDK-P, and the third for the integrated data. Under each bunch, scores from the IG, χ², T-score, Dice, and frequency (baseline) approaches are represented.

Table 8 Statistical measurements for GP vs. TDK-P

Patterns  SM           10    20    30
GP        IG           66.7  65    58.9
          χ²           44.4  36.1  35.6
          T-score      74.4  70.6  66.3
          Dice         66.7  59.4  54.8
          Freq         48.9  43.3  41.5
TDK-P     IG           70    62.8  58.1
          χ²           55.6  45    43
          T-score      72.2  68.3  61.5
          Dice         70    65.6  61.1
          Freq         63.3  57.8  55.9
AVG       AVG-IG       68.3  63.9  58.5
          AVG-χ²       50    40.6  39.3
          AVG-T-score  73.3  69.4  63.9
          AVG-Dice     68.3  62.5  58
          AVG-Freq     56.1  50.6  48.7

Freq frequency

For the bunch GP, the ranking is T-score > IG > Dice > Freq > χ². For TDK-P, it is T-score > Dice > IG > Freq > χ², which is akin to GP with IG and Dice swapped. The most important observation is that T-score shows the best performance, whereas χ² does not even outperform the baseline algorithm within each bunch. T-score, IG, and the Dice formula are the most successful metrics.
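As a sketch of how the raw counts of the global whole-by-part matrix can be converted into weighted cells, the following computes a simplified, one-term χ²-style score per cell. The tiny matrix and the one-term simplification are our illustrative assumptions, not the paper's data or exact formula:

```python
# Toy global whole-by-part count matrix (rows: wholes, cols: parts).
# The counts are illustrative, not the paper's data.
wholes = ["ev", "araba"]    # house, car
parts = ["kapı", "motor"]   # door, engine
counts = [[8, 1],
          [4, 7]]

def cell_weight(i, j):
    """One-term chi-square-style weight: squared deviation of the observed
    cell count from the count expected if whole i and part j were independent."""
    total = sum(map(sum, counts))
    row = sum(counts[i])
    col = sum(r[j] for r in counts)
    expected = row * col / total
    return (counts[i][j] - expected) ** 2 / expected

for i, w in enumerate(wholes):
    for j, p in enumerate(parts):
        print(w, p, round(cell_weight(i, j), 3))
```

A cell whose observed count deviates strongly from the independence expectation receives a high weight, which is exactly the "meaningful link" intuition behind applying χ² or IG to the global matrix.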
The main advantage of statistical selection is that it integrates results coming from heterogeneous patterns, where each pattern has a different success rate, production capacity, and tendency toward a meronymy subtype, e.g., attachment or possession. Merging the output of all patterns can increase the recall of the model and cover many more wholes, because each single pattern has its own potential wholes and tendency. Some patterns cannot take the whole as a parameter; we evaluated those pre-defined patterns on whole terms produced in advance. The difference in success ratio between the patterns could therefore be compared from various aspects. Looking at Table 8, the model proposed here gives promising results in terms of both precision and recall.

7 Production capacity and recall estimation

Table 9 shows the number of cases, the number of wholes proposed by each pattern, and their success rates. We also select only those wholes whose frequency is higher than 1, to decrease the error rate coming from false matches. At first glance, the most successful pattern is with (GP) when ranking by precision for the first 30 selections. Production capacity, denoted by #ofCpW>1, and success ratio can be combined to evaluate the patterns from different aspects, where production capacity refers not to how many cases a pattern matched in total, but to the number of cases matched per whole on average. By multiplying the success rate by the normalized value of #ofCpW>1 (the number of cases per whole whose frequency is greater than 1), we obtained another ranking factor combining both precision and production capacity. The resulting priority of patterns is has/have (TDK-P), has/have (GP), member-of (GP), with (GP), and part-of (GP). The pattern has/have (TDK-P) has both a good production rate of 42.6 and a precision rate of 77.8 %, and therefore appears in first place in the new combined ranking.
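The combined ranking factor described above can be reproduced from the Table 9 figures. Max-normalization of #ofCpW>1 is our assumption here, since the exact normalization is not spelled out:

```python
# Combined ranking factor sketch: a pattern's precision (S30) multiplied
# by its normalized production capacity (#ofCpW>1). Values from Table 9;
# max-normalization is an assumption.
patterns = {
    "has/have (TDK-P)": (77.8, 42.6),   # (precision %, #ofCpW>1)
    "has/have (GP)":    (62.0, 22.7),
    "member-of (GP)":   (53.3, 21.1),
    "with (GP)":        (81.1, 11.7),
    "part-of (GP)":     (54.0, 13.8),
}

max_cap = max(cap for _, cap in patterns.values())
ranked = sorted(patterns.items(),
                key=lambda kv: kv[1][0] * kv[1][1] / max_cap,
                reverse=True)
for name, (prec, cap) in ranked:
    print(f"{name}: {prec * cap / max_cap:.1f}")
```

Under this normalization, the resulting order matches the priority reported in the text: has/have (TDK-P) first, followed by has/have (GP), member-of (GP), with (GP), and part-of (GP).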
The poorest patterns are family-of and amount-of (TDK-P) and constitute-of (GP) according to the new ranking factor. Another evaluation can be made over the correlation between success (precision) and factors such as the number of cases, wholes, cases per whole, and others. Looking at the correlations in Table 10, the success of a pattern depends most strongly on the number of unique wholes it produces; second come #ofW>1 and #ofCpW>1. This finding is worth analyzing deeply. The number of cases matched by a given pattern is of secondary importance; the essential point is the number of unique wholes, and of cases per whole, that a pattern can extract. As shown in Table 9, some patterns, e.g., made-of (GP, TDK-P) and amount-of (TDK-P), have good matching capacity but a poor #ofCpW>1 score. Such scattered patterns do not show significant performance. Briefly, the most successful pattern is with (GP), with a precision of 81.1 % as shown in Table 11; about 68.8K relations are extracted by that pattern. For comparison, there is no other corpus-based study for the Turkish language. There are a few dictionary-based studies, as mentioned in Sect. 2: [30] achieved 50 % precision on average with about 1K relations; [29] applied four meronym-related patterns and obtained 87 % precision with a production size of about 3K; [28] achieved 79.6 % precision with a total relation size of about 1.7K.
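The correlation figures in Table 10 are standard Pearson coefficients between a pattern attribute and its precision. A minimal sketch, using abridged Table 9 values (so the exact coefficients of Table 10 will not be reproduced), looks like:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Success rate (S30) against number of unique wholes (in thousands),
# abridged from Table 9 for illustration.
s30 = [81.1, 77.8, 74.1, 62.7, 62.0, 54.0, 53.3, 41.3, 28.3, 13.0]
n_wholes = [8.7, 13.7, 2.0, 3.9, 8.2, 2.4, 2.0, 3.6, 1.7, 0.047]
print(round(pearson(s30, n_wholes), 2))
```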
Table 9 Ranked by success rate of each pattern

C      P              #ofC    #ofW   #ofW>1  #ofC>1  #ofCpW>1  S30
GP     With           68.8K   8.7K   5.6K    65.7K   11.7      81.1
TDK-P  Has/have       445K    13.7K  10.3K   442K    42.6      77.8
GP     Consist-of     9.2K    2K     1K      8.2K    8.1       74.1
TDK-P  Member-of      20K     3.9K   2K      18.2K   8.6       62.7
GP     Has/have       12K     8.2K   5.1K    117K    22.7      62
TDK-P  Consist-of     12.4K   2.7K   1.4K    11K     7.8       61.6
GP     Part-of        19.3K   2.4K   1.3K    18.1K   13.8      54
GP     Member-of      23K     2K     1K      22K     21.1      53.3
TDK-P  Amount-of      3.4K    1.4K   5.5K    7.5K    1.4       41.5
TDK-P  Group-of       22.7K   3.6K   2K      21K     10.9      41.3
GP     Made-of        6.3K    1.7K   836     5.4K    6.5       28.3
TDK-P  Made-of        4.9K    1.4K   612     4K      6.6       25.7
GP     Constitute-of  598     293    97      402     4.1       21.7
TDK-P  Family-of      184     47     30      167     5.6       13

C cluster, P pattern, #ofC number of cases, #ofW number of wholes, #ofW>1 number of wholes whose frequency is greater than 1, #ofC>1 number of cases whose wholes are seen more than once, #ofCpW>1 number of cases per whole whose frequency is greater than 1, S30 success rate for the first 30 candidates

Table 10 Correlation table

Correlation  Success rate
#ofCases     0.49
#ofWhole     0.71
#ofW>1       0.60
#ofC>1       0.48
#ofCpW>1     0.54

As noticed before, the main advantage of our approach is its production capacity: the model proposed here can capture over 68K relations from a corpus of 500M tokens. For a fairer comparison, all inputs and conditions would need to be balanced in terms of size and quality. Considering that studies in other languages can utilize WordNet and larger lexical resources, they have a resource advantage; Turkish does not have such rich language resources, which makes our study harder and puts it at a disadvantage. Checking the performance scores of the studies done in all languages so far, we observe that our approach shows sufficient performance in spite of this disadvantage. For a gold-standard comparison, we would need to apply our approach to similar data in a similar environment.
A language-independent model might be designed for such a comparison. Nevertheless, we can briefly look at other studies' success rates to understand the significance of the work proposed here, without directly comparing performance results. As one of the first important studies, [22] obtained 70 % precision with a limited example set. [23] conducted experiments on over 100K sentences, used semantic relations from WordNet, and achieved an 83 % success rate. [24] and [17] achieved scores of 74 and 80 %, respectively.

Table 11 Best of patterns GP vs. TDK-P

#ofParts  GP with  TDK consist-of  TDK has/have  GP T-score  TDK T-score  Bootstrap Dice-idf
10        86.7     97.1            80.5          74.4        72.2         72
20        80.8     91.4            81.7          70.6        68.3         67
30        81.1     61.6            77.8          66.3        61.5         64

8 Conclusions

We utilized and adopted LSPs to disclose the meronymy relation from a Turkish corpus. Two different approaches were considered to prepare patterns: the first is based on pre-defined patterns taken from the literature, and the second automatically produces patterns by means of a bootstrapping method. Pre-defined patterns fall into two clusters, general and dictionary-based patterns; bootstrapped patterns form a third cluster. We also used a statistical method on global data obtained from the results of all patterns in the three clusters. After morphologically parsing a huge corpus, all patterns were realized as specific regular expressions over the parsed corpus. Each pattern is designed so that we can separately pick up a whole and the candidate parts to be proposed for it. With a variety of experiments, we addressed some problems, concluded a list of facts, and achieved successful results for the Turkish meronymy problem. Analyzing the general and dictionary-based patterns, we can say that an appropriate pattern design is capable of solving the meronymy problem. Several significant findings of the study are reported in the corresponding sections.
Some of them can be listed as follows:

• Even though dictionary-based patterns are, by definition, best suited to a dictionary-like corpus, they have good and comparable potential to extract part–whole pairs from a general corpus; general patterns are slightly better. Particular patterns from both clusters, GP and TDK-P, have a good indicative capacity in terms of production and precision, as shown in the paper.
• The bootstrapping approach first retrieves reliable patterns, then extracts and proposes part candidates for a given whole. Its results are comparable to those of the pre-defined patterns. It is also domain independent and has good production capacity, so it can easily be applied to other relation problems.
• Instead of applying each pattern one by one, the results of all patterns can be merged as input for statistical methods. The global whole-by-part matrix is measured by means of several statistics such as IG, χ², etc. The results indicate behavior very similar to that of the bootstrapped patterns, with results comparable to the pre-defined list. Moreover, statistical selection and bootstrapping have large scale and good production capacity.
• The production capacity of a pattern, denoted by #ofCpW>1, refers to how many cases are matched per whole on average. That capacity and the success ratio can be combined to evaluate the proposed patterns. Even though some patterns seem to have good accuracy, they have low production capacity, so their output covers a limited number of wholes. We therefore evaluated the success of the patterns not only by precision but also by a combined ranking factor taking #ofCpW>1 and success rate as parameters.
• No matter which cluster a pattern belongs to, a pattern that produces a higher number of unique wholes shows better performance. We checked which pattern characteristics correlate highly with the success rate (precision): the correlation table indicates that the success of a pattern depends mostly on the number of unique wholes it produces, and second on the average number of cases per whole.

As a final remark, all experiments show that the proposed methods have good indicative capacity for the meronymy problem, because each method outperforms its corresponding baseline algorithm, as shown in the corresponding tables. To the best of our knowledge, there is no other such comprehensive corpus-based experiment for building a semantic lexicon for the Turkish language.

Appendix

See Tables 12 and 13.

Table 12 General patterns (GP) and their Turkish equivalents

GP                        Turkish equivalent of GP
NPx is (a|-) part of NPy  ...NPx...NPy + gen...(bir)? parçasıdır/kısmıdır
                          ...NPy + gen...parça/kısım(ları|sı|ı)...NPx...
                          ...NPx...NPy + gen...parça/kısım(larından|sından|ından) biridir
                          NPy + gen...parça/kısım(larından|sından|ından) biri olan ...NPx
NPx member of NPy         ...NPx...NPy + gen...(bir)? üyesidir
                          ...NPx...(bir)? NPy + nom...üyesidir
                          ...NPx...NPy + gen...üye(lerinden|sinden) biridir
                          NPy + gen...üye(lerinden|sinden) biri olan ...NPx
NPy constituted of NPx    ...NPx...NPy + gen...bileşen(lerinden|inden) biridir
                          NPy + gen...bileşen(lerinden|inden) biri olan ...NPx
                          ...NPx...NPy + gen...(bir)? bileşenidir
                          ...NPy + gen...bileşen(leri|i)...NPx...
NPy made of NPx           NPy,...NPx + abl yapıl(mıştır|maktadır)
                          NPy,...NPx + abl yapılmış olup
                          ...NPx + abl yapılan NPy
NPy consist of NPx        NPy,...NPx...içerir
Has/have                  NPy + gen...NPx + p3sg + nom...(vardır|var)
                          NPx + p3sg + nom var olan NPy
                          NPy + pnon + loc NPx (var|vardır)
With                      NPx + p3sg + nom olan NPy

Table 13 Dictionary-based patterns (TDK-P) and their Turkish equivalents

TDK-P                                            Turkish equivalent of TDK-P
NPy,...(whole|group|all|set|flock|union) of NPx  NPy,...NPx + (gen|nom) bütünü(dür|-)
                                                 NPy,...NPx + (gen|nom) topluluğu(dur|-)
                                                 NPy,...NPx + (gen|nom) tümü(dür|-)
                                                 NPy,...NPx + (gen|nom) birliği(dir|-)
                                                 NPy,...NPx + (gen|nom) kümesi(dir|-)
                                                 NPy,...NPx + (gen|nom) sürüsü(dür|-)
NPy,...(class|member|team) of NPx                NPy,...NPx + (gen|nom) sınıfı(dır|-)
                                                 NPy,...NPx + (gen|nom) üyesi(dir|-)
                                                 NPy,...NPx + (gen|nom) takımı(dır|-)
NPx,...family of NPy                             NPx, NPy + gillerden
                                                 NPy + gillerden...NPx
NPy,...(amount|measure|unit) of NPx              NPy,...NPx + (gen|nom) miktarı(dır|-)
                                                 NPy,...NPx + (gen|nom) ölçüsü(dür|-)
                                                 NPy,...NPx + (gen|nom) birimi(dir|-)
NPy consist of NPx                               NPx + abl oluş(an|muş) NPy
                                                 NPy,...NPx + abl oluşmuştur
NPy made of NPx                                  NPx + abl...yapıl(an|mış) NPy
Has/have                                         NPx + nom-adj-with NPy

References

1. Cruse AD (2003) The lexicon. In: Aronoff M, Ress-Miller J (eds) The handbook of linguistics. Blackwell Publishers Ltd., Oxford, pp 238–264
2. Keet CM, Artale A (2008) Representing and reasoning over a taxonomy of part–whole relations. Appl Ontol 3(1–2):91–110
3. Pribbenow S (2002) Meronymic relationships: from classical mereology to complex part–whole relations. In: Green R, Bean CA, Myaeng SH (eds) The semantics of relationships. Springer, Netherlands, pp 35–50
4. Croft W, Cruse D (2004) Cognitive linguistics. Cambridge University Press, Cambridge
5. Simons P (1987) Parts: a study in ontology. Oxford University Press, UK
6. Gerstl P, Pribbenow S (1995) Midwinters, end games, and body parts: a classification of part–whole relations. Int J Hum–Comput Stud 43(5–6):865–889
7. Iris MA, Litowitz BE, Evens M (1988) Problems of the part–whole relation. In: Evens M (ed) Relational models of the lexicon. Cambridge University Press, Cambridge, pp 261–288
8. Winston ME, Chaffin R, Herrmann D (1987) A taxonomy of part–whole relations. Cogn Sci 11(4):417–444
9. Miller GA et al (1990) Introduction to WordNet: an on-line lexical database. Int J Lexicogr 3(4):235–244
10. Murphy ML (2003) Semantic relations and the lexicon: antonymy, synonymy, and other paradigms. Cambridge University Press, UK
11. Artale A, Franconi E, Guarino N, Pazzi L (1996) Part–whole relations in object-centered systems: an overview. Data Knowl Eng 20(3):347–383
12. Girju R, Badulescu A, Moldovan D (2006) Automatic discovery of part–whole relations. Comput Linguist 32(1):83–135
13. Hamon T, Natalia G (2008) How can the term compositionality be useful for acquiring elementary semantic relations? In: Nordström B, Ranta A (eds) Advances in natural language processing, LNCS 5221. Springer, Berlin, Heidelberg, pp 181–192
14. Roberts A (2005) Learning meronyms from biomedical text. In: Proceedings of the ACL student research workshop (ACLstudent '05). Association for Computational Linguistics, Stroudsburg, PA, USA, pp 49–54
15. Ling X, Clark P, Weld DS (2013) Extracting meronyms for a biology knowledge base. In: Proceedings of the 2013 workshop on automated knowledge base construction (AKBC '13). ACM, USA, pp 7–12
16. Ittoo A, Bouma G, Maruster L, Wortmann H (2010) Extracting meronymy relationships from domain specific, textual corporate databases. In: Hopfe CJ, Rezgui Y, Metais E, Preece AD, Li H (eds) Natural language processing and information systems, LNCS 6177. Springer, Berlin, pp 48–59
17. Pantel P, Pennacchiotti M (2006) Espresso: leveraging generic patterns for automatically harvesting semantic relations. In: Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the Association for Computational Linguistics. Sydney, Australia, pp 113–120
18. Vor der Bruck T, Helbig H (2010) Meronymy extraction using an automated theorem prover. J Lang Technol Comput Linguist 25(1):57–82
19. Vor der Bruck T, Helbig H (2010) Validating meronymy hypotheses with support vector machines and graph kernels. In: Proceedings of the 2010 ninth international conference on machine learning and applications (ICMLA '10). IEEE Computer Society, Washington, DC, USA, pp 243–250
20. Xia F, Cungen C (2014) Extracting part–whole relations from online encyclopedia. In: Shi Z, Wu Z, Leake D, Sattler U (eds) Intelligent information processing VII.
IFIP advances in information and communication technology, vol 432. Springer, Berlin, Heidelberg, pp 57–66
21. Hearst MA (1992) Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th international conference on computational linguistics, COLING 1992. Nantes, France, pp 539–545
22. Berland M, Charniak E (1999) Finding parts in very large corpora. In: Proceedings of the 37th annual meeting of the Association for Computational Linguistics, USA, pp 57–64
23. Girju R, Badulescu A, Moldovan D (2003) Learning semantic constraints for the automatic discovery of part–whole relations. In: Proceedings of the human language technology conference of the North American chapter of the Association for Computational Linguistics. Edmonton, Canada, pp 1–8
24. Van HWR, Kolb H, Schreiber G (2006) A method for learning part–whole relations. In: Cruz IF, Decker S, Allemang D, Preist C, Schwabe D, Mika P, Uschold M, Aroyo L (eds) International semantic web, LNCS 4273. Springer, Berlin, pp 723–735
25. Ittoo A, Bouma G (2010) On learning subtypes of the part–whole relation: do not mix your seeds. In: Proceedings of the 48th annual meeting of the Association for Computational Linguistics, ACL '10. Association for Computational Linguistics, Uppsala, Sweden, pp 1328–1336
26. Cao X, Cao C, Wang S, Lu H (2008) Extracting part–whole relations from unstructured Chinese corpus. In: Proceedings of the 2008 5th international conference on fuzzy systems and knowledge discovery, pp 175–179
27. Yao T, Uszkoreit H (2005) Identifying semantic relations between named entities from Chinese texts. In: Lu R, Siekmann JH, Ullrich C (eds) Proceedings of the 2005 joint Chinese-German conference on cognitive systems, LNCS 4429. Springer-Verlag, Berlin, Heidelberg, pp 70–83
28. Orhan Z, Pehlivan I, Uslan V, Onder P (2011) Automated extraction of semantic word relations in Turkish lexicon. Math Comput Appl 16(1):13–22
29.
Serbetçi A, Orhan Z, Pehlivan I (2011) Extraction of semantic word relations in Turkish from dictionary definitions. In: Proceedings of the ACL 2011 workshop on relational models of semantics, RELMS 2011. Portland, Oregon, USA, pp 11–18
30. Yazıcı E, Amasyalı MF (2011) Automatic extraction of semantic relationships using Turkish dictionary definitions. EMO Bilimsel Dergi, İstanbul
31. Yıldız T, Yıldırım S, Diri B (2013) Extraction of part–whole relations from Turkish corpora. In: Gelbukh A (ed) Computational linguistics and intelligent text processing, LNCS 7816. Springer, Berlin, Heidelberg, pp 126–138
32. Yıldız T, Diri B, Yıldırım S (2014) Analysis of lexico-syntactic patterns for meronym extraction from a Turkish corpus. In: 6th language and technology conference: human language technologies as a challenge for computer science and linguistics, LTC, Poland, pp 429–433
33. Sak H, Güngör T, Saraçlar M (2008) Turkish language resources: morphological parser, morphological disambiguator and web corpus. In: Nordström B, Ranta A (eds) Advances in natural language processing, LNCS 5221. Springer-Verlag, Berlin, Heidelberg, pp 417–427