research-article

Lexicon-Grammar based open information extraction from natural language sentences in Italian

Published: 01 April 2020
  • Highlights

    An OIE approach for the Italian language, based on verb behavior patterns.
    Verb behavior patterns built by combining Lexicon-Grammar and distributional profiles.
    Extraction of n-ary propositions from elementary sentences exploiting verb patterns.
    Preservation of grammaticality and a first level of acceptability in extractions.
    Validation of effectiveness on a gold standard dataset built for OIE in Italian.

    Abstract

    In the last decade, the quantity of readily accessible text has grown rapidly and enormously, long exceeding the capacity of humans to read and understand it. One of the most interesting strategies proposed to cope with this volume is known as Open Information Extraction (OIE). It is devised to read in sentences and rapidly extract one or more domain-independent coherent propositions, each represented by a verb relation and its arguments. Even though many OIE approaches exist for English, no significant research has been conducted on OIE for Italian texts. Because they rely on language-specific features, OIE systems operating in other languages are not directly applicable to Italian. Therefore, this paper proposes, as its first contribution, a novel approach to perform OIE for the Italian language, based on standard linguistic structures to analyze sentences and on a set of verbal behavior patterns to extract information from them. These patterns are built by combining a solid linguistic theoretical framework, i.e. Lexicon-Grammar (LG), with distributional profiles extracted from a contemporary Italian corpus, i.e. itWaC. Starting from simple sentences, the approach first determines elementary tuples, then all their permutations obtained by adding complements and adverbials, and finally n-ary propositions. In doing so it grants syntactic invariance, preserves the overall grammaticality, and respects some syntactic constraints and selection preferences, thus approximating a first level of semantic acceptability. As the second contribution of this work, a gold standard dataset for the Italian language has been built from the itWaC corpus, intended to be widely used to enable the experimental validation of OIE solutions. It has been manually and independently labeled by four native Italian speakers with all the n-ary propositions that can be extracted, following the criteria of grammaticality and acceptability, i.e. granting syntactic well-formedness and meaningfulness in the context. Finally, the proposed approach has been tested and quantitatively validated on this gold standard dataset. It has been compared both with an indirect approach, which translates input sentences from Italian to English, applies an OIE approach for English, and translates the output propositions back to Italian, and with an OIE system for Italian previously presented by the authors. The results obtained show the effectiveness of the proposed approach in generating propositions with respect to these criteria of grammaticality and acceptability. Although the approach has been evaluated for the Italian language, it is essentially based on linguistic resources produced by LG, which exist for many languages besides Italian, and on a representative corpus for the language under consideration. Given these premises, it has a general methodological basis and can be profitably extended to other languages.
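    The progression from elementary tuples to n-ary propositions can be pictured with a small sketch. It is only an illustration under stated assumptions, not the authors' implementation: the `VerbPattern` class, the slot names (`N0`, `N1`, `a N2`, `ADV`), and the `extract_propositions` function are hypothetical, and the expansion step simply enumerates order-preserving subsets of the optional slots rather than applying real Lexicon-Grammar constraints or itWaC distributional profiles.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class VerbPattern:
    """Hypothetical verb behavior pattern (an assumption, not the paper's format)."""
    verb: str                    # surface form of the verb relation
    required: Tuple[str, ...]    # slots every extraction must fill, subject first
    optional: Tuple[str, ...]    # complements/adverbials that may be added


def extract_propositions(pattern: VerbPattern,
                         fillers: Dict[str, str]) -> List[Tuple[str, ...]]:
    """Return the elementary tuple, then every expansion with optional slots.

    `fillers` maps slot names to the text spans found in one simple sentence.
    """
    # Without all required arguments there is no elementary tuple to build on.
    if any(slot not in fillers for slot in pattern.required):
        return []

    subject, *objects = (fillers[slot] for slot in pattern.required)
    base = (subject, pattern.verb, *objects)

    # Only optional slots actually present in the sentence can be added.
    available = [slot for slot in pattern.optional if slot in fillers]

    propositions = []
    for size in range(len(available) + 1):
        # combinations() keeps the pattern's slot order, so word order is stable.
        for combo in combinations(available, size):
            propositions.append(base + tuple(fillers[slot] for slot in combo))
    return propositions


# Toy sentence: "Maria ha regalato un libro a Luca ieri"
pattern = VerbPattern(verb="ha regalato",
                      required=("N0", "N1"),
                      optional=("a N2", "ADV"))
fillers = {"N0": "Maria", "N1": "un libro", "a N2": "a Luca", "ADV": "ieri"}

for proposition in extract_propositions(pattern, fillers):
    print(proposition)
```

    For the toy sentence this yields the elementary tuple (Maria, ha regalato, un libro) followed by its three expansions with "a Luca", "ieri", or both. A real system would additionally filter these candidates against syntactic constraints and selection preferences, which the sketch leaves out.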

    Cited By

    • (2024) OIE4PA: open information extraction for the public administration, Journal of Intelligent Information Systems 62:1 (273-294), doi:10.1007/s10844-023-00814-z. Online publication date: 1-Feb-2024
    • (2022) DptOIE: a Portuguese open information extraction based on dependency analysis, Artificial Intelligence Review 56:7 (7015-7046), doi:10.1007/s10462-022-10349-4. Online publication date: 5-Dec-2022

        Published In

        Expert Systems with Applications: An International Journal  Volume 143, Issue C
        Apr 2020
        425 pages

        Publisher

        Pergamon Press, Inc.

        United States

        Author Tags

        1. Open information extraction
        2. Lexicon-Grammar
        3. n-ary propositions
        4. Natural language processing
        5. Italian language

        Qualifiers

        • Research-article
