Abstract
Information Extraction (IE) is a pervasive task in industry that automatically obtains structured data from natural-language documents. Current software systems for this task are able to extract a large percentage of the required information, but they do not usually focus on the quality of the extracted data. In this paper we present an approach for validating and improving the quality of the results of an IE system. Our proposal is based on ontologies that store domain knowledge, which we leverage to detect and resolve consistency errors in the extracted data. We have implemented our approach to run against the output of AIS, an IE system specialized in analyzing legal documents, and we have tested it on a real dataset. Preliminary results confirm the value of our approach.
Notes
- 6. A Named Entity (NE) is a unique identifier of an entity in a text; e.g., 'Marie Curie' is an NE of a person.
- 7. R2RML: RDB to RDF Mapping Language, https://www.w3.org/TR/r2rml/.
- 8. This example is taken directly from the experimental dataset. Proper names and specific data have been altered for privacy reasons.
- 12. In Spain, people have a first name, an optional middle name, and two mandatory last names: the first is the father's family name and the second is the mother's family name, although legally this order can be interchanged.
Acknowledgements
This research work has been supported by projects TIN2013-46238-C4-4-R, TIN2016-78011-C4-3-R (AEI/FEDER, UE), and DGA/FEDER.
Copyright information
© 2019 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Buey, M.G., Roman, C., Garrido, A.L., Bobed, C., Mena, E. (2019). Automatic Legal Document Analysis: Improving the Results of Information Extraction Processes Using an Ontology. In: Bembenik, R., Skonieczny, Ł., Protaziuk, G., Kryszkiewicz, M., Rybinski, H. (eds) Intelligent Methods and Big Data in Industrial Applications. Studies in Big Data, vol 40. Springer, Cham. https://doi.org/10.1007/978-3-319-77604-0_24
DOI: https://doi.org/10.1007/978-3-319-77604-0_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77603-3
Online ISBN: 978-3-319-77604-0
eBook Packages: Intelligent Technologies and Robotics (R0)