Abstract
Because Web resources are formatted in diverse ways for human viewing, the accuracy of information extraction remains unsatisfactory, and information extracted by traditional techniques is inconvenient for users to query. This paper proposes WebKER, a wrapper-driven system that extracts knowledge from Chinese Web pages based on domain ontologies. Wrappers are first learned using suffix arrays. A novel HowNet-based approach then automatically aligns the raw data extracted by the wrappers. Next, knowledge is generated and described as Resource Description Framework (RDF) statements and, after being merged, is added to the Knowledge Base (KB). A prototype of WebKER has been implemented. The experiments report the performance of our system and compare querying information stored in the KB with querying information extracted by traditional techniques; the results indicate the superiority of our system. Evaluations of the outstanding wrapper and of the method for merging knowledge are also presented.
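The abstract states only that wrappers are learned through suffix arrays and does not give the procedure. As a rough illustration of why a suffix array suits the task, the sketch below finds the longest token run that repeats in a page's tag stream, which is a plausible candidate for a record template. The tokenization, the function names, and the sample page are assumptions made for this example and are not taken from WebKER.

```python
from typing import List, Sequence, Tuple


def build_suffix_array(tokens: Sequence[str]) -> List[int]:
    """Return the start indices of all suffixes, sorted lexicographically.

    Naive construction (sorting full suffixes); adequate for a short demo,
    not the efficient algorithm of Manber and Myers.
    """
    return sorted(range(len(tokens)), key=lambda i: tuple(tokens[i:]))


def common_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_repeated_pattern(tokens: Sequence[str]) -> Tuple[str, ...]:
    """Longest token run occurring at least twice (occurrences may overlap).

    The longest repeat always shows up as the common prefix of two
    neighbouring entries in the suffix array, so one linear scan suffices.
    """
    sa = build_suffix_array(tokens)
    best_len, best_start = 0, 0
    for prev, cur in zip(sa, sa[1:]):
        k = common_prefix_len(tokens[prev:], tokens[cur:])
        if k > best_len:
            best_len, best_start = k, cur
    return tuple(tokens[best_start:best_start + best_len])


# Hypothetical tag stream for a page that lists two records;
# "#text" stands in for any text node.
page = ["<ul>", "<li>", "<b>", "#text", "</b>", "#text", "</li>",
        "<li>", "<b>", "#text", "</b>", "#text", "</li>", "</ul>"]
pattern = longest_repeated_pattern(page)
```

On this sample stream the function returns the six-token run from `<li>` to `</li>`, that is, the markup shared by both records. A real wrapper learner would also need to handle optional fields, nested repeats, and noise, none of which this sketch attempts.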
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sun, J., Bai, X., Li, Z., Che, H., Liu, H. (2007). Towards a Wrapper-Driven Ontology-Based Framework for Knowledge Extraction. In: Zhang, Z., Siekmann, J. (eds) Knowledge Science, Engineering and Management. KSEM 2007. Lecture Notes in Computer Science, vol 4798. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76719-0_25
DOI: https://doi.org/10.1007/978-3-540-76719-0_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76718-3
Online ISBN: 978-3-540-76719-0
eBook Packages: Computer Science, Computer Science (R0)