Monadic datalog and the expressive power of languages for Web information extraction

Published: 01 January 2004

Abstract

Research on information extraction from Web pages (wrapping) has seen much activity recently (particularly systems implementations), but little work has been done on formally studying the expressiveness of the formalisms proposed or on the theoretical foundations of wrapping. In this paper, we first study monadic datalog over trees as a wrapping language. We show that this simple language is equivalent to monadic second order logic (MSO) in its ability to specify wrappers. We believe that MSO has the right expressiveness required for Web information extraction and propose MSO as a yardstick for evaluating and comparing wrappers. Along the way, several other results on the complexity of query evaluation and query containment for monadic datalog over trees are established, and a simple normal form for this language is presented. Using the above results, we subsequently study the kernel fragment Elog⁻ of the Elog wrapping language used in the Lixto system (a visual wrapper generator). Curiously, Elog⁻ exactly captures MSO, yet is easier to use. Indeed, programs in this language can be entirely visually specified.
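
To make the idea of monadic datalog as a wrapping language concrete, the following is a minimal sketch in Python. It is an illustration under stated assumptions, not the construction or the algorithms of the paper: the tree encoding (unary predicates `label_<tag>`, `root`, `leaf` and binary predicates `firstchild`, `nextsibling`), the helper names `tree_facts` and `evaluate`, and the sample document are all hypothetical, and the evaluator is a naive fixpoint loop with no claim about its complexity. The only property it is meant to show is that every rule head is a unary predicate, so a program selects sets of tree nodes.

```python
from itertools import count

def tree_facts(tree):
    """Flatten a nested (label, children) tree into unary and binary facts."""
    unary, binary = {}, {"firstchild": set(), "nextsibling": set()}
    ids = count()

    def visit(node):
        label, children = node
        n = next(ids)  # node ids are assigned in document (preorder) order
        unary.setdefault("label_" + label, set()).add(n)
        if not children:
            unary.setdefault("leaf", set()).add(n)
        prev = None
        for child in children:
            c = visit(child)
            if prev is None:
                binary["firstchild"].add((n, c))
            else:
                binary["nextsibling"].add((prev, c))
            prev = c
        return n

    unary.setdefault("root", set()).add(visit(tree))
    return unary, binary

def evaluate(rules, unary, binary):
    """Naive least-fixpoint evaluation; every rule head is a unary atom."""
    derived = {p: set(s) for p, s in unary.items()}

    def matches(body, env):
        # Enumerate variable bindings that satisfy all atoms in the body.
        if not body:
            yield env
            return
        (pred, args), rest = body[0], body[1:]
        if len(args) == 1:
            rel = {(n,) for n in derived.get(pred, ())}
        else:
            rel = binary[pred]
        for tup in rel:
            new = dict(env)
            if all(new.setdefault(v, val) == val for v, val in zip(args, tup)):
                yield from matches(rest, new)

    changed = True
    while changed:
        changed = False
        for (head_pred, head_var), body in rules:
            for env in list(matches(body, {})):
                target = derived.setdefault(head_pred, set())
                if env[head_var] not in target:
                    target.add(env[head_var])
                    changed = True
    return derived

# Hypothetical wrapper: select every td node that is a child of a tr node.
# in_tr(x) <- label_tr(y), firstchild(y, x).
# in_tr(x) <- in_tr(y), nextsibling(y, x).
# cell(x)  <- in_tr(x), label_td(x).
rules = [
    (("in_tr", "x"), [("label_tr", ("y",)), ("firstchild", ("y", "x"))]),
    (("in_tr", "x"), [("in_tr", ("y",)), ("nextsibling", ("y", "x"))]),
    (("cell", "x"), [("in_tr", ("x",)), ("label_td", ("x",))]),
]

# Hypothetical document tree: a table with two rows.
doc = ("table", [
    ("tr", [("td", []), ("td", [])]),
    ("tr", [("th", []), ("td", [])]),
])

unary, binary = tree_facts(doc)
result = evaluate(rules, unary, binary)
print(sorted(result["cell"]))  # [2, 3, 6]
```

Because intensional predicates are unary, the "child of" relation cannot be stored directly; the sketch instead marks the first child of each `tr` node and propagates the mark along `nextsibling`, which is one common way to express such a selection with monadic rules.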

Published In

Journal of the ACM  Volume 51, Issue 1
January 2004
113 pages
ISSN:0004-5411
EISSN:1557-735X
DOI:10.1145/962446
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Complexity
  2. HTML
  3. MSO
  4. expressiveness
  5. information extraction
  6. monadic datalog
  7. regular tree languages
  8. web wrapping

Qualifiers

  • Article
