article

Reasoning about Textual Similarity in a Web-Based Information Access System

Author:

William W. Cohen

Autonomous Agents and Multi-Agent Systems, Volume 2, Issue 1

Pages 65 - 86

https://doi.org/10.1023/A:1010031208520

Published: 01 March 1999

Abstract

The degree to which information sources are pre-processed by Web-based information systems varies greatly. In search engines like Altavista, little pre-processing is done, while in “knowledge integration” systems, complex site-specific “wrappers” are used to integrate different information sources into a common database representation. In this paper we describe an intermediate point between these two models. In our system, information sources are converted into a highly structured collection of small fragments of text. Database-like queries to this structured collection of text fragments are approximated using a novel logic called WHIRL, which combines inference in the style of deductive databases with ranked retrieval methods from information retrieval (IR). WHIRL allows queries that integrate information from multiple Web sites, without requiring the extraction and normalization of object identifiers that can be used as keys; instead, operations that in conventional databases require equality tests on keys are approximated using IR similarity metrics for text. This leads to a reduction in the amount of human engineering required to field a knowledge integration system. Experimental evidence is given showing that many information sources can be easily modeled with WHIRL, and that inferences in the logic are both accurate and efficient.
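
The idea of replacing an equality test on keys with a text-similarity score can be illustrated with a small sketch. It is not the WHIRL implementation: it assumes TF-IDF weighting with cosine similarity as one common IR metric, and the function names (`tokenize`, `tfidf_vectors`, `cosine`, `similarity_join`) and the sample company names are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text fragment and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(fragments):
    """Map each fragment to a unit-length TF-IDF vector (dict: term -> weight)."""
    token_lists = [tokenize(f) for f in fragments]
    n = len(token_lists)
    # Document frequency: in how many fragments each term appears.
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vec = {
            t: (1 + math.log(c)) * math.log(1 + n / doc_freq[t])
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def similarity_join(left, right, top_k=3):
    """Rank (left, right) pairs by similarity instead of testing key equality."""
    vectors = tfidf_vectors(left + right)
    left_vecs, right_vecs = vectors[:len(left)], vectors[len(left):]
    scored = [
        (cosine(a, b), l, r)
        for l, a in zip(left, left_vecs)
        for r, b in zip(right, right_vecs)
    ]
    # Keep only pairs that share at least one term, best matches first.
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_k]


# Hypothetical data: the same companies named differently on two sites.
site_a = ["Kleiser-Walczak Construction Co.", "AT&T Labs Research", "Lucent Technologies Inc"]
site_b = ["Kleiser Walczak", "Lucent Technologies", "Bell Atlantic"]

for score, name_a, name_b in similarity_join(site_a, site_b):
    print(f"{score:.3f}  {name_a!r} ~ {name_b!r}")
```

An exact-match join on these two lists would return nothing, because no name is spelled identically on both sides. The ranked join still pairs the fragments that share distinctive terms, which is the kind of approximation the abstract describes.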

Index Terms

1. Information systems

Published In

Autonomous Agents and Multi-Agent Systems Volume 2, Issue 1

March 1999

97 pages

ISSN: 1387-2532

Copyright © 1999 Kluwer Academic Publishers.

Publisher

Kluwer Academic Publishers

United States

Cited By

Wong T, Lam W (2007). Adapting Web information extraction knowledge via mining site-invariant and site-dependent features. ACM Transactions on Internet Technology 7:1 (6-es). DOI: 10.1145/1189740.1189746. Online publication date: 1-Feb-2007
https://dl.acm.org/doi/10.1145/1189740.1189746
Calado P, da Silva A, Laender A, Ribeiro-Neto B, Vieira R (2004). A Bayesian network approach to searching Web databases through keyword-based queries. Information Processing and Management: an International Journal 40:5 (773-790). DOI: 10.1016/j.ipm.2004.03.002. Online publication date: 1-Sep-2004
https://dl.acm.org/doi/10.1016/j.ipm.2004.03.002
da Silva A, Calado P, Vieira R, Laender A, Ribeiro-Neto B (2003). Keyword-based queries over web databases. Effective databases for text & document management (74-92). DOI: 10.5555/950765.950772. Online publication date: 1-Jan-2003
https://dl.acm.org/doi/10.5555/950765.950772
Ohwada H, Mizoguchi F (2003). Integrating information visualization and retrieval for WWW information discovery. Theoretical Computer Science 292:2 (547-571). DOI: 10.1016/S0304-3975(02)00186-X. Online publication date: 27-Jan-2003
https://dl.acm.org/doi/10.1016/S0304-3975%2802%2900186-X
Cohen W, Richman J, Zaïane O, Goebel R, Hand D, Keim D, Ng R (2002). Learning to match and cluster large high-dimensional data sets for data integration. Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (475-480). DOI: 10.1145/775047.775116. Online publication date: 23-Jul-2002
https://dl.acm.org/doi/10.1145/775047.775116
Calado P, da Silva A, Vieira R, Laender A, Ribeiro-Neto B, Nicholas C, Grossman D, Kalpakis K, Qureshi S, van Dissel H, Seligman L (2002). Searching web databases by structuring keyword-based queries. Proceedings of the eleventh international conference on Information and knowledge management (26-33). DOI: 10.1145/584792.584801. Online publication date: 4-Nov-2002
https://dl.acm.org/doi/10.1145/584792.584801
