Reasoning about Textual Similarity in a Web-Based Information Access System

Published: 01 March 1999

Abstract

The degree to which information sources are pre-processed by Web-based information systems varies greatly. In search engines like Altavista, little pre-processing is done, while in “knowledge integration” systems, complex site-specific “wrappers” are used to integrate different information sources into a common database representation. In this paper we describe an intermediate point between these two models. In our system, information sources are converted into a highly structured collection of small fragments of text. Database-like queries to this structured collection of text fragments are approximated using a novel logic called WHIRL, which combines inference in the style of deductive databases with ranked retrieval methods from information retrieval (IR). WHIRL allows queries that integrate information from multiple Web sites, without requiring the extraction and normalization of object identifiers that can be used as keys; instead, operations that in conventional databases require equality tests on keys are approximated using IR similarity metrics for text. This leads to a reduction in the amount of human engineering required to field a knowledge integration system. Experimental evidence is given showing that many information sources can be easily modeled with WHIRL, and that inferences in the logic are both accurate and efficient.
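
The idea of replacing key-equality tests with a text similarity score can be illustrated with a small sketch. This is not WHIRL itself: it assumes TF-IDF weighting with cosine similarity as the IR metric (one common choice; the abstract does not name the metric), and the function names (`tokenize`, `tfidf_vectors`, `cosine`, `similarity_join`) and the sample company names are hypothetical, introduced only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one unit-length TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            term: (1 + math.log(tf)) * math.log(1 + n_docs / doc_freq[term])
            for term, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def similarity_join(left, right, top_k=3):
    """Rank (left, right) pairs by text similarity instead of testing key equality."""
    vectors = tfidf_vectors(left + right)
    left_vecs, right_vecs = vectors[:len(left)], vectors[len(left):]
    scored = [
        (cosine(lv, rv), l, r)
        for l, lv in zip(left, left_vecs)
        for r, rv in zip(right, right_vecs)
    ]
    # Keep only pairs that share at least one term, best matches first.
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_k]


# Hypothetical name fields taken from two sources that share no common key.
site_a = ["Acme Software Corp", "Global Widgets Inc", "Northern Lights Books"]
site_b = ["ACME Software Corporation", "Global Widgets Incorporated",
          "Northern Lights Bookstore"]

for score, a, b in similarity_join(site_a, site_b):
    print(f"{score:.3f}  {a!r} ~ {b!r}")
```

The join returns a ranked list of candidate pairs rather than an exact match set, which reflects the abstract's point that equality tests on keys are approximated rather than computed exactly. A brute-force comparison of all pairs is used here only for brevity.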


Published In

Autonomous Agents and Multi-Agent Systems, Volume 2, Issue 1
March 1999
97 pages

Publisher

Kluwer Academic Publishers, United States

Author Tags

1. Information agent
2. information extraction
3. information integration
4. information retrieval
5. ranked retrieval
6. similarity
