Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
article
Free access

Data integration using similarity joins and a word-based information representation language

Published: 01 July 2000 Publication History
  • Get Citation Alerts
  • Abstract

    The integration of distributed, heterogeneous databases, such as those available on the World Wide Web, poses many problems. Herer we consider the problem of integrating data from sources that lack common object identifiers. A solution to this problem is proposed for databases that contain informal, natural-language “names” for objects; most Web-based databases satisfy this requirement, since they usually present their information to the end-user through a veneer of text. We describe WHIRL, a “soft” database management system which supports “similarity joins,” based on certain robust, general-purpose similarity metrics for text. This enables fragments of text (e.g., informal names of objects) to be used as keys. WHIRL includes textual objects as a built-in type, similarity reasoning as a built-in predicate, and answers every query with a list of answer substitutions that are ranked according to an overall score. Experiments show that WHIRL is much faster than naive inference methods, even for short queries, and efficient on typical queries to real-world databases with tens of thousands of tuples. Inferences made by WHIRL are also surprisingly accurate, equaling the accuracy of hand-coded normalization routines on one benchmark problem, and outerperforming exact matching with a plausible global domain on a second.

    References

    [1]
    ABITEBOUL,S.AND VIANU, V. 1997. Regular path queries with constraints. In Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Sys-tems (PODS-97) (Tucson, AZ, May 1997).
    [2]
    ARENS, Y., KNOBLOCK,C.A.,AND HSU, C.-N. 1996. Query processing in the SIMS informa-tion mediator. In A. Tate Ed., Advanced Planning Technology. Menlo Park, CA: AAAI Press.
    [3]
    ATZENI, P., MECCA, G., AND MERIALDO, P. 1997. Semistructured and structured data on the Web: going back and forth. In D. Suciu Ed., Proceedings of the Workshop on Management of Semistructured Data (Tucson, Arizona, May 1997). Available on-line from http://www.re-search. att.com/ ; suciu/workshop-papers.html.
    [4]
    BARBARA, D., GARCIA-MOLINA, H., AND PORTER, D. 1992. The management of probabilistic data. IEEE Transations on knowledge and data engineering 4, 5 (October), 487-501.
    [5]
    BARTELL,B.T.,COTTRELL,G.W.,AND BELEW, R. K. 1994. Automatic combination of multiple ranked retrieval systems. In Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1994).
    [6]
    BAYARDO,R.J.,BOHRER, W., BRICE, R., CICHOCKI, A., FOWLER, J., HELAL, A., KASHYAP, V., KSIEZYK, T., MARTIN, G., NODINE, M., RASHID, M., RUSINKIEWICZ, M., SHEA, R., UNNIKRISHAN, C., UNRUH, A., AND WOELK, D. 1997. Infosleuth: an agent-based semantic integration of information in open and dynamic environments. In Proceedings of the 1997 ACM SIGMOD (May 1997).
    [7]
    BORGMAN,C.L.AND SIEGFRIED, S. L. 1992. Getty's Synoname and its cousins: a survey of applications of personal name-matching algorithms. Journal of the American Society for Information Science 43, 7, 459-476.
    [8]
    BOSC,P.AND PRADE, H. 1997. An introduction to the fuzzy set and possibility theory-based treatment of queries and uncertain or imprecise databases. In Uncertainty management in information systems. Kluwer Academic Publishers.
    [9]
    BOYAN, J., FREITAG, D., AND JOACHIMS, T. 1994. A machine learning architecture for optimiz-ing web search engines. Technical Report WS-96-05, American Association of Artificial Intelligence.
    [10]
    CHAUDHURI, S., DAYAL, U., AND YAN, T. 1995. Join queries with external text sources: execution and optimization techniques. In Proceedings of the 1995 ACM SIGMOD (May 1995).
    [11]
    COHEN, W. W. 1997. Knowledge integration for structured information sources containing text (extended abstract). In The SIGIR-97 Workshop on Networked Information Retrieval (1997).
    [12]
    COHEN, W. W. 1998a. Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proceedings of ACM SIGMOD-98 (Seattle, WA, 1998).
    [13]
    COHEN, W. W. 1998b. A Web-based information system that reasons with structured collections of text. In Proceedings of Autonomous Agents-98 (St. Paul, MN, 1998).
    [14]
    COHEN,W.W.AND HIRSH, H. 1998. Joins that generalize: Text categorization using WHIRL. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (New York, NY, 1998), pp. 169-173.
    [15]
    COHEN,W.W.AND SINGER, Y. 1996. Context-sensitive learning methods for text categoriza-tion. In Proceedings of the 19th Annual International ACM Conference on Research and Development in Information Retrieval (Zurich, Switzerland, 1996), pp. 307-315. ACM Press.
    [16]
    COHEN,W.W.,SCHAPIRE,R.E.,AND SINGER, Y. 1997. Learning to order things. In Advances in Neural Processing Systems 10 (Denver, CO, 1997). MIT Press.
    [17]
    CRAVEN, M., DIPASQUO, D., FREITAG, D., MCCALLUM, A., MITCHELL, T., NIGAM, K., AND SLATTERY, S. 1998. Learning to extract symbolic knowledge from the world wide web. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98) (Madison, WI, 1998).
    [18]
    DUSCHKA,O.M.AND GENESERETH, M. R. 1997a. Answering recursive queries using views. In Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS-97) (Tucson, AZ, May 1997).
    [19]
    DUSCHKA,O.M.AND GENESERETH, M. R. 1997b. Query planning in infomaster. In Proceed-ings of the Twelfth Annual ACM Symposium on Applied Computing (SAC97) (San Jose, CA, February 1997).
    [20]
    FAGAN, J. L. 1989. The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science 40, 2, 115-132.
    [21]
    FAGIN, R. 1998. Fuzzy queries in multimedia database systems. In Proc. 1998 ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'98) (1998).
    [22]
    FANG, D., HAMMER, J., AND MCLEOD, D. 1994. The identification and resolution of semantic heterogeneity in multidatabase systems. In Multidatabase Systems: An Advanced Solution for Global Information Sharing, pp. 52-60. IEEE Computer Society Press, Los Alamitos, California.
    [23]
    FELLIGI,I.P.AND SUNTER, A. B. 1969. A theory for record linkage. Journal of the American Statistical Society 64, 1183-1210.
    [24]
    FIEBIG, T., WEISS, J., AND MOERKOTTE, G. 1997. RAW: a relational algebra for the Web. In D. Suciu Ed., Proceedings of the Workshop on Management of Semistructured Data (Tucson, Arizona, May 1997). Available on-line from http://www.research.att.com/ ; suciu/workshop-papers. html.
    [25]
    FUHR, N. 1995. Probabilistic Datalog:a logic for powerful retrieval methods. In Proceed-ings of the 1995 ACM SIGIR conference on research in information retrieval (New York, 1995), pp. 282-290.
    [26]
    GARCIA-MOLINA, H., PAPAKONSTANTINOU, Y., QUASS, D., RAJARAMAN, A., SAGIV, Y., ULLMAN, J., AND WIDOM, J. 1995. The TSIMMIS approach to mediation: Data models and languages (extended abstract). In Next Generation Information Technologies and Systems (NGITS-95) (Naharia, Israel, November 1995).
    [27]
    HERNANDEZ,M.AND STOLFO, S. 1995. The merge/purge problem for large databases. In Proceedings of the 1995 ACM SIGMOD (May 1995).
    [28]
    HUFFMAN,S.AND STEIER, D. 1995. Heuristic joins to integrate structured heterogeneous data. In Working notes of the AAAI spring symposium on information gathering in heteroge-neous distributed environments (Palo Alto, CA, March 1995). AAAI Press.
    [29]
    KILSS,B.AND ALVEY, W. 1985. Record linkage techniques:1985. Statistics of Income Division, Internal Revenue Service Publication 1299-2-96. Available from http://www.bts. gov/fcsm/methodology/.
    [30]
    KNUTH, D. E. 1975. The Art of Computer Programming, Volume I: Fundamental Algorithms (second edition). Addison-Wesley, Reading, MA.
    [31]
    KONOPNICKI,D.AND SCHMUELI, O. 1995. W3QS: a query system for the world wide web. In Proceedings of the 21st International Conference on Very Large Databases (VLDB-96) (Zurich, Switzerland, 1995).
    [32]
    KORF, R. 1993. Linear-space best-first search. Artificial Intelligence 62, 1 (July), 41-78.
    [33]
    LEVY,A.Y.,RAJARAMAN, A., AND ORDILLE, J. J. 1996a. Query answering algorithms for information agents. In Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96) (Portland, Oregon, August 1996).
    [34]
    LEVY,A.Y.,RAJARAMAN, A., AND ORDILLE, J. J. 1996b. Querying heterogeneous information sources using source descriptions. In Proceedings of the 22nd International Conference on Very Large Databases (VLDB-96) (Bombay, India, September 1996).
    [35]
    LEWIS, D. 1992. Representation and learning in information retrieval. Technical Report 91-93, Computer Science Dept., University of Massachusetts at Amherst. PhD Thesis.
    [36]
    MENDELZON,A.AND MILO, T. 1997. Formal models of Web queries. In Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS-97) (Tucson, AZ, May 1997).
    [37]
    MONGE,A.AND ELKAN, C. 1996. The field-matching problem: algorithm and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (August 1996).
    [38]
    MONGE,A.AND ELKAN, C. 1997. An efficient domain-independent algorithm for detecting approximately duplicate database records. In The proceedings of the SIGMOD 1997 work-shop on data mining and knowledge discovery (May 1997).
    [39]
    NEWCOMBE,H.B.,KENNEDY,J.M.,AXFORD,S.J.,AND JAMES, A. P. 1959. Automatic linkage of vital records. Science 130, 954-959.
    [40]
    NILSSON, N. 1987. Principles of Artificial Intelligence. Morgan Kaufmann.
    [41]
    PEARL, J. 1984. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, Reading, MA.
    [42]
    PORTER, M. F. 1980. An algorithm for suffix stripping. Program 14, 3, 130-137.
    [43]
    QUINLAN, J. R. 1990. Learning logical definitions from relations. Machine Learning 5,3, 239-266.
    [44]
    SALTON,G.ED. 1989. Automatic Text Processing. Addison Wesley, Reading, Massachusetts.
    [45]
    SCHA ~ UBLE, P. 1993. SPIDER: A multiuser information retrieval system for semistructured and dynamic data. In Proceedings of the 1993 ACM SIGIR conference on research in information retrieval (Pittsburgh, PA, 1993), pp. 318-327.
    [46]
    SUCIU, D. 1996. Query decomposition and view maintenance for query languages for un-structured data. In Proceedings of the 22nd International Conference on Very Large Data-bases (VLDB-96) (Bombay, India, 1996).
    [47]
    SUCIU,D.ED. 1997. Proceedings of the Workshop on Management of Semistructured Data. Available on-line from http://www.research.att.com/suciu/workshop-papers.html, Tucson, Arizona.
    [48]
    TOMASIC, A., AMOUROUX, R., BONNET, P., AND KAPITSKAIA, O. 1997. The distributed informa-tion search component (Disco) and the World Wide Web. In Proceedings of the 1997 ACM SIGMOD (May 1997).
    [49]
    TURTLE,H.AND FLOOD, J. 1995. Query evaluation: strategies and optimizations. Informa-tion processing and management 31, 6 (November), 831-850.
    [50]
    ZADEH, L. A. 1965. Fuzzy sets. Information and Control 8, 338-353.

    Cited By

    View all
    • (2024)Leveraging Pretrained Language Models for Enhanced Entity MatchingInternational Journal of Intelligent Systems10.1155/2024/19412212024Online publication date: 1-Jan-2024
    • (2022)Smart Sales Empower Small Farmers: An Integrated Matching Method between Suppliers and Consumers Based on the Information AxiomSustainability10.3390/su14241693714:24(16937)Online publication date: 16-Dec-2022
    • (2022)Scaling High-Quality Pairwise Link-Based Similarity Retrieval on Billion-Edge GraphsACM Transactions on Information Systems10.1145/349520940:4(1-45)Online publication date: 11-Jan-2022
    • Show More Cited By

    Recommendations

    Reviews

    Richard S. Marcus

    A core problem in data integration is that databases from heterogeneous sources often name objects differently. Using the SMART vector-space textual retrieval system techniques of Gerard Salton, Cohen has devised an elaborate methodology for overcoming this problem by automatically identifying similar names that are likely to refer to the same real-world object. Cohen reports success in his endeavors on three fronts: an extensive formal presentation; the creation of a data retrieval system (WHIRL) that emphasizes efficiency by avoiding operations on unlikely-to-be-useful intermediate forms; and experiments using WHIRL on Web databases, demonstrating that it is fast and precise. The author is to be applauded for his efforts, which should be must reading for those interested in overcoming the problem of natural language variability of object names in data integration. Of course, more work needs to be done. First, while precision in the demonstration examples generally seems quite high (in neighborhood of 95 percent), there are cases where “noise words” reduce precision considerably (for example, to 67 percent for names of computer games). Furthermore, there is little discussion of recall-like issues, such as how often missed similarity matches cause failure to find appropriate search results. Also, much more work needs to be done in comparing WHIRL-like systems—using measures of ease-of-use and development costs, as well as precision and recall—to systems in which the name dissimilarities are resolved in database construction by manual or other means. Comparison of WHIRL's statistical methods for name matching with more structured, contextual Boolean keyword techniques would be especially welcome. In particular, some Boolean techniques recognize that increasing levels of coordination in multiword phrases generally yield higher levels of relevance and precision, as opposed to the lower levels derived from WHIRL's scoring scheme, which multiplies probability-like scores. In addition, Boolean unions could enhance the representation of concept term alternates, whereas WHIRL appears still limited to conjunctions.

    Access critical reviews of Computing literature here

    Become a reviewer for Computing Reviews.

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Information Systems
    ACM Transactions on Information Systems  Volume 18, Issue 3
    July 2000
    111 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/352595
    • Editor:
    • W. Bruce Croft
    Issue’s Table of Contents

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 July 2000
    Published in TOIS Volume 18, Issue 3

    Permissions

    Request permissions for this article.

    Check for updates

    Qualifiers

    • Article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)75
    • Downloads (Last 6 weeks)8
    Reflects downloads up to 12 Aug 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Leveraging Pretrained Language Models for Enhanced Entity MatchingInternational Journal of Intelligent Systems10.1155/2024/19412212024Online publication date: 1-Jan-2024
    • (2022)Smart Sales Empower Small Farmers: An Integrated Matching Method between Suppliers and Consumers Based on the Information AxiomSustainability10.3390/su14241693714:24(16937)Online publication date: 16-Dec-2022
    • (2022)Scaling High-Quality Pairwise Link-Based Similarity Retrieval on Billion-Edge GraphsACM Transactions on Information Systems10.1145/349520940:4(1-45)Online publication date: 11-Jan-2022
    • (2022)(Almost) all of entity resolutionScience Advances10.1126/sciadv.abi80218:12Online publication date: 25-Mar-2022
    • (2022)Towards One-Size-Fits-Many: Multi-Context Attention Network for Diversity of Entity Resolution TasksIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2021.306079034:12(6018-6032)Online publication date: 1-Dec-2022
    • (2022)Unsupervised Entity Resolution With Blocking and Graph AlgorithmsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2020.299106334:3(1501-1515)Online publication date: 1-Mar-2022
    • (2022)Bitmap filterInformation Systems10.1016/j.is.2019.10144988:COnline publication date: 21-Apr-2022
    • (2021)The Four Generations of Entity ResolutionSynthesis Lectures on Data Management10.2200/S01067ED1V01Y202012DTM06416:2(1-170)Online publication date: 15-Mar-2021
    • (2021)Entity Resolution Based on Boolean Matching RulesHans Journal of Data Mining10.12677/HJDM.2021.11201211:02(121-133)Online publication date: 2021
    • (2021)An Entity Resolution Algorithm Based on Attribute SalienceHans Journal of Data Mining10.12677/HJDM.2021.11200411:02(27-37)Online publication date: 2021
    • Show More Cited By

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Get Access

    Login options

    Full Access

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media