Abstract
The problem of integrating data from multiple data sources—either on the Internet or within enterprises—has received much attention in the database and AI communities. The focus has been on building data integration systems that provide a uniform query interface to the sources. A key bottleneck in building such systems has been the laborious manual construction of semantic mappings between the query interface and the source schemas. Examples of mappings are “element location maps to address” and “price maps to listed-price”. We propose a multistrategy learning approach to automatically find such mappings. The approach applies multiple learner modules, where each module exploits a different type of information either in the schemas of the sources or in their data, then combines the predictions of the modules using a meta-learner. Learner modules employ a variety of techniques, ranging from Naive Bayes and nearest-neighbor classification to entity recognition and information retrieval. We describe the LSD system, which employs this approach to find semantic mappings. To further improve matching accuracy, LSD exploits domain integrity constraints, user feedback, and nested structures in XML data. We test LSD experimentally on several real-world domains. The experiments validate the utility of multistrategy learning for data integration and show that LSD proposes semantic mappings with a high degree of accuracy.
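The combination step described above can be illustrated with a minimal sketch. This is not LSD's implementation: the toy training data, the two learners (a nearest-neighbor matcher on element names and a Naive Bayes classifier over character classes of data values), and the fixed `WEIGHTS` that stand in for a trained meta-learner are all assumptions made for illustration.

```python
import math
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical training data: (source element name, sample value, mediated label).
TRAIN = [
    ("location", "Seattle WA", "address"),
    ("addr", "Portland OR", "address"),
    ("listed-price", "$250,000", "price"),
    ("cost", "$110,000", "price"),
]
LABELS = sorted({label for _, _, label in TRAIN})

# Assumed fixed weights; a real meta-learner would learn these from training data.
WEIGHTS = {"name": 0.4, "content": 0.6}


def normalize(scores):
    """Scale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    if total == 0:
        return {k: 1.0 / len(scores) for k in scores}
    return {k: v / total for k, v in scores.items()}


def name_learner(name):
    """Nearest-neighbor on element names: best string similarity per label."""
    scores = {label: 0.0 for label in LABELS}
    for train_name, _, label in TRAIN:
        sim = SequenceMatcher(None, name.lower(), train_name.lower()).ratio()
        scores[label] = max(scores[label], sim)
    return normalize(scores)


def symbols(value):
    """Map each character to a coarse class: digit, letter, or itself."""
    return ["d" if c.isdigit() else "a" if c.isalpha() else c for c in value]


def content_learner(value):
    """Naive Bayes over character classes with add-one smoothing."""
    counts = {label: Counter() for label in LABELS}
    for _, train_value, label in TRAIN:
        counts[label].update(symbols(train_value))
    vocab = set().union(*counts.values()) | set(symbols(value))
    log_scores = {}
    for label in LABELS:
        total = sum(counts[label].values()) + len(vocab)
        log_scores[label] = sum(
            math.log((counts[label][s] + 1) / total) for s in symbols(value)
        )
    top = max(log_scores.values())  # subtract the max for numerical stability
    return normalize({k: math.exp(v - top) for k, v in log_scores.items()})


def predict(name, value):
    """Combine the learners' predictions as a weighted sum and pick the best label."""
    preds = {"name": name_learner(name), "content": content_learner(value)}
    combined = {
        label: sum(WEIGHTS[m] * preds[m][label] for m in preds) for label in LABELS
    }
    return max(combined, key=combined.get), combined
```

Calling `predict("house-location", "Tacoma WA")` returns the chosen label together with the combined score for every label, so a caller can inspect how confident the combination is before accepting a mapping.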
Cite this article
Doan, A., Domingos, P. & Halevy, A. Learning to Match the Schemas of Data Sources: A Multistrategy Approach. Machine Learning 50, 279–301 (2003). https://doi.org/10.1023/A:1021765902788
DOI: https://doi.org/10.1023/A:1021765902788