Learning to Match the Schemas of Data Sources: A Multistrategy Approach

Doan, AnHai; Domingos, Pedro; Halevy, Alon

doi:10.1023/A:1021765902788

Published: March 2003

Volume 50, pages 279–301 (2003)

Machine Learning

AnHai Doan¹,
Pedro Domingos² &
Alon Halevy²

Abstract

The problem of integrating data from multiple data sources—either on the Internet or within enterprises—has received much attention in the database and AI communities. The focus has been on building data integration systems that provide a uniform query interface to the sources. A key bottleneck in building such systems has been the laborious manual construction of semantic mappings between the query interface and the source schemas. Examples of mappings are “element location maps to address” and “price maps to listed-price”. We propose a multistrategy learning approach to automatically find such mappings. The approach applies multiple learner modules, where each module exploits a different type of information either in the schemas of the sources or in their data, then combines the predictions of the modules using a meta-learner. Learner modules employ a variety of techniques, ranging from Naive Bayes and nearest-neighbor classification to entity recognition and information retrieval. We describe the LSD system, which employs this approach to find semantic mappings. To further improve matching accuracy, LSD exploits domain integrity constraints, user feedback, and nested structures in XML data. We test LSD experimentally on several real-world domains. The experiments validate the utility of multistrategy learning for data integration and show that LSD proposes semantic mappings with a high degree of accuracy.
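
The multistrategy idea described in the abstract can be illustrated with a minimal Python sketch. It is not the LSD implementation: the two learners (a name matcher based on token overlap and a Naive Bayes classifier over data values), the tiny training and held-out sets, and the accuracy-based weighting that stands in for a trained meta-learner are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical labeled examples: (element name, one data value, mediated-schema label).
TRAIN = [
    ("location", "Seattle WA", "address"),
    ("house-addr", "Portland OR", "address"),
    ("listed-price", "$250,000", "price"),
    ("price", "$110,000", "price"),
]
# Hypothetical held-out examples used only to weight the learners.
HELD_OUT = [
    ("addr", "Miami FL", "address"),
    ("cost", "$99,500", "price"),
]
LABELS = sorted({label for _, _, label in TRAIN})


def words(text):
    """Split into lowercase tokens; every digit run becomes '<num>'."""
    return ["<num>" if t.isdigit() else t
            for t in re.findall(r"[a-z]+|\d+|\$", text.lower())]


def normalize(scores):
    """Rescale non-negative scores so they sum to one."""
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


def name_learner(name, _value):
    """Nearest-neighbor style: best Jaccard overlap with training names per label."""
    query = set(words(name))
    scores = {}
    for label in LABELS:
        best = 0.0
        for train_name, _, train_label in TRAIN:
            if train_label == label:
                seen = set(words(train_name))
                best = max(best, len(query & seen) / len(query | seen))
        scores[label] = best + 0.01  # small floor so scores never sum to zero
    return normalize(scores)


def content_learner(_name, value):
    """Multinomial Naive Bayes over data-value tokens with add-one smoothing."""
    counts = defaultdict(Counter)
    docs = Counter()
    for _, train_value, label in TRAIN:
        counts[label].update(words(train_value))
        docs[label] += 1
    vocab = {t for c in counts.values() for t in c}
    scores = {}
    for label in LABELS:
        log_p = math.log(docs[label] / len(TRAIN))
        total = sum(counts[label].values())
        for t in words(value):
            # The extra +1 in the denominator reserves mass for unseen tokens.
            log_p += math.log((counts[label][t] + 1) / (total + len(vocab) + 1))
        scores[label] = math.exp(log_p)
    return normalize(scores)


LEARNERS = [name_learner, content_learner]


def predict(learner, name, value):
    """Return the single best label according to one learner."""
    scores = learner(name, value)
    return max(scores, key=scores.get)


def fit_weights():
    """Weight each learner by held-out accuracy (a stand-in for a trained meta-learner)."""
    accuracy = [
        sum(predict(l, n, v) == y for n, v, y in HELD_OUT) / len(HELD_OUT)
        for l in LEARNERS
    ]
    total = sum(accuracy) or 1.0  # avoid division by zero if every learner fails
    return [a / total for a in accuracy]


def match(name, value, weights):
    """Combine the learners' scores with the weights and pick the best label."""
    combined = {
        label: sum(w * l(name, value)[label] for w, l in zip(weights, LEARNERS))
        for label in LABELS
    }
    return max(combined, key=combined.get)


if __name__ == "__main__":
    weights = fit_weights()
    print(match("asking", "$1,250,000", weights))  # price
    print(match("location", "Boise ID", weights))  # address
```

In the first call the element name gives no evidence, so the data-value learner decides the match; in the second the name alone is nearly decisive. Combining learners that look at different kinds of information is what lets the overall matcher cope with both cases.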

Author information

Authors and Affiliations

Department of Computer Science, University of Illinois, Urbana-Champaign, IL, 61801, USA
AnHai Doan
Department of Computer Science and Engineering, University of Washington, Seattle, WA, 98195, USA
Pedro Domingos & Alon Halevy

About this article

Cite this article

Doan, A., Domingos, P. & Halevy, A. Learning to Match the Schemas of Data Sources: A Multistrategy Approach. Machine Learning 50, 279–301 (2003). https://doi.org/10.1023/A:1021765902788

Issue Date: March 2003
DOI: https://doi.org/10.1023/A:1021765902788
