DOI: 10.1145/872757.872784

Statistical schema matching across web query interfaces

Published: 09 June 2003

    Abstract

    Schema matching is a critical problem in integrating heterogeneous information sources. Traditionally, matching multiple schemas has relied on finding pairwise attribute correspondences. This paper proposes a different approach, motivated by the need to integrate large numbers of data sources on the Internet. On this "deep Web," we observe two distinguishing characteristics that offer a new view of schema matching. First, as the Web scales, ample sources provide structured information in the same domains (e.g., books and automobiles). Second, while sources proliferate, their aggregate schema vocabulary tends to converge to a relatively small size. Motivated by these observations, we propose a new paradigm, statistical schema matching: unlike traditional approaches based on pairwise attribute correspondence, we take a holistic approach and match all input schemas by finding an underlying generative schema model. We propose a general statistical framework, MGS, for such hidden model discovery, which consists of hypothesis modeling, generation, and selection. Further, we specialize the general framework to develop Algorithm MGSsd, which targets synonym discovery, a canonical problem of schema matching, by designing and discovering a model that specifically captures synonym attributes. We demonstrate our approach on hundreds of real Web sources in four domains, and the results show good accuracy.
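
The two observations in the abstract can be illustrated with a small sketch. This is not the paper's MGSsd algorithm, which fits a generative model and selects among hypotheses statistically. It is a much simpler stand-in that assumes synonym attributes are frequent across sources yet never appear together in one schema. The function names, the `min_support` parameter, and the toy book schemas are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def vocabulary_growth(schemas):
    """Cumulative number of distinct attributes seen after each schema."""
    seen = set()
    growth = []
    for schema in schemas:
        seen.update(schema)
        growth.append(len(seen))
    return growth


def candidate_synonyms(schemas, min_support=2):
    """Pairs of frequent attributes that never co-occur in any schema.

    A naive co-occurrence heuristic (an assumption of this sketch),
    not the hypothesis modeling/generation/selection of the paper.
    """
    # How many schemas mention each attribute.
    freq = Counter(attr for schema in schemas for attr in set(schema))
    # How many schemas mention both attributes of a (sorted) pair.
    cooccur = Counter()
    for schema in schemas:
        for pair in combinations(sorted(set(schema)), 2):
            cooccur[pair] += 1
    frequent = sorted(a for a, n in freq.items() if n >= min_support)
    return [pair for pair in combinations(frequent, 2) if cooccur[pair] == 0]


# Toy query-interface schemas for a "books" domain (made-up data).
schemas = [
    {"author", "title", "isbn", "subject"},
    {"writer", "title", "subject"},
    {"author", "title", "isbn"},
    {"writer", "title", "isbn"},
]

print(vocabulary_growth(schemas))   # vocabulary stops growing quickly
print(candidate_synonyms(schemas))  # "author" and "writer" never co-occur
```

On real data a plain "never co-occur" rule would produce many false positives for rare attributes, which is one reason a statistical treatment with explicit hypothesis selection is needed.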


    Cited By

    • (2024) The Proposed Framework of View-Dependent Data Integration Architecture. The Ethical Frontier of AI and Data Analysis, pp. 343-361. DOI: 10.4018/979-8-3693-2964-1.ch021. Online publication date: 12-Apr-2024
    • (2021) A study on machine learning techniques for the schema matching network problem. Journal of the Brazilian Computer Society, 27:1. DOI: 10.1186/s13173-021-00119-5. Online publication date: 23-Nov-2021
    • (2021) ASSEMBLE: Attribute, Structure and Semantics Based Service Mapping Approach for Collaborative Business Process Development. IEEE Transactions on Services Computing, 14:2, pp. 371-385. DOI: 10.1109/TSC.2018.2805346. Online publication date: 1-Mar-2021


    Published In

    SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data
    June 2003
    702 pages
    ISBN:158113634X
    DOI:10.1145/872757
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • Article

    Conference

    SIGMOD/PODS03
    Acceptance Rates

    SIGMOD '03 Paper Acceptance Rate 53 of 342 submissions, 15%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%
