DOI: 10.1145/3035918.3064010

Synthesizing Mapping Relationships Using Table Corpus

Published: 09 May 2017 Publication History

Abstract

Mapping relationships, such as (country, country-code) or (company, stock-ticker), are versatile data assets for an array of applications in data cleaning and data integration, such as auto-correction and auto-join. However, today there are no good repositories of mapping tables that can enable these intelligent applications.
Given a corpus of tables such as web tables or spreadsheet tables, we observe that values of these mappings often exist in pairs of columns in the same tables. Motivated by their broad applicability, we study the problem of synthesizing mapping relationships using a large table corpus. Our synthesis process leverages compatibility of tables based on co-occurrence statistics, as well as constraints such as functional dependency. Experimental results using web tables and enterprise spreadsheets suggest that the proposed approach can produce high-quality mappings.
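
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the function names (fd_ratio, candidate_mappings, compatibility, synthesize), the thresholds, and the toy tables are all assumptions made for illustration. The sketch keeps column pairs that approximately satisfy a functional dependency, scores two candidate mappings by how many value pairs they share and how many left-hand values they disagree on, and greedily unions the compatible ones.

```python
from collections import Counter, defaultdict
from itertools import permutations


def fd_ratio(pairs):
    """Fraction of rows kept if every left value maps only to its most common right value."""
    by_left = defaultdict(Counter)
    for left, right in pairs:
        by_left[left][right] += 1
    kept = sum(max(counts.values()) for counts in by_left.values())
    return kept / len(pairs) if pairs else 0.0


def candidate_mappings(table, min_fd=0.95):
    """Yield ((left_col, right_col), value_pairs) for column pairs where left -> right roughly holds."""
    for a, b in permutations(list(table), 2):
        pairs = list(zip(table[a], table[b]))
        if fd_ratio(pairs) >= min_fd:
            yield (a, b), set(pairs)


def rights_by_left(mapping):
    """Group right-hand values by their left-hand value."""
    grouped = defaultdict(set)
    for left, right in mapping:
        grouped[left].add(right)
    return grouped


def compatibility(m1, m2):
    """Return (shared value pairs, left values on which the two mappings disagree)."""
    shared = len(m1 & m2)
    r1, r2 = rights_by_left(m1), rights_by_left(m2)
    conflicts = sum(1 for key in r1.keys() & r2.keys() if r1[key] != r2[key])
    return shared, conflicts


def synthesize(mappings, min_shared=2):
    """Greedily union mappings that share enough pairs and never conflict (hypothetical rule)."""
    merged = []
    for m in mappings:
        for i, group in enumerate(merged):
            shared, conflicts = compatibility(group, m)
            if shared >= min_shared and conflicts == 0:
                merged[i] = group | m
                break
        else:
            merged.append(set(m))
    return merged


# Toy tables (invented data): t1 and t2 use two-letter codes, t3 uses a different code standard.
t1 = {"country": ["Germany", "France", "Japan"], "code": ["DE", "FR", "JP"]}
t2 = {"name": ["France", "Japan", "Canada"], "iso": ["FR", "JP", "CA"]}
t3 = {"country": ["Germany", "France"], "code": ["GER", "FRA"]}

m1 = dict(candidate_mappings(t1))[("country", "code")]
m2 = dict(candidate_mappings(t2))[("name", "iso")]
m3 = dict(candidate_mappings(t3))[("country", "code")]

groups = synthesize([m1, m2, m3])
for group in groups:
    print(sorted(group))
```

In this toy run, the two tables that use two-letter codes are unioned into one four-row mapping, while the table that uses a different code standard stays separate because it conflicts on every shared country. A real corpus would need far more robust statistics than the simple counts used here.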


    Published In

    SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
    May 2017
    1810 pages
    ISBN:9781450341974
    DOI:10.1145/3035918

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. auto-fill
    2. auto-join
    3. data cleaning
    4. data integration
    5. functional dependency
    6. mapping relationships
    7. mapping tables

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS'17
    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Cited By

    • (2024) DTT: An Example-Driven Tabular Transformer for Joinability by Leveraging Large Language Models. Proceedings of the ACM on Management of Data 2:1 (1-24). DOI: 10.1145/3639279. Online publication date: 26-Mar-2024
    • (2022) An Instance-Based Data Transformation Method. Hans Journal of Data Mining 12:03 (235-245). DOI: 10.12677/HJDM.2022.123024. Online publication date: 2022
    • (2021) A Domain Dictionary Extraction Algorithm Based on Mapping Relationships. Hans Journal of Data Mining 11:02 (59-76). DOI: 10.12677/HJDM.2021.112007. Online publication date: 2021
    • (2021) Merging Web Tables for Relation Extraction with Knowledge Graphs. IEEE Transactions on Knowledge and Data Engineering (1-1). DOI: 10.1109/TKDE.2021.3101479. Online publication date: 2021
    • (2020) Amplifying Domain Expertise in Clinical Data Pipelines. JMIR Medical Informatics 8:11 (e19612). DOI: 10.2196/19612. Online publication date: 5-Nov-2020
    • (2020) Auto-transform. Proceedings of the VLDB Endowment 13:12 (2368-2381). DOI: 10.14778/3407790.3407831. Online publication date: 14-Sep-2020
    • (2020) Web Table Extraction, Retrieval, and Augmentation. ACM Transactions on Intelligent Systems and Technology 11:2 (1-35). DOI: 10.1145/3372117. Online publication date: 25-Jan-2020
    • (2020) Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (1539-1554). DOI: 10.1145/3318464.3389738. Online publication date: 11-Jun-2020
    • (2020) Interactive Cleaning for Progressive Visualization through Composite Questions. 2020 IEEE 36th International Conference on Data Engineering (ICDE) (733-744). DOI: 10.1109/ICDE48307.2020.00069. Online publication date: Apr-2020
    • (2019) Synthesizing N-ary Relations from Web Tables. Proceedings of the 9th International Conference on Web Intelligence, Mining and Semantics (1-12). DOI: 10.1145/3326467.3326480. Online publication date: 26-Jun-2019
