research-article

Open data integration

Author:

Renée J. Miller

Proceedings of the VLDB Endowment, Volume 11, Issue 12

Pages 2130 - 2139

https://doi.org/10.14778/3229863.3240491

Published: 01 August 2018

Abstract

Open data plays a major role in supporting both governmental and organizational transparency. Many organizations are adopting Open Data Principles promising to make their open data complete, primary, and timely. These properties make this data tremendously valuable to data scientists. However, scientists generally do not have a priori knowledge about what data is available (its schema or content). Nevertheless, they want to be able to use open data and integrate it with other public or private data they are studying. Traditionally, data integration is done using a framework called query discovery where the main task is to discover a query (or transformation) that translates data from one form into another. The goal is to find the right operators to join, nest, group, link, and twist data into a desired form. We introduce a new paradigm for thinking about integration where the focus is on data discovery, but highly efficient internet-scale discovery that is driven by data analysis needs. We describe a research agenda and recent progress in developing scalable data-analysis or query-aware data discovery algorithms that provide high recall and accuracy over massive data repositories.
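To make the idea of discovery driven by an analysis need concrete, the following is a minimal sketch and not the method of the paper. Given a query column that a data scientist wants to join on, it ranks the columns of a repository by set containment, that is, the fraction of query values a candidate column covers. The function names, the toy repository, and the 0.5 threshold are all illustrative assumptions. This exhaustive scan only shows the ranking criterion; a repository at internet scale would need an index instead.

```python
from typing import Dict, List, Set, Tuple


def containment(query: Set[str], candidate: Set[str]) -> float:
    """Fraction of query values that also appear in the candidate column."""
    if not query:
        return 0.0
    return len(query & candidate) / len(query)


def discover_joinable(
    query: Set[str],
    repository: Dict[str, Set[str]],
    threshold: float = 0.5,  # hypothetical cut-off, chosen for illustration
) -> List[Tuple[str, float]]:
    """Return (column name, score) pairs whose containment meets the threshold,
    ordered by descending score and then by name."""
    scored = [(name, containment(query, values)) for name, values in repository.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Toy repository: column identifier -> set of distinct values (all made up).
repository = {
    "permits.city": {"toronto", "ottawa", "boston"},
    "parks.name": {"high park", "central park"},
    "census.city": {"toronto", "ottawa", "boston", "chicago", "montreal"},
}
query_column = {"toronto", "boston", "chicago", "seattle"}

results = discover_joinable(query_column, repository, threshold=0.5)
print(results)
```

Containment is used here rather than a symmetric measure such as Jaccard similarity because it is normalized by the query column only, so a large candidate column is not penalized for the extra values it holds.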

Index Terms

Open data integration
1. Information systems
  1. Data management systems
    1. Database design and models
    2. Database management system engines
  2. Information retrieval

Index terms have been assigned to the content through auto-classification.

Published In

Proceedings of the VLDB Endowment, Volume 11, Issue 12

August 2018

426 pages

ISSN: 2150-8097

Editors: Sihem Amer-Yahia (University of Grenoble Alpes, CNRS), Jian Pei (Simon Fraser University)

Publisher

VLDB Endowment
