research-article

Open data integration

Published: 01 August 2018

Abstract

Open data plays a major role in supporting both governmental and organizational transparency. Many organizations are adopting Open Data Principles, promising to make their open data complete, primary, and timely. These properties make this data tremendously valuable to data scientists. However, scientists generally do not have a priori knowledge about what data is available (its schema or content). Nevertheless, they want to be able to use open data and integrate it with other public or private data they are studying. Traditionally, data integration is done using a framework called query discovery, where the main task is to discover a query (or transformation) that translates data from one form into another. The goal is to find the right operators to join, nest, group, link, and twist data into a desired form. We introduce a new paradigm for thinking about integration in which the focus is on data discovery, specifically highly efficient, internet-scale discovery that is driven by data analysis needs. We describe a research agenda and recent progress in developing scalable data-analysis or query-aware data discovery algorithms that provide high recall and accuracy over massive data repositories.
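To make the idea of analysis-driven data discovery concrete, the following is a minimal sketch of one discovery task: given a column the analyst already has, find repository columns it could be joined with. It is an illustration only, not the authors' algorithm. The function names, the `repository` layout, the sample values, and the containment threshold are all assumptions made for the example, and the brute-force scan shown here would have to be replaced by an index to work at internet scale.

```python
from typing import Dict, Iterable, List, Set, Tuple


def containment(query: Set[str], candidate: Set[str]) -> float:
    """Fraction of the query column's distinct values found in the candidate."""
    if not query:
        return 0.0
    return len(query & candidate) / len(query)


def find_joinable_columns(
    query_values: Iterable[str],
    repository: Dict[str, Iterable[str]],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Scan every column and keep those whose containment meets the threshold.

    Hypothetical helper: values are normalized by trimming and lowercasing,
    which is an assumption about how matching should treat case and spacing.
    """
    query = {v.strip().lower() for v in query_values}
    results = []
    for name, values in repository.items():
        candidate = {v.strip().lower() for v in values}
        score = containment(query, candidate)
        if score >= threshold:
            results.append((name, score))
    # Highest containment first; ties broken by column name for stable output.
    return sorted(results, key=lambda pair: (-pair[1], pair[0]))


# Made-up sample repository mapping "table:column" to that column's values.
repository = {
    "transit.csv:city": ["Toronto", "Ottawa", "Montreal", "Vancouver"],
    "budget.csv:department": ["Finance", "Health", "Transport"],
    "census.csv:municipality": ["toronto", "ottawa", "Hamilton"],
}
matches = find_joinable_columns(["Toronto", "Ottawa"], repository)
print(matches)
```

Containment is used here rather than a symmetric measure because the query column is typically much smaller than the columns it is compared against; a symmetric score would penalize large but fully covering candidates.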

Published In

Proceedings of the VLDB Endowment, Volume 11, Issue 12
August 2018
426 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
