DOI: 10.1145/3318464.3389726
research-article

Finding Related Tables in Data Lakes for Interactive Data Science

Published: 31 May 2020

    Abstract

    Many modern data science applications build on data lakes, schema-agnostic repositories of data files and data products that offer limited organization and management capabilities. There is a need to build data lake search capabilities into data science environments, so scientists and analysts can find tables, schemas, workflows, and datasets useful to their task at hand. We develop search and management solutions for the Jupyter Notebook data science platform, to enable scientists to augment training data, find potential features to extract, clean data, and find joinable or linkable tables. Our core methods also generalize to other settings where computational tasks involve execution of programs or scripts.
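    To make the notion of a "joinable" table concrete, the sketch below ranks columns in a small in-memory collection by how much of a query column's distinct values they contain. This is only an illustration under stated assumptions: set containment is one common proxy for joinability, not necessarily the measure this paper uses, and the names `containment`, `rank_joinable`, and the dictionary-based `lake` layout are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical in-memory stand-in for a data lake:
# table name -> {column name -> list of cell values}
Table = Dict[str, List[object]]


def containment(query_values: Sequence[object],
                candidate_values: Sequence[object]) -> float:
    """Fraction of the query column's distinct values found in the candidate."""
    query_set = set(query_values)
    if not query_set:
        return 0.0
    return len(query_set & set(candidate_values)) / len(query_set)


def rank_joinable(query_column: Sequence[object],
                  lake: Dict[str, Table],
                  top_k: int = 3) -> List[Tuple[float, str, str]]:
    """Return the top_k (score, table, column) triples, highest score first."""
    scored = []
    for table_name, table in lake.items():
        for column_name, values in table.items():
            score = containment(query_column, values)
            scored.append((score, table_name, column_name))
    # Sort by descending score; break ties by name so the order is stable.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return scored[:top_k]
```

    Calling `rank_joinable(df_column_values, lake)` from a notebook cell would return candidate join columns ordered by overlap. This exhaustive scan is linear in the size of the collection, so a real data lake would need an index rather than a full pass.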

    Supplementary Material

    MP4 File (3318464.3389726.mp4)
    Presentation Video

      Published In

      SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
      June 2020
      2925 pages
      ISBN: 9781450367356
      DOI: 10.1145/3318464

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. data lakes
      2. interactive data science
      3. notebooks
      4. table search

      Qualifiers

      • Research-article

      Conference

      SIGMOD/PODS '20

      Acceptance Rates

      Overall Acceptance Rate 785 of 4,003 submissions, 20%
