DOI: 10.1145/2882903.2882957
research-article
Open access

SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment

Published: 14 June 2016

    Editorial Notes

    Computationally Replicable. The experimental results of this paper were replicated by a SIGMOD Review Committee and were found to support the central results reported in the paper.

    Abstract

    We analyze the workload from a multi-year deployment of a database-as-a-service platform targeting scientists and data scientists with minimal database experience. Our hypothesis was that relatively minor changes to the way databases are delivered can increase their use in ad hoc analysis environments. The web-based SQLShare system emphasizes easy dataset-at-a-time ingest, relaxed schemas and schema inference, easy view creation and sharing, and full SQL support. We find that these features have helped attract workloads typically associated with scripts and files rather than relational databases: complex analytics, routine processing pipelines, data publishing, and collaborative analysis. Quantitatively, these workloads are characterized by shorter dataset "lifetimes", higher query complexity, and higher data complexity. We report on usage scenarios that suggest SQL is being used in place of scripts for one-off data analysis and ad hoc data sharing. The workload suggests that a new class of relational systems emphasizing short-term, ad hoc analytics over engineered schemas may improve uptake of database technology in data science contexts. Our contributions include a system design for delivering databases into these contexts, a description of a public research query workload dataset released to advance research in analytic data systems, and an initial analysis of the workload that provides evidence of new use cases under-supported in existing systems.
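
    To make the feature list in the abstract concrete, the following is a minimal sketch of dataset-at-a-time ingest with schema inference followed by view creation. It is not SQLShare's implementation: it uses Python's built-in sqlite3 module as a stand-in backend, and the helper names (infer_type, ingest), the table name samples, the view name deep_samples, and the sample data are all assumptions made for illustration.

```python
import csv
import io
import sqlite3


def infer_type(values):
    """Pick the narrowest SQLite type that fits every non-empty value."""
    def all_parse(cast):
        for v in values:
            if v == "":
                continue  # missing values do not constrain the type
            try:
                cast(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "INTEGER"
    if all_parse(float):
        return "REAL"
    return "TEXT"


def ingest(conn, table, csv_text):
    """Load one CSV dataset into a new table, inferring column types."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    types = [infer_type([r[i] for r in data]) for i in range(len(header))]
    cols = ", ".join(f'"{name}" {typ}' for name, typ in zip(header, types))
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    marks = ", ".join("?" for _ in header)
    # Empty strings become NULL so that numeric columns stay numeric.
    cleaned = [[None if v == "" else v for v in r] for r in data]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', cleaned)
    return dict(zip(header, types))


# Hypothetical dataset: no schema is declared up front.
conn = sqlite3.connect(":memory:")
schema = ingest(
    conn,
    "samples",
    "station,depth_m,temp_c\nA,5,12.5\nB,10,\nC,20,9.75\n",
)

# A view stands in for a saved, shareable derived dataset.
conn.execute(
    "CREATE VIEW deep_samples AS "
    "SELECT station, temp_c FROM samples WHERE depth_m >= 10"
)
result = conn.execute(
    "SELECT station, temp_c FROM deep_samples ORDER BY station"
).fetchall()
print(schema)
print(result)
```

    The sketch infers one type per column from the uploaded values and lets a view serve as the unit of reuse, which mirrors the workflow the abstract describes only in spirit; the real system's inference rules and sharing model are not specified on this page.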

    Supplementary Material

    ReadMe (readme.txt)
    Rights information
    Query Workload Analysis Master (query-workload-analysis-master.zip)
    Graphs, Plots, Results


    Published In

    SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
    June 2016
    2300 pages
    ISBN:9781450335317
    DOI:10.1145/2882903
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication Notes

    Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

    Author Tags

    1. database management as a cloud service
    2. database management systems
    3. relational databases

    Conference

    SIGMOD/PODS'16: International Conference on Management of Data
    June 26 - July 1, 2016
    San Francisco, California, USA

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Cited By

    • (2024) SeLeP: Learning Based Semantic Prefetching for Exploratory Database Workloads. Proceedings of the VLDB Endowment 17:8 (2064-2076). DOI: 10.14778/3659437.3659458. Online publication date: 1-Apr-2024
    • (2024) SchemaPile: A Large Collection of Relational Database Schemas. Proceedings of the ACM on Management of Data 2:3 (1-25). DOI: 10.1145/3654975. Online publication date: 30-May-2024
    • (2024) Explaining cube measures through Intentional Analytics. Information Systems 121 (102338). DOI: 10.1016/j.is.2023.102338. Online publication date: Mar-2024
    • (2023) Reinforced approximate exploratory data analysis. Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence (7660-7669). DOI: 10.1609/aaai.v37i6.25929. Online publication date: 7-Feb-2023
    • (2023) Data Makes Better Data Scientists. Proceedings of the Workshop on Human-In-the-Loop Data Analytics (1-3). DOI: 10.1145/3597465.3605228. Online publication date: 18-Jun-2023
    • (2023) Database Evolution, by Scientists, for Scientists: A Case Study. 2023 IEEE 19th International Conference on e-Science (e-Science) (1-10). DOI: 10.1109/e-Science58273.2023.10254872. Online publication date: 9-Oct-2023
    • (2023) TRANSQLATION: TRANsformer-based SQL RecommendATION. 2023 IEEE International Conference on Big Data (BigData) (4703-4711). DOI: 10.1109/BigData59044.2023.10386277. Online publication date: 15-Dec-2023
    • (2023) A new window Clause for SQL++. The VLDB Journal 33:3 (595-623). DOI: 10.1007/s00778-023-00830-z. Online publication date: 19-Dec-2023
    • (2022) Efficient Evaluation of Arbitrarily-Framed Holistic SQL Aggregates and Window Functions. Proceedings of the 2022 International Conference on Management of Data (1243-1256). DOI: 10.1145/3514221.3526184. Online publication date: 10-Jun-2022
    • (2022) Intelligent Automated Workload Analysis for Database Replatforming. Proceedings of the 2022 International Conference on Management of Data (2273-2285). DOI: 10.1145/3514221.3526050. Online publication date: 10-Jun-2022
