DOI: 10.1145/3318464.3389779
short-paper

Automatically Generating Data Exploration Sessions Using Deep Reinforcement Learning

Published: 31 May 2020

    Abstract

    Exploratory Data Analysis (EDA) is an essential yet highly demanding task. To get a head start before exploring a new dataset, data scientists often prefer to view existing EDA notebooks -- illustrative, curated exploratory sessions, on the same dataset, that were created by fellow data scientists who shared them online. Unfortunately, such notebooks are not always available (e.g., if the dataset is new or confidential). To address this, we present ATENA, a system that takes an input dataset and auto-generates a compelling exploratory session, presented in an EDA notebook. We shape EDA into a control problem, and devise a novel Deep Reinforcement Learning (DRL) architecture to effectively optimize the notebook generation. Though ATENA uses a limited set of EDA operations, our experiments show that it generates useful EDA notebooks, allowing users to gain actual insights.
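
    To make the "EDA as a control problem" framing concrete, the following is a minimal sketch of an episodic environment in which each action is an EDA operation and each episode is one exploratory session. Everything in it is an assumption made for illustration: the operation set (filter, group, back), the entropy-based reward, the toy rows, and the names `ToyEdaEnv`, `entropy`, and `rollout` are hypothetical and are not taken from this page or from ATENA's implementation. A random policy stands in for the trained DRL agent.

```python
import math
import random
from collections import Counter

# Toy dataset: each row is a dict. Values are illustrative only.
ROWS = [
    {"airline": "AA", "origin": "JFK", "delayed": "yes"},
    {"airline": "AA", "origin": "LAX", "delayed": "no"},
    {"airline": "DL", "origin": "JFK", "delayed": "yes"},
    {"airline": "DL", "origin": "JFK", "delayed": "no"},
    {"airline": "UA", "origin": "LAX", "delayed": "no"},
    {"airline": "UA", "origin": "JFK", "delayed": "yes"},
]


def entropy(counts):
    """Shannon entropy (bits) of group sizes; a stand-in 'interestingness' reward."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


class ToyEdaEnv:
    """Hypothetical environment: state = current display, action = EDA operation."""

    def __init__(self, rows, horizon=4):
        self.rows = rows
        self.horizon = horizon  # fixed session length (number of operations)
        self.reset()

    def reset(self):
        self.stack = [self.rows]  # history of displays, so "back" can undo a filter
        self.steps = 0
        return self.observe()

    def observe(self):
        # Observation: row count of the current display, then distinct values per column.
        cur = self.stack[-1]
        return (len(cur),) + tuple(len({r[c] for r in cur}) for c in self.rows[0])

    def actions(self):
        # Enumerate the operations that are valid on the current display.
        cur = self.stack[-1]
        acts = [("back",)] if len(self.stack) > 1 else []
        for col in self.rows[0]:
            acts.append(("group", col))
            acts.extend(("filter", col, v) for v in sorted({r[col] for r in cur}))
        return acts

    def step(self, action):
        cur = self.stack[-1]
        reward = 0.0
        if action[0] == "back":
            if len(self.stack) > 1:
                self.stack.pop()
        elif action[0] == "filter":
            _, col, val = action
            self.stack.append([r for r in cur if r[col] == val])
        elif action[0] == "group":
            # Grouping does not change the display here; it only earns a reward.
            reward = entropy(Counter(r[action[1]] for r in cur))
        self.steps += 1
        return self.observe(), reward, self.steps >= self.horizon


def rollout(env, rng):
    """Generate one session with a random policy (placeholder for a learned one)."""
    env.reset()
    session, total, done = [], 0.0, False
    while not done:
        action = rng.choice(env.actions())
        _, reward, done = env.step(action)
        session.append(action)
        total += reward
    return session, total


if __name__ == "__main__":
    session, total = rollout(ToyEdaEnv(ROWS), random.Random(0))
    print(session, round(total, 3))
```

    In this sketch the returned `session` is the ordered list of operations that would populate a notebook; a learned policy would replace `rng.choice` so that sessions with higher cumulative reward become more likely.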

    Supplementary Material

    MP4 File (3318464.3389779.mp4)
    Presentation Video

    Information

    Published In

    SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
    June 2020
    2925 pages
    ISBN:9781450367356
    DOI:10.1145/3318464
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 31 May 2020

    Author Tags

    1. EDA
    2. EDA notebooks
    3. auto EDA
    4. auto generated
    5. autogenerated
    6. data exploration
    7. interactive data analysis
    8. notebooks

    Qualifiers

    • Short-paper

    Conference

    SIGMOD/PODS '20

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 150
    • Downloads (last 6 weeks): 15

    Citations

    Cited By

    • (2024) Guided Exploration of Industrial Sensor Data. Computer Graphics Forum 43:1. DOI: 10.1111/cgf.15003. Online publication date: 29-Jan-2024
    • (2024) Supporting Guided Exploratory Visual Analysis on Time Series Data with Reinforcement Learning. IEEE Transactions on Visualization and Computer Graphics 30:1 (1172-1182). DOI: 10.1109/TVCG.2023.3327200. Online publication date: 1-Jan-2024
    • (2024) Guided SQL-Based Data Exploration with User Feedback. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (4884-4896). DOI: 10.1109/ICDE60146.2024.00372. Online publication date: 13-May-2024
    • (2024) Exploiting Metadata for Intelligent and Secure JSON REST API Services. Proceedings of World Conference on Information Systems for Business Management (135-149). DOI: 10.1007/978-981-99-8346-9_12. Online publication date: 1-Mar-2024
    • (2023) Yapay Zeka Algoritmaları Kullanılarak Öğrencilerin Akademik Başarısı ile Stres İlişkisinin Keşifsel Bir Analizi [An Exploratory Analysis of the Relationship between Students' Academic Achievement and Stress Using Artificial Intelligence Algorithms]. Journal of Information Systems and Management Research 5:2 (10-20). DOI: 10.59940/jismar.1404452. Online publication date: 30-Dec-2023
    • (2023) FEDEX. Proceedings of the VLDB Endowment 15:13 (3854-3868). DOI: 10.14778/3565838.3565841. Online publication date: 20-Jan-2023
    • (2023) Lodestar: Supporting rapid prototyping of data science workflows through data-driven analysis recommendations. Information Visualization 23:1 (21-39). DOI: 10.1177/14738716231190429. Online publication date: 14-Aug-2023
    • (2023) HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation. Proceedings of the ACM on Management of Data 1:1 (1-26). DOI: 10.1145/3588945. Online publication date: 30-May-2023
    • (2023) Slide4N: Creating Presentation Slides from Computational Notebooks with Human-AI Collaboration. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (1-18). DOI: 10.1145/3544548.3580753. Online publication date: 19-Apr-2023
    • (2022) EDA4SUM. Proceedings of the VLDB Endowment 15:12 (3590-3593). DOI: 10.14778/3554821.3554851. Online publication date: Aug-2022
