short-paper

Automatically Generating Data Exploration Sessions Using Deep Reinforcement Learning

Authors:

Amit Somech

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Pages 1527 - 1537

https://doi.org/10.1145/3318464.3389779

Published: 31 May 2020

Abstract

Exploratory Data Analysis (EDA) is an essential yet highly demanding task. To get a head start before exploring a new dataset, data scientists often prefer to view existing EDA notebooks -- illustrative, curated exploratory sessions, on the same dataset, that were created by fellow data scientists who shared them online. Unfortunately, such notebooks are not always available (e.g., if the dataset is new or confidential). To address this, we present ATENA, a system that takes an input dataset and auto-generates a compelling exploratory session, presented in an EDA notebook. We shape EDA into a control problem, and devise a novel Deep Reinforcement Learning (DRL) architecture to effectively optimize the notebook generation. Though ATENA uses a limited set of EDA operations, our experiments show that it generates useful EDA notebooks, allowing users to gain actual insights.
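To make the "control problem" framing concrete, the following is a minimal sketch of an exploratory session modeled as an episodic environment in which an agent picks one operation per step. Everything here is an assumption for illustration and is not taken from ATENA: the class name `ToyEDAEnv`, the three operations (filter, group, back), the state encoding, and especially the novelty-based reward, which is only a placeholder.

```python
import random
from collections import Counter


class ToyEDAEnv:
    """Hypothetical toy environment: one EDA operation per step."""

    def __init__(self, rows, max_steps=4):
        self.rows = rows            # dataset as a list of dicts
        self.max_steps = max_steps  # fixed session length
        self.reset()

    def reset(self):
        # A state is (filters, group_col); the stack supports "back".
        self.stack = [((), None)]
        self.seen = {self.stack[0]}
        self.steps = 0
        return self.stack[-1]

    def display(self, state=None):
        """Materialize what the user would see for a state."""
        filters, group_col = state if state is not None else self.stack[-1]
        view = [r for r in self.rows if all(r[c] == v for c, v in filters)]
        if group_col is None:
            return view
        return dict(Counter(r[group_col] for r in view))

    def actions(self):
        """Enumerate the (small, discrete) action set."""
        cols = sorted(self.rows[0])
        acts = [("back",)] + [("group", c) for c in cols]
        acts += [("filter", c, v)
                 for c in cols
                 for v in sorted({r[c] for r in self.rows})]
        return acts

    def step(self, action):
        filters, group_col = self.stack[-1]
        if action[0] == "back":
            if len(self.stack) > 1:
                self.stack.pop()
            state = self.stack[-1]
        else:
            if action[0] == "filter":
                state = (filters + ((action[1], action[2]),), group_col)
            else:  # "group"
                state = (filters, action[1])
            self.stack.append(state)
        # Placeholder reward (an assumption, not ATENA's): +1 for a
        # non-empty display not seen before in this session, else -1.
        novel = state not in self.seen and len(self.display(state)) > 0
        self.seen.add(state)
        self.steps += 1
        done = self.steps >= self.max_steps
        return state, (1.0 if novel else -1.0), done


rows = [
    {"airline": "AA", "delayed": "yes"},
    {"airline": "AA", "delayed": "no"},
    {"airline": "DL", "delayed": "no"},
]

# A random policy stands in for the learned DRL policy.
env = ToyEDAEnv(rows)
rng = random.Random(0)
state, done, total = env.reset(), False, 0.0
while not done:
    state, reward, done = env.step(rng.choice(env.actions()))
    total += reward
```

In this sketch the sequence of displays visited during one episode plays the role of the generated notebook; a learning agent would replace the random choice with a policy trained to maximize the cumulative reward.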

Supplementary Material

MP4 File (3318464.3389779.mp4): Presentation Video, 90.15 MB

Index Terms

1. Mathematics of computing
  1. Probability and statistics
    1. Statistical paradigms
      1. Exploratory data analysis
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Machine learning theory
      1. Reinforcement learning

Published In

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

June 2020

2925 pages

ISBN: 9781450367356

DOI: 10.1145/3318464

General Chairs: David Maier (Portland State University, USA), Rachel Pottinger (University of British Columbia, Canada)

Program Chairs: AnHai Doan (University of Wisconsin, USA), Wang-Chiew Tan (Megagon Labs, USA)

Publications Chairs: Abdussalam Alawini (University of Illinois at Urbana-Champaign, USA), Hung Q. Ngo (RelationalAI, USA)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Israel Science Foundation
Israel Innovation Authority
Binational US-Israel Science Foundation

Conference

SIGMOD/PODS '20: International Conference on Management of Data

Sponsor: SIGMOD

June 14 - 19, 2020

Portland, OR, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
