DOI: 10.1145/3318464.3389779
short-paper

Automatically Generating Data Exploration Sessions Using Deep Reinforcement Learning

Published: 31 May 2020

Abstract

Exploratory Data Analysis (EDA) is an essential yet highly demanding task. To get a head start before exploring a new dataset, data scientists often prefer to view existing EDA notebooks: illustrative, curated exploratory sessions on the same dataset, created and shared online by fellow data scientists. Unfortunately, such notebooks are not always available (e.g., if the dataset is new or confidential). To address this, we present ATENA, a system that takes an input dataset and auto-generates a compelling exploratory session, presented as an EDA notebook. We formulate EDA as a control problem and devise a novel Deep Reinforcement Learning (DRL) architecture to effectively optimize the notebook generation. Although ATENA uses a limited set of EDA operations, our experiments show that it generates useful EDA notebooks that allow users to gain actual insights.
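
To make the "EDA as a control problem" framing concrete, the following is a minimal sketch and not ATENA's implementation. It assumes a hypothetical action set (group by a column, filter a column by value), a placeholder reward, made-up toy data, and a random policy standing in for a trained DRL agent; the names `ToyEDAEnv`, `run_episode`, and `random_policy` are illustrative and do not come from the paper.

```python
import random
from collections import Counter

# Toy table; the rows and column names are made up for illustration.
ROWS = [
    {"airline": "AA", "origin": "JFK", "delayed": "yes"},
    {"airline": "AA", "origin": "LAX", "delayed": "no"},
    {"airline": "DL", "origin": "JFK", "delayed": "no"},
    {"airline": "DL", "origin": "ATL", "delayed": "yes"},
    {"airline": "UA", "origin": "LAX", "delayed": "no"},
]


class ToyEDAEnv:
    """EDA as a control problem: the state is the current view of the data,
    an action is one EDA operation, and an episode is one notebook."""

    def __init__(self, rows, max_steps=3):
        self.rows = rows
        self.columns = sorted(rows[0])
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        self.view = list(self.rows)  # rows currently displayed
        self.notebook = []           # (action, result) pairs produced so far
        return self.observe()

    def observe(self):
        # Compact state: row count plus distinct-value count per column.
        return (len(self.view),) + tuple(
            len({r[c] for r in self.view}) for c in self.columns
        )

    def actions(self):
        # Hypothetical action set: group by a column, or filter column == value.
        acts = [("group", c, None) for c in self.columns]
        for c in self.columns:
            acts += [("filter", c, v) for v in sorted({r[c] for r in self.view})]
        return acts

    def step(self, action):
        kind, col, val = action
        if kind == "filter":
            before = len(self.view)
            self.view = [r for r in self.view if r[col] == val]
            result = len(self.view)
            # Placeholder reward: the filter should narrow the view.
            reward = 1.0 if 0 < result < before else 0.0
        else:
            result = dict(Counter(r[col] for r in self.view))
            # Placeholder reward: the grouping should split the view.
            reward = 1.0 if len(result) > 1 else 0.0
        self.notebook.append((action, result))
        done = len(self.notebook) >= self.max_steps
        return self.observe(), reward, done


def run_episode(env, policy):
    """Roll out one session; `policy` maps (observation, actions) to an action."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs, env.actions()))
        total += reward
    return env.notebook, total


def random_policy(obs, actions):
    # Stand-in for a trained agent.
    return random.choice(actions)
```

Calling `run_episode(ToyEDAEnv(ROWS), random_policy)` yields a three-step "notebook" of operations and their results together with the episode's total reward. In a learned setting, the random policy would be replaced by an agent trained to maximize that reward.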

Supplementary Material

MP4 File (3318464.3389779.mp4)
Presentation Video

Published In

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
June 2020
2925 pages
ISBN: 9781450367356
DOI: 10.1145/3318464

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. EDA
  2. EDA notebooks
  3. auto EDA
  4. auto generated
  5. autogenerated
  6. data exploration
  7. interactive data analysis
  8. notebooks

Qualifiers

  • Short-paper

Conference

SIGMOD/PODS '20
Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Cited By

  • (2024) Guided Exploration of Industrial Sensor Data. Computer Graphics Forum 43:1. DOI: 10.1111/cgf.15003. Online publication date: 29-Jan-2024
  • (2024) Supporting Guided Exploratory Visual Analysis on Time Series Data with Reinforcement Learning. IEEE Transactions on Visualization and Computer Graphics 30:1 (1172-1182). DOI: 10.1109/TVCG.2023.3327200. Online publication date: 1-Jan-2024
  • (2024) Guided SQL-Based Data Exploration with User Feedback. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (4884-4896). DOI: 10.1109/ICDE60146.2024.00372. Online publication date: 13-May-2024
  • (2024) Exploiting Metadata for Intelligent and Secure JSON REST API Services. Proceedings of World Conference on Information Systems for Business Management (135-149). DOI: 10.1007/978-981-99-8346-9_12. Online publication date: 1-Mar-2024
  • (2023) Yapay Zeka Algoritmaları Kullanılarak Öğrencilerin Akademik Başarısı ile Stres İlişkisinin Keşifsel Bir Analizi [An Exploratory Analysis of the Relationship Between Students' Academic Achievement and Stress Using Artificial Intelligence Algorithms]. Journal of Information Systems and Management Research 5:2 (10-20). DOI: 10.59940/jismar.1404452. Online publication date: 30-Dec-2023
  • (2023) FEDEX. Proceedings of the VLDB Endowment 15:13 (3854-3868). DOI: 10.14778/3565838.3565841. Online publication date: 20-Jan-2023
  • (2023) Lodestar: Supporting rapid prototyping of data science workflows through data-driven analysis recommendations. Information Visualization 23:1 (21-39). DOI: 10.1177/14738716231190429. Online publication date: 14-Aug-2023
  • (2023) HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation. Proceedings of the ACM on Management of Data 1:1 (1-26). DOI: 10.1145/3588945. Online publication date: 30-May-2023
  • (2023) Slide4N: Creating Presentation Slides from Computational Notebooks with Human-AI Collaboration. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (1-18). DOI: 10.1145/3544548.3580753. Online publication date: 19-Apr-2023
  • (2022) EDA4SUM. Proceedings of the VLDB Endowment 15:12 (3590-3593). DOI: 10.14778/3554821.3554851. Online publication date: Aug-2022
