Enabling data science for the majority

Published: 01 August 2019

Abstract

Despite great strides in the generation, collection, and processing of data at scale, data science is still extremely inconvenient for the vast majority of the population. The driving goal of our research over the past half decade has been to make it easy for individuals and teams, regardless of programming or analysis expertise, to manage, analyze, make sense of, and draw insights from large datasets. In this article, we reflect on a comprehensive suite of tools that we have been building to empower everyone to perform data science more efficiently and effortlessly, including DataSpread, a scalable spreadsheet tool that combines the benefits of spreadsheets and databases, and ZenVisage, a visual exploration tool that accelerates the discovery of trends or patterns. Our tools have been developed in collaboration with experts in various disciplines, including neuroscience, battery science, genomics, astrophysics, and ad analytics. We discuss some of the key technical challenges underlying the development of these tools and how we addressed them, drawing on ideas from multiple disciplines. In the process, we outline a research agenda for tool development to empower everyone to tap into the hidden potential in their datasets at scale.

Published In

Proceedings of the VLDB Endowment, Volume 12, Issue 12
August 2019
547 pages

Publisher

VLDB Endowment
