Controlling False Discoveries During Interactive Data Exploration

Authors:

Lorenzo De Stefani,

Emanuel Zgraggen,

Carsten Binnig,

Tim Kraska

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

Pages 527 - 540

https://doi.org/10.1145/3035918.3064019

Published: 09 May 2017

Abstract

Recent tools for interactive data exploration significantly increase the chance that users make false discoveries. They allow users to (visually) examine many hypotheses and make inferences with simple interactions, and thus incur the issue commonly known in statistics as the "multiple hypothesis testing error." In this work, we propose a solution to integrate the control of multiple hypothesis testing into interactive data exploration systems. A key insight is that existing methods for controlling the false discovery rate (such as FDR) are not directly applicable to interactive data exploration. We therefore discuss a set of new control procedures that are better suited for this task and integrate them in our system, QUDE. Via extensive experiments on both real-world and synthetic data sets, we demonstrate how QUDE can help experts and novice users alike to efficiently control false discoveries.
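
To make the multiple hypothesis testing error concrete, the following is a minimal Python sketch and not QUDE's implementation. The function names `familywise_error` and `benjamini_hochberg` are illustrative assumptions, and the first function assumes independent tests whose null hypotheses are all true. The second function is the standard Benjamini-Hochberg step-up procedure, a batch method that ranks the complete set of p-values at once.

```python
from typing import List, Sequence


def familywise_error(alpha: float, m: int) -> float:
    """Chance of at least one false discovery across m independent
    tests of true null hypotheses, each run at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Standard Benjamini-Hochberg step-up procedure.

    Returns one flag per input p-value (True = rejected). The procedure
    needs the whole batch of p-values because it ranks them first.
    """
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank k with p_(k) <= (k / m) * alpha.
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Twenty uncorrected tests at alpha = 0.05 already make a false
    # discovery more likely than not.
    print(round(familywise_error(0.05, 20), 3))

    # A single hypothesis with p = 0.03 is rejected on its own ...
    print(benjamini_hochberg([0.03]))
    # ... but the same p-value is no longer rejected once three more
    # hypotheses join the batch.
    print(benjamini_hochberg([0.03, 0.6, 0.7, 0.8]))
```

The last two calls show that a decision under a batch procedure can change as further hypotheses are tested, which suggests one reason such procedures fit poorly with exploration sessions where hypotheses arrive one at a time.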


Published In

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

May 2017

1810 pages

ISBN:9781450341974

DOI:10.1145/3035918

General Chairs: Rada Chirkova (North Carolina State University, USA), Jun Yang (Duke University, USA)

Program Chair: Dan Suciu (University of Washington, USA)

Copyright © 2017 ACM.


Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

SIGMOD/PODS'17: International Conference on Management of Data

Sponsor: SIGMOD

May 14 - 19, 2017

Chicago, Illinois, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
