DOI: 10.1145/3035918.3064019

Controlling False Discoveries During Interactive Data Exploration

Published: 09 May 2017

Abstract

Recent tools for interactive data exploration significantly increase the chance that users make false discoveries. They allow users to (visually) examine many hypotheses and draw inferences with simple interactions, and thus incur the issue commonly known in statistics as the "multiple hypothesis testing error." In this work, we propose a solution that integrates the control of multiple hypothesis testing into interactive data exploration systems. A key insight is that existing methods for controlling false discoveries (such as those that control the false discovery rate, FDR) are not directly applicable to interactive data exploration. We therefore discuss a set of new control procedures that are better suited to this task and integrate them into our system, QUDE. Via extensive experiments on both real-world and synthetic data sets, we demonstrate how QUDE can help experts and novice users alike to efficiently control false discoveries.
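
To make the "multiple hypothesis testing error" concrete, the sketch below first computes how quickly the chance of at least one false discovery grows when m independent true-null hypotheses are each tested at level α, namely 1 - (1 - α)^m. It then shows a minimal version of α-investing (listed among the author tags), a sequential procedure in which each test stakes part of a wealth budget and each rejection earns some back, so the number of hypotheses need not be known in advance. This is an illustrative sketch, not QUDE's implementation: the function names, the `fraction` bidding policy, and the choice of initial wealth and payout both equal to `alpha` are assumptions made for the example.

```python
def family_wise_error_rate(alpha, m):
    """Chance of at least one false rejection among m independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** m


def alpha_investing(p_values, alpha=0.05, fraction=0.5):
    """Sequentially decide which hypotheses to reject (hypothetical helper).

    Assumed policy: initial wealth and per-rejection payout both equal alpha,
    and each test risks a fixed fraction (0 < fraction <= 1) of current wealth.
    """
    wealth = alpha
    decisions = []
    for p in p_values:
        stake = fraction * wealth      # wealth lost if this test does not reject
        level = stake / (1.0 + stake)  # chosen so that level / (1 - level) == stake
        if p <= level:
            wealth += alpha            # a rejection earns the payout
            decisions.append(True)
        else:
            wealth -= stake            # a non-rejection costs the stake
            decisions.append(False)
    return decisions


if __name__ == "__main__":
    # Error inflation for a growing number of unadjusted tests at alpha = 0.05.
    for m in (1, 5, 20, 100):
        print(m, round(family_wise_error_rate(0.05, m), 3))
    # Decisions for a short, made-up sequence of p-values.
    print(alpha_investing([0.001, 0.5, 0.5, 0.02]))
```

Because each test level depends on the wealth remaining at that point, the outcome depends on the order of the tests: under the assumed policy, a hypothesis with p = 0.02 is rejected when it is tested first, but not when it follows non-rejections that have depleted the wealth.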

Published In

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
May 2017
1810 pages
ISBN: 9781450341974
DOI: 10.1145/3035918
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. alpha investing
  2. bonferroni
  3. data analytics
  4. false discovery control
  5. false discovery rate
  6. family-wise error rate
  7. hypothesis testing
  8. interactive data exploration
  9. multiple comparisons problem
  10. multiple hypothesis error
  11. visualization

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'17

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 244
  • Downloads (Last 6 weeks): 17
Reflects downloads up to 30 Aug 2024

Citations

Cited By

  • (2024) Odds and Insights: Decision Quality in Exploratory Data Analysis Under Uncertainty. Proceedings of the CHI Conference on Human Factors in Computing Systems, pages 1-14. DOI: 10.1145/3613904.3641995. Online publication date: 11-May-2024
  • (2024) EVM: Incorporating Model Checking into Exploratory Visual Analysis. IEEE Transactions on Visualization and Computer Graphics, 30:1, pages 208-218. DOI: 10.1109/TVCG.2023.3326516. Online publication date: 1-Jan-2024
  • (2024) Opportunities and Challenges in Data-Centric AI. IEEE Access, 12, pages 33173-33189. DOI: 10.1109/ACCESS.2024.3369417. Online publication date: 2024
  • (2023) Visual Belief Elicitation Reduces the Incidence of False Discovery. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1-17. DOI: 10.1145/3544548.3580808. Online publication date: 19-Apr-2023
  • (2023) What Exactly is an Insight? A Literature Review. 2023 IEEE Visualization and Visual Analytics (VIS), pages 91-95. DOI: 10.1109/VIS54172.2023.00027. Online publication date: 21-Oct-2023
  • (2023) Visual Data Exploration as a Statistical Testing Procedure: Within-View and Between-View Multiple Comparisons. IEEE Transactions on Visualization and Computer Graphics, 29:9, pages 3937-3948. DOI: 10.1109/TVCG.2022.3175532. Online publication date: 1-Sep-2023
  • (2023) Efficient and robust active learning methods for interactive database exploration. The VLDB Journal, 33:4, pages 931-956. DOI: 10.1007/s00778-023-00816-x. Online publication date: 16-Nov-2023
  • (2023) Data collection and quality challenges in deep learning: a data-centric AI perspective. The VLDB Journal — The International Journal on Very Large Data Bases, 32:4, pages 791-813. DOI: 10.1007/s00778-022-00775-9. Online publication date: 3-Jan-2023
  • (2022) Significance and Coverage in Group Testing on the Social Web. Proceedings of the ACM Web Conference 2022, pages 3052-3060. DOI: 10.1145/3485447.3512025. Online publication date: 25-Apr-2022
  • (2022) Optimizing Data Coverage and Significance in Multiple Hypothesis Testing on User Groups. Transactions on Large-Scale Data- and Knowledge-Centered Systems LI, pages 64-96. DOI: 10.1007/978-3-662-66111-6_3. Online publication date: 8-Oct-2022