Selection Bias Tracking and Detailed Subset Comparison for High-Dimensional Data

Borland, David; Wang, Wenyuan; Zhang, Jonathan; Shrestha, Joshua; Gotz, David

doi:10.1109/TVCG.2019.2934209

arXiv:1906.07625 (cs)

[Submitted on 18 Jun 2019 (v1), last revised 17 Jun 2020 (this version, v3)]

Abstract: The collection of large, complex datasets has become common across a wide variety of domains. Visual analytics tools increasingly play a key role in exploring and answering complex questions about these large datasets. However, many visualizations are not designed to concurrently visualize the large number of dimensions present in complex datasets (e.g., tens of thousands of distinct codes in an electronic health record system). This fact, combined with the ability of many visual analytics systems to enable rapid, ad-hoc specification of groups, or cohorts, of individuals based on a small subset of visualized dimensions, leads to the possibility of introducing selection bias: when the user creates a cohort based on a specified set of dimensions, differences across many other unseen dimensions may also be introduced. These unintended side effects may result in the cohort no longer being representative of the larger population intended to be studied, which can negatively affect the validity of subsequent analyses. We present techniques for selection bias tracking and visualization that can be incorporated into high-dimensional exploratory visual analytics systems, with a focus on medical data with existing data hierarchies. These techniques include: (1) tree-based cohort provenance and visualization, with a user-specified baseline cohort that all other cohorts are compared against, and visual encoding of the drift for each cohort, which indicates where selection bias may have occurred, and (2) a set of visualizations, including a novel icicle-plot-based visualization, to compare in detail the per-dimension differences between the baseline and a user-specified focus cohort. These techniques are integrated into a medical temporal event sequence visual analytics tool. We present example use cases and report findings from domain expert user interviews.
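
To make the idea of per-dimension drift between a baseline cohort and a focus cohort concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not name a distance measure, so the use of the Hellinger distance here is an assumption, as are the function names (`distribution`, `hellinger`, `per_dimension_drift`) and the toy record format (one dict of categorical values per individual).

```python
from collections import Counter
from math import sqrt


def distribution(values):
    """Relative frequency of each category in a list of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    0 means identical distributions, 1 means no shared categories.
    (Assumed measure; the abstract does not specify one.)
    """
    keys = set(p) | set(q)
    s = sum((sqrt(p.get(k, 0.0)) - sqrt(q.get(k, 0.0))) ** 2 for k in keys)
    return sqrt(s / 2)


def per_dimension_drift(baseline, cohort):
    """Compare a cohort against the baseline on every baseline dimension.

    baseline, cohort: lists of dict records (hypothetical format).
    Returns {dimension: Hellinger distance}.
    """
    dims = set().union(*(r.keys() for r in baseline))
    return {
        d: hellinger(
            distribution([r.get(d) for r in baseline]),
            distribution([r.get(d) for r in cohort]),
        )
        for d in dims
    }


# Toy data: the user filters on "diabetes", which was visualized;
# "sex" stands in for an unseen dimension that may drift as a side effect.
baseline = [
    {"sex": "F", "diabetes": "yes"},
    {"sex": "M", "diabetes": "no"},
    {"sex": "F", "diabetes": "no"},
    {"sex": "M", "diabetes": "yes"},
]
focus = [r for r in baseline if r["diabetes"] == "yes"]

drift = per_dimension_drift(baseline, focus)
for dim, dist in sorted(drift.items()):
    print(f"{dim}: {dist:.3f}")
```

In this toy case the filter changes the distribution of `diabetes` but leaves `sex` balanced, so only the filtered dimension shows drift. With real data, a nonzero distance on a dimension the user never looked at is the kind of signal the described techniques aim to surface.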

Comments:	IEEE Transactions on Visualization and Computer Graphics (TVCG), Volume 26 Issue 1, 2020. Also part of proceedings for IEEE VAST 2019
Subjects:	Human-Computer Interaction (cs.HC)
Cite as:	arXiv:1906.07625 [cs.HC]
	(or arXiv:1906.07625v3 [cs.HC] for this version)
	https://doi.org/10.48550/arXiv.1906.07625
Related DOI:	https://doi.org/10.1109/TVCG.2019.2934209

Submission history

From: David Gotz
[v1] Tue, 18 Jun 2019 14:58:52 UTC (1,574 KB)
[v2] Thu, 11 Jul 2019 18:15:00 UTC (6,237 KB)
[v3] Wed, 17 Jun 2020 19:42:14 UTC (6,238 KB)
