
Evaluating the Effects of Missing Values and Mixed Data Types on Social Sequence Clustering Using t-SNE Visualization

Authors:

Annika Todd

Journal of Data and Information Quality (JDIQ), Volume 11, Issue 2

Article No.: 7, Pages 1 - 22

https://doi.org/10.1145/3301294

Published: 06 March 2019


Abstract

The goal of this work is to investigate the impact of missing values on clustering joint categorical social sequences. Identifying patterns in sociodemographic longitudinal data is important in a number of social science settings. However, performing analytical operations, such as clustering on life course trajectories, is challenging due to the categorical and multidimensional nature of the data, their mixed data types, and corruption by missing and inconsistent values. Data quality issues were previously investigated on single-variable sequences. To understand their effects on multivariate sequence analysis, we employ a dataset of mixed data types and missing values, a dissimilarity measure designed for joint categorical sequence data, and dimensionality reduction methodologies in a systematic design of sequence clustering experiments. Given the categorical nature of our data, we employ an “edit” distance using optimal matching. Because each data record has multiple variables of different types, we investigate the impact of mixing these variables in a single dissimilarity measure. Comparing variables with binary values and those with multiple nominal values, we find that overcoming missing data problems is more difficult in the nominal domain than in the binary domain. Additionally, alignment of leading missing values can result in systematic biases in dissimilarity matrices and subsequently introduce both artificial clusters and unrealistic interpretations of associated data domains. We demonstrate the use of t-distributed stochastic neighbor embedding (t-SNE) to visually guide mitigation of such biases by tuning the missing value substitution cost parameter or determining an optimal sequence span.
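The optimal matching distance and the missing value substitution cost mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the cost defaults, the use of `None` as the missing-state marker, and the choice to charge nothing when two missing states align are all assumptions made for illustration.

```python
from typing import Hashable, List, Optional, Sequence

State = Optional[Hashable]  # None is a hypothetical marker for a missing state


def om_distance(
    a: Sequence[State],
    b: Sequence[State],
    sub_cost: float = 2.0,      # assumed cost of substituting one observed state for another
    indel_cost: float = 1.0,    # assumed cost of one insertion or deletion
    missing_cost: float = 1.0,  # assumed, tunable cost of substituting missing for observed
) -> float:
    """Edit distance between two categorical sequences via dynamic programming."""

    def cost(x: State, y: State) -> float:
        if x is None and y is None:
            # Assumption: two aligned missing states cost nothing, so sequences
            # sharing leading missing values look artificially similar.
            return 0.0
        if x is None or y is None:
            return missing_cost
        return 0.0 if x == y else sub_cost

    m = len(b)
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(m + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * indel_cost] + [0.0] * m
        for j in range(1, m + 1):
            curr[j] = min(
                prev[j] + indel_cost,                     # delete a[i-1]
                curr[j - 1] + indel_cost,                 # insert b[j-1]
                prev[j - 1] + cost(a[i - 1], b[j - 1]),   # substitute or match
            )
        prev = curr
    return prev[m]


def dissimilarity_matrix(seqs: Sequence[Sequence[State]], **costs: float) -> List[List[float]]:
    """Symmetric pairwise optimal matching distances with a zero diagonal."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = om_distance(seqs[i], seqs[j], **costs)
    return d


# Hypothetical toy sequences: two start with missing values, one is fully observed.
seqs = [
    [None, None, "S", "M"],
    [None, None, "M", "M"],
    ["S", "S", "S", "M"],
]
```

Varying `missing_cost` changes how strongly missing states separate or merge sequences in the resulting matrix. A matrix like this could then be passed to a t-SNE implementation that accepts precomputed distances (for example, scikit-learn's `TSNE` with `metric="precomputed"` and `init="random"`) to inspect the effect visually.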

Supplementary Material

PDF File (a7-lazar-suppl.pdf)

Supplemental movie and image files for "Evaluating the Effects of Missing Values and Mixed Data Types on Social Sequence Clustering Using t-SNE Visualization"

8.35 MB

ZIP File (a7-lazar.zip)

2.04 MB

