research-article

Evaluating the Effects of Missing Values and Mixed Data Types on Social Sequence Clustering Using t-SNE Visualization

Published: 06 March 2019

Abstract

The goal of this work is to investigate the impact of missing values on clustering joint categorical social sequences. Identifying patterns in sociodemographic longitudinal data is important in a number of social science settings. However, performing analytical operations, such as clustering on life course trajectories, is challenging due to the categorical and multidimensional nature of the data, their mixed data types, and corruption by missing and inconsistent values. Data quality issues were previously investigated on single-variable sequences. To understand their effects on multivariate sequence analysis, we employ a dataset of mixed data types and missing values, a dissimilarity measure designed for joint categorical sequence data, and dimensionality reduction methodologies in a systematic design of sequence clustering experiments. Given the categorical nature of our data, we employ an “edit” distance using optimal matching. Because each data record has multiple variables of different types, we investigate the impact of mixing these variables in a single dissimilarity measure. Between variables with binary values and those with multiple nominal values, we find that overcoming missing data problems is more difficult in the nominal domain than in the binary domain. Additionally, alignment of leading missing values can result in systematic biases in dissimilarity matrices and subsequently introduce both artificial clusters and unrealistic interpretations of associated data domains. We demonstrate the use of t-distributed stochastic neighbor embedding (t-SNE) to visually guide mitigation of such biases by tuning the missing value substitution cost parameter or determining an optimal sequence span.
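To make the role of the missing value substitution cost concrete, the following is a minimal sketch of an optimal matching (edit) distance in which substitutions involving a missing state have their own cost. It is not the authors' implementation: the function name `om_distance`, the `MISSING` marker, the default costs, and the toy sequences are all assumptions chosen for illustration.

```python
from typing import Hashable, Optional, Sequence

# Hypothetical marker for a missing state; the paper's encoding may differ.
MISSING = None


def om_distance(
    a: Sequence[Optional[Hashable]],
    b: Sequence[Optional[Hashable]],
    sub_cost: float = 2.0,          # assumed cost of substituting one observed state for another
    indel_cost: float = 1.0,        # assumed cost of one insertion or deletion
    missing_sub_cost: float = 2.0,  # assumed cost of substituting a missing state for an observed one
) -> float:
    """Edit distance between two categorical sequences via dynamic programming."""

    def substitution(x, y) -> float:
        if x is MISSING and y is MISSING:
            return 0.0  # two missing states are treated as identical
        if x is MISSING or y is MISSING:
            return missing_sub_cost
        return 0.0 if x == y else sub_cost

    # prev[j] holds the cost of aligning the first i-1 states of a with the first j states of b.
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,               # delete x from a
                curr[j - 1] + indel_cost,           # insert y into a
                prev[j - 1] + substitution(x, y),   # substitute x with y (or match)
            ))
        prev = curr
    return prev[-1]


# Toy sequences (made up): the first has two leading missing states.
observed = ["S", "S", "M", "M"]
leading_gap = [MISSING, MISSING, "M", "M"]
cheap = om_distance(leading_gap, observed, missing_sub_cost=0.5)
costly = om_distance(leading_gap, observed, missing_sub_cost=2.0)
```

In this sketch, lowering `missing_sub_cost` pulls sequences with leading gaps closer to fully observed ones, while raising it pushes them apart. This is the kind of parameter the abstract describes tuning, although the paper's actual cost scheme may differ.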

Supplementary Material

PDF File (a7-lazar-suppl.pdf)
Supplemental movie and image files for "Evaluating the Effects of Missing Values and Mixed Data Types on Social Sequence Clustering Using t-SNE Visualization"
ZIP File (a7-lazar.zip)

        Reviews

        Jonathan P. E. Hodgson

        The authors discuss how one can compensate for missing values when clustering joint categorical sequences. For example, one might have a set of sequences with both nominal values, such as family size, and binary values, such as marriage status. Usually such sequences will exhibit missing values; however, discarding these sequences will result in too few sequences to allow valid conclusions. Some missing values can be inferred (age, for example), but often this is not the case. The authors consider various ways to address missing values. The principal method used is an edit distance, which measures the number and size of the changes required to edit one sequence into another. When each entry in the sequence consists of multiple values, one can use the average of the edit distances for each category. Given a distance, one can proceed to infer a set of clusters of "like" sequences. Using a study of income dynamics as an exemplar exhibiting both binary and nominal values, the authors provide a detailed description of the process. They give experimental results comparing various choices for distance. The income dynamics sequences provide sufficient variety to allow several measures for distance. By using dimension reduction techniques, the authors are able to provide visual representations of the clusters corresponding to the distance choices. The paper is a useful guide to the available techniques, with sufficient illustrative examples so that readers can apply the ideas.
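The averaging step the review mentions can be sketched as follows: compute an edit distance per variable (channel), average across channels, and collect the results in a pairwise dissimilarity matrix. This is an illustrative sketch only. The record layout, the function names, the costs, and the toy records are assumptions, and the paper's own measure may weight channels differently.

```python
from typing import Dict, List, Sequence

# Hypothetical record layout: channel name -> sequence of categorical states.
Record = Dict[str, Sequence[str]]


def edit_distance(a: Sequence[str], b: Sequence[str],
                  sub_cost: float = 2.0, indel_cost: float = 1.0) -> float:
    """Edit distance with assumed constant substitution and indel costs."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,                        # deletion
                curr[j - 1] + indel_cost,                    # insertion
                prev[j - 1] + (0.0 if x == y else sub_cost)  # match or substitution
            ))
        prev = curr
    return prev[-1]


def joint_distance(r1: Record, r2: Record) -> float:
    """Average of the per-channel edit distances (both records share the same channels)."""
    channels = sorted(r1)
    return sum(edit_distance(r1[c], r2[c]) for c in channels) / len(channels)


def dissimilarity_matrix(records: List[Record]) -> List[List[float]]:
    """Symmetric matrix of pairwise joint distances with a zero diagonal."""
    n = len(records)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = joint_distance(records[i], records[j])
    return d


# Toy records (made up): one binary channel and one nominal channel per person.
records = [
    {"married": "NNYY", "family_size": "1123"},
    {"married": "NNYY", "family_size": "1123"},
    {"married": "NYYY", "family_size": "1223"},
]
D = dissimilarity_matrix(records)
```

A matrix like `D` could then be handed to any clustering or embedding routine that accepts precomputed dissimilarities. For example, scikit-learn's `TSNE` accepts `metric="precomputed"` with `init="random"`, though whether the authors used that library is not stated here.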

        Information

        Published In

        Journal of Data and Information Quality  Volume 11, Issue 2
        On the Horizon, Experience Paper and Regular Papers
        June 2019
        66 pages
        ISSN:1936-1955
        EISSN:1936-1963
        DOI:10.1145/3317030
        Issue’s Table of Contents
        © 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 06 March 2019
        Accepted: 01 December 2018
        Revised: 01 October 2018
        Received: 01 May 2018
        Published in JDIQ Volume 11, Issue 2

        Author Tags

        1. Joint sequence analysis
        2. data quality
        3. dimensionality reduction
        4. life trajectories
        5. missing values
        6. optimal matching
        7. t-SNE
        8. time series clustering

        Qualifiers

        • Research-article
        • Research
        • Refereed

        Funding Sources

        • U.S. Department of Energy, Office of Science

        Bibliometrics

        Article Metrics

        • Downloads (last 12 months): 27
        • Downloads (last 6 weeks): 3
        Reflects downloads up to 04 Oct 2024

        Cited By

        • (2023) PORDE: Explaining Data Poisoning Attacks Through Visual Analytics with Food Delivery App Reviews. Companion Proceedings of the 28th International Conference on Intelligent User Interfaces, 46-50. DOI: 10.1145/3581754.3584128. Online publication date: 27-Mar-2023
        • (2022) What Makes You Hold on to That Old Car? Joint Insights From Machine Learning and Multinomial Logit on Vehicle-Level Transaction Decisions. Frontiers in Future Transportation 3. DOI: 10.3389/ffutr.2022.894654. Online publication date: 4-Jul-2022
        • (2022) Sequence analysis: Its past, present, and future. Social Science Research 107, 102772. DOI: 10.1016/j.ssresearch.2022.102772. Online publication date: Sep-2022
        • (2021) Context-Based Evaluation of Dimensionality Reduction Algorithms—Experiments and Statistical Significance Analysis. ACM Transactions on Knowledge Discovery from Data 15:2, 1-40. DOI: 10.1145/3428077. Online publication date: 4-Jan-2021
        • (2021) Performance of the Gold Standard and Machine Learning in Predicting Vehicle Transactions. 2021 IEEE International Conference on Big Data (Big Data), 3700-3704. DOI: 10.1109/BigData52589.2021.9671286. Online publication date: 15-Dec-2021
        • (2020) Clustering Life Course to Understand the Heterogeneous Effects of Life Events, Gender, and Generation on Habitual Travel Modes. IEEE Access 8, 190964-190980. DOI: 10.1109/ACCESS.2020.3032328. Online publication date: 2020
        • (2019) Machine Learning for Prediction of Mid to Long Term Habitual Transportation Mode Use. 2019 IEEE International Conference on Big Data (Big Data), 4520-4524. DOI: 10.1109/BigData47090.2019.9006411. Online publication date: Dec-2019
