Overview of the state-of-the-art Time Series Clustering based on literature study; distance metrics, prototypes, time-series preprocessing, and clustering algorithms
4. Background
• High dimensionality
• Irregular lengths
• Noise and time shifts
[Figure: example time series; x-axis: time (s), y-axis: variable]
A time series is a collection of observations made sequentially in time.
15. Which Distance Measure to Use
• Type of the data
• Research questions
Criteria | Euclidean | DTW
Supports time series length differences | No | Yes
Supports time series time shifts | No | Yes
Computational cost | Low | High
29. Clustering
Type | Clustering algorithm | Distance measure | Prototype
Partitional | K-means / K-medoids | Euclidean / Manhattan | Mean / PAM
Partitional | TADPole | DTW | DBA
Partitional | k-Shape | SBD | Shape extraction
Hierarchical | Agglomerative | All | All
[Figure: time series data, a distance measure and a prototype are the inputs of the clustering algorithm, which outputs N clusters]
44. DTW
Right combination of distance measure & prototype
Conclusions
Type | Clustering algorithm | Distance measure | Prototype
Partitional | K-means / K-medoids | Euclidean / Manhattan | Mean / PAM
Partitional | TADPole | DTW | DBA
Partitional | k-Shape | SBD | Shape extraction
Hierarchical | Agglomerative | All | All
Editor's Notes
Distance measures quantify the dissimilarity between two time series.
Classifying objects into clusters requires a method for measuring the distance, or (dis)similarity, between the objects.
The term proximity refers to either similarity or dissimilarity. The term distance is frequently used as a synonym for dissimilarity.
Variable for
Recent years have seen a surge of interest in time series clustering.
Data characteristics are evolving and traditional clustering algorithms are becoming less popular in time series clustering.
The most commonly used distance measures are only defined for series of equal length and are sensitive to noise, scale and time shifts.
Many distance measures tailored to time series have therefore been developed to overcome these limitations, as well as other challenges tied to the structure of time series, such as multiple variables and serial correlation.
each
Goal is to put them all together in clusters
Input in customer segmentation
Mention about chicken segmentation
Behavior based on purchases, bank transactions, energy and other utilities usage/consumption, social networks (who is connected to whom)
Hierarchy of classes dendrogram
https://en.wikipedia.org/wiki/Taxicab_geometry
The distance between two points measured along axes at right angles.
Also known as Manhattan length, rectilinear distance, Minkowski's L1 distance, L1 norm, taxi cab metric, snake distance, city block distance
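To make the definition concrete, here is a minimal sketch of the Manhattan distance next to the Euclidean distance for equal-length sequences. The function names `manhattan` and `euclidean` are illustrative and do not come from any particular library.

```python
import math


def manhattan(x, y):
    """L1 distance: sum of absolute differences along each axis."""
    if len(x) != len(y):
        raise ValueError("both inputs must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """L2 distance: length of the straight line between the two points."""
    if len(x) != len(y):
        raise ValueError("both inputs must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


p, q = (0, 0), (3, 4)
print(manhattan(p, q))  # 7: three blocks along one axis, four along the other
print(euclidean(p, q))  # 5.0: the straight-line shortcut
```

Both functions refuse inputs of different lengths, which matches the limitation listed for Euclidean distance in the comparison table.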
Correlation measures are only useful if/when the relationship between attributes is linear. So if the correlation is 0, then there is no linear relationship between the two data objects.
http://cs.tsu.edu/ghemri/CS497/ClassNotes/ML/Similarity%20Measures.pdf
Be ready to explain Pearson and Spearman correlation
Applies when time series have different lengths
One of the most widely used measures of similarity between two time series
Originally designed for automatic speech recognition
Finds the optimal global alignment between two time series, exploiting temporal distortions between them
Designed especially for time series analysis
Ignores shifts in the time dimension
Ignores differences in speed between the two time series
How is it calculated?
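One common way to answer this is the textbook dynamic-programming formulation, sketched below. It is an illustration under stated assumptions rather than the exact variant used in the slides: the local cost is the absolute difference between two values, no warping-window constraint is applied, and the name `dtw_distance` is made up for this example.

```python
def dtw_distance(x, y):
    """Accumulated cost of the optimal warping path between x and y."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i values of x
    # with the first j values of y
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])  # assumed local cost
            cost[i][j] = local + min(
                cost[i - 1][j],      # x advances, y repeats its value
                cost[i][j - 1],      # y advances, x repeats its value
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


a = [0, 1, 2, 1, 0]
b = [0, 0, 1, 2, 1, 0]  # same shape, delayed by one step, one value longer
print(dtw_distance(a, b))  # 0.0: the delay is absorbed by the warping
```

The nested loops fill an n-by-m table, which is where the high computational cost of DTW relative to Euclidean distance comes from.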
https://www.datanovia.com/en/lessons/clustering-distance-measures/
For example, correlation-based distance is often used in gene expression data analysis.
Correlation-based distance considers two objects to be similar if their features are highly correlated, even though the observed values may be far apart in terms of Euclidean distance.
In most clustering packages, Euclidean distance is the default.
If we want to identify clusters of observations with the same overall profile regardless of their magnitudes, then correlation-based distance is the appropriate choice.
When using correlation, note that Pearson's correlation is quite sensitive to outliers.
Commonly used in:
• gene expression data analysis
• marketing, if we want to identify groups of shoppers with the same preferences in terms of items, regardless of the volume of items they bought
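The contrast between profile and magnitude can be sketched as follows, assuming the common definition of correlation-based distance as 1 minus Pearson's correlation. The helper names and the two toy series are illustrative only.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant sequence has zero spread, so the division would fail
    return cov / (spread_x * spread_y)


def correlation_distance(x, y):
    """Assumed definition: 1 - r, so perfectly correlated profiles are at 0."""
    return 1.0 - pearson(x, y)


small = [1, 2, 3, 2, 1]
large = [10, 20, 30, 20, 10]  # same profile, ten times the magnitude

print(correlation_distance(small, large))  # close to 0
print(math.dist(small, large))             # large Euclidean distance
```

The two toy series are far apart in Euclidean terms but almost identical under the correlation-based distance, which is the behavior described in the notes above.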
Gamma is the optimization function.
A is the alignment function
The number of clusters is defined beforehand
Compute the distance between each point and the centroids and keep the minimum
Predict: for each data point, calculate the distance to both centroids and assign the point to the cluster with the minimum distance
Move each centroid to the mean of the points assigned to it, so that it sits at the center of its cluster
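The steps above can be sketched as a small k-means loop. This is a minimal illustration assuming Euclidean distance, two clusters and hand-picked starting centroids; the function names and the toy points are not from the slides.

```python
import math


def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    moved = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            moved.append(old)  # an empty cluster keeps its old centroid
            continue
        moved.append(
            tuple(sum(p[d] for p in members) / len(members) for d in range(len(old)))
        )
    return moved


def kmeans(points, centroids, max_iter=100):
    """Alternate both steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
        labels = assign(points, centroids)
    return labels, centroids


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)  # one centroid near each group of three points
```

For time series, `math.dist` would be swapped for a suitable distance measure and the mean for a matching prototype, in line with the combinations listed in the clustering table.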
Hierarchy of classes dendrogram
Each character starts in its own cluster
Input = genetic code
Selma + Patty: twins
Lisa + Marge: mother and daughter (less similar, because Lisa also shares genetic code with Homer Simpson)
Selma + Patty: sisters of Marge
Number of clusters and order of clustering
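The bottom-up procedure in this example can be sketched as below. The dissimilarity values are invented purely for illustration (they are not real genetic distances), single linkage is an assumption, and the function name `agglomerative` is hypothetical.

```python
def agglomerative(names, dist, linkage=min):
    """Merge the two closest clusters until one remains; return the merges.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    """
    clusters = [(name,) for name in names]  # every item starts alone
    merges = []

    def between(c1, c2):
        return linkage(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        best, i, j = min(
            (between(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], best))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Invented dissimilarities, for illustration only
toy = {
    ("Patty", "Selma"): 0.1,
    ("Marge", "Lisa"): 0.3,
    ("Patty", "Marge"): 0.5,
    ("Selma", "Marge"): 0.5,
    ("Patty", "Lisa"): 0.8,
    ("Selma", "Lisa"): 0.8,
}
dist = {frozenset(pair): value for pair, value in toy.items()}

merges = agglomerative(["Patty", "Selma", "Marge", "Lisa"], dist)
for left, right, height in merges:
    print(left, "+", right, "at", height)
```

The order of the merges and the heights at which they happen are what a dendrogram draws; cutting the tree at a chosen height gives the number of clusters.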
A: number of pairs of time series assigned to the same cluster that belong to the same class
B: number of pairs of time series assigned to different clusters that belong to different classes
C: number of pairs of time series assigned to different clusters that belong to the same class
D: number of pairs of time series assigned to the same cluster that belong to different classes
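These four counts are the ingredients of the Rand index, (A + B) / (A + B + C + D). Assuming that is the evaluation measure intended here, a minimal sketch of the computation looks like this; the function names and the toy labels are illustrative.

```python
from itertools import combinations


def pair_counts(clusters, classes):
    """Count pairs of series by cluster agreement and class agreement."""
    a = b = c = d = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            a += 1  # A: same cluster, same class
        elif not same_cluster and not same_class:
            b += 1  # B: different cluster, different class
        elif same_class:
            c += 1  # C: different cluster, same class
        else:
            d += 1  # D: same cluster, different class
    return a, b, c, d


def rand_index(clusters, classes):
    """Share of pairs on which clustering and true classes agree.

    Needs at least two series, otherwise there are no pairs to count.
    """
    a, b, c, d = pair_counts(clusters, classes)
    return (a + b) / (a + b + c + d)


clusters = [0, 0, 1, 1]           # labels produced by a clustering algorithm
classes = ["x", "x", "x", "y"]    # known ground-truth classes
print(pair_counts(clusters, classes))  # (1, 2, 2, 1)
print(rand_index(clusters, classes))   # 0.5
```

A value of 1 means the clustering reproduces the classes exactly, while lower values mean more pairs fall into C or D.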