Anomaly Detection Paradigm for Multivariate Time Series Data Mining for Healthcare
Abstract
1. Introduction
1.1. Research Problem Exploration
- Providing the full join method, which eliminates the need to specify a similarity threshold.
- Recovering the top K-nearest neighbors or the nearest neighbor for each object if that neighbor is within a user-supplied threshold.
1.2. Research Significance
1.3. Research Contribution
- The NMP uses an ultrafast similarity search process based on the z-normalized Euclidean distance as a subroutine, exploiting the redundancies between overlapping subsequences to achieve dramatic speedup and low space overhead.
- The proposed NMP provides no false positives or false negatives. This property is important in many domains.
- The NMP provides lower space complexity, which helps in handling several data mining tasks (e.g., motif discovery, discord discovery, LTE discovery, semantic segmentation, and clustering).
1.4. Proposed Solution
1.5. Research Structure
2. Background and Related Literature
2.1. Univariate Anomaly Detection
2.2. Multivariate Anomaly Detection
3. Proposed Novel Matrix Profile for Similarity Search
- Inputting the Time Series (ITS);
- All-Pairs Similarity Search Process (ASSP);
- Distance Determination and Matrix Profile Index (DD&MPI).
3.1. Inputting the Time Series (ITS)
3.1.1. Data Preprocessing
3.1.2. Data Representation
- Temporal representation: storing a temporal point on the displayed data.
- Spectral representation: designing the data in the frequency domain.
- Other representations: implementing different modifications not related to the above.
- A significant drop in the data dimensionality.
- An emphasis on the essential shape features for both global and local scales.
- Lower computational costs for the computational representation.
- Better restoration quality for the reduced representation.
- Implicit noise handling or insensitivity to noise.
- Dynamic time scaling: The series G obtained by a dynamic change in the time scale is generated by:
- Additive Noise: The series G obtained by adding a noisy component to the original series is given by:
- Outliers: The series G is obtained by adding outliers at random positions. Formally, these are a given set of random time positions expressed by:
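The three transformations listed above can be illustrated with a minimal sketch. The defining equations are not reproduced here, so every concrete choice below is an assumption made for illustration: a uniform stretch factor stands in for the dynamic change of time scale, the noise is zero-mean Gaussian, and outliers are fixed-magnitude spikes. The function names `time_scale`, `add_noise`, and `add_outliers` are hypothetical.

```python
import numpy as np

def time_scale(f, factor):
    """Resample f onto a uniformly stretched time axis (assumed, simplified form)."""
    n = len(f)
    new_n = max(2, int(round(n * factor)))
    old_t = np.arange(n)
    new_t = np.linspace(0, n - 1, new_n)
    return np.interp(new_t, old_t, f)  # linear interpolation

def add_noise(f, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma (assumed noise model)."""
    return f + rng.normal(0.0, sigma, size=len(f))

def add_outliers(f, count, magnitude, rng):
    """Add spikes of +/- magnitude at `count` distinct random time positions."""
    g = f.copy()
    positions = rng.choice(len(f), size=count, replace=False)
    signs = rng.choice([-1.0, 1.0], size=count)
    g[positions] += signs * magnitude
    return g, np.sort(positions)

rng = np.random.default_rng(0)
f = np.sin(np.linspace(0, 4 * np.pi, 200))      # original series F
g_scaled = time_scale(f, 1.5)                   # 200 points stretched to 300
g_noisy = add_noise(f, 0.1, rng)
g_out, where = add_outliers(f, 5, 3.0, rng)     # `where` holds the outlier positions
```

Such perturbed copies are typically used to check how robust a reduced representation or a similarity measure is to each kind of distortion.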
3.2. All-Pairs Similarity Search Process
- Shape-based;
- Edit-based;
- Feature-based;
- Structure-based.
3.2.1. Shape-Based Similarity
3.2.2. Edit-Based Similarity
3.2.3. Feature-Based Similarity
- It uses an ultrafast similarity search process and the z-normalized Euclidean distance as a subroutine.
- It exploits the redundancies between overlapping subsequences.
- It provides lower space complexity to handle several data mining tasks.
3.2.4. Structure-Based Similarity
- Model-based distances;
- Compression-based distances.
- Model-based distances
- This approach applies a model to different series and then compares the parameters of the underlying model. Similarity can be calculated by modeling the underlying time series. This determines the probability of one time series by using another underlying model. Any type of parametric temporal model can be used. Furthermore, the distance determination process for neighboring nodes is given in Algorithm 1.
- In Algorithm 1, the distance of the neighbor node is determined. In step 1, the variables are initialized. The input and output are given at the beginning of the algorithm. In step 2, the time series length is detected from the time series data. Steps 3–4 define the memory allocation process, and the initial matrix profile and matrix profile index are stored in memory. Steps 5–7 calculate the mean value and standard deviation. Step 8 is used to perform the vector dot product. In step 9, the distance profile index is determined from the time series. In steps 2–13, the distance of each neighbor is calculated and finally stored in an array.
Algorithm 1: Distance Determination Process
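Because the listing of Algorithm 1 is not reproduced here, the following is only a sketch of the steps described in the text (series length, mean and standard deviation, dot products, distance profile), not the authors' implementation. It computes a single distance profile using the standard identity between the dot product and the z-normalized Euclidean distance. The function name `distance_profile` is hypothetical, and the sketch assumes no subsequence is constant (a zero standard deviation would cause a division by zero).

```python
import numpy as np

def distance_profile(T, i, m):
    """z-normalized Euclidean distance from T[i:i+m] to every length-m subsequence of T."""
    T = np.asarray(T, dtype=float)
    # All length-m subsequences as rows; there are len(T) - m + 1 of them.
    windows = np.lib.stride_tricks.sliding_window_view(T, m)
    mu = windows.mean(axis=1)          # mean of each subsequence
    sigma = windows.std(axis=1)        # population standard deviation of each subsequence
    qt = windows @ windows[i]          # dot product of the query with each subsequence
    # Pearson correlation recovered from the dot product, then mapped to a distance.
    corr = (qt - m * mu * mu[i]) / (m * sigma * sigma[i])
    return np.sqrt(np.maximum(2 * m * (1 - corr), 0.0))  # clip tiny negative round-off

rng = np.random.default_rng(0)
T = rng.normal(size=200)
d = distance_profile(T, 3, 8)          # distances from the subsequence starting at index 3
```

The dot products are computed directly here for clarity; a fast implementation would obtain them incrementally or with an FFT-based convolution.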
- Compression-based distances
- Compression-based methods determine how easily two series can be compressed together. One such distance measure, grounded in Kolmogorov complexity, is the compression-based dissimilarity measure (CDM), which builds on findings from bioinformatics. The fundamental notion is that concatenating and compressing similar series yields a higher compression ratio than doing so for dissimilar series. This process is effective for clustering and, on that basis, can be extended to fetal heart rate tracing.
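As a minimal sketch of the idea, the CDM of two sequences x and y is commonly written as C(xy) / (C(x) + C(y)), where C(.) is the compressed size. The example below uses `zlib` as the compressor and a simple 8-bit quantization of the series; both choices, as well as the helper names, are assumptions made for illustration.

```python
import math
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (stand-in for Kolmogorov complexity)."""
    return len(zlib.compress(data, 9))

def cdm(x: bytes, y: bytes) -> float:
    """Compression-based dissimilarity: C(xy) / (C(x) + C(y)). Lower means more similar."""
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))

def quantized_sine(length: int, period: int, phase: int = 0) -> bytes:
    """Sine wave quantized to 8-bit symbols so that it can be fed to a byte compressor."""
    return bytes(
        min(255, int(127.5 * (1 + math.sin(2 * math.pi * (i + phase) / period))))
        for i in range(length)
    )

a = quantized_sine(1000, 50)
b = quantized_sine(1000, 50, phase=10)              # same shape, shifted in time
noise_source = random.Random(0)
c = bytes(noise_source.randrange(256) for _ in range(1000))  # unrelated random series

similar = cdm(a, b)      # concatenation compresses well
different = cdm(a, c)    # concatenation compresses poorly, value close to 1
```

In practice the result depends on the compressor and on how the real-valued series is discretized, so these should be chosen with the data in mind.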
- Our basic approach to finding similar pairs is to compute the dot product of each pair of normalized subsequences, from which the z-normalized Euclidean distance is obtained. In other words, each dot product can be evaluated in constant time when the preceding dot product is given:
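The incremental evaluation can be sketched as follows. The recurrence used is the one known from the STOMP algorithm, QT[i, j] = QT[i-1, j-1] - T[i-1]·T[j-1] + T[i+m-1]·T[j+m-1]; it is assumed here that this corresponds to the equation intended above. The function name `next_row` is hypothetical.

```python
import numpy as np

def next_row(T, m, i, qt_prev, qt_first):
    """Dot products of T[i:i+m] with all subsequences, derived from the row for i - 1."""
    n_sub = len(T) - m + 1
    qt = np.empty(n_sub)
    # For j >= 1: reuse QT[i-1, j-1], drop the oldest product, add the newest one.
    qt[1:] = qt_prev[:-1] - T[i - 1] * T[: n_sub - 1] + T[i + m - 1] * T[m:]
    # j == 0 has no diagonal predecessor; by symmetry QT[i, 0] == QT[0, i].
    qt[0] = qt_first[i]
    return qt

rng = np.random.default_rng(1)
T = rng.normal(size=64)
m = 8
windows = np.lib.stride_tricks.sliding_window_view(T, m)
qt_first = windows @ windows[0]        # first row computed directly
qt = qt_first.copy()
for i in range(1, len(windows)):
    qt = next_row(T, m, i, qt, qt_first)   # O(1) work per entry
```

After the loop, `qt` holds the dot products of the last subsequence with all others, and it matches the directly computed values up to floating-point round-off.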
- Before the distance profile is computed, each subsequence must be z-normalized before it is compared to the query, which requires its mean value and standard deviation. The mean of each subsequence can be calculated by keeping two running sums of the long time series with a lag of exactly m values. The sum of squares of each subsequence can be determined similarly. Here, consistency can be determined by the following:
- The standard deviation of each subsequence, calculated as the square root of the average of the squared deviations from the mean, is shown below:
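The running-sum computation can be sketched with cumulative sums: the sum over a window of m values is the difference of two cumulative sums that are m positions apart, and the same holds for the sum of squares. The function name `rolling_mean_std` is hypothetical, and the population standard deviation is assumed.

```python
import numpy as np

def rolling_mean_std(T, m):
    """Mean and population standard deviation of every length-m subsequence of T."""
    T = np.asarray(T, dtype=float)
    csum = np.concatenate(([0.0], np.cumsum(T)))        # running sum of values
    csum2 = np.concatenate(([0.0], np.cumsum(T * T)))   # running sum of squares
    seg_sum = csum[m:] - csum[:-m]      # two running sums with a lag of exactly m
    seg_sum2 = csum2[m:] - csum2[:-m]
    mu = seg_sum / m
    var = np.maximum(seg_sum2 / m - mu * mu, 0.0)       # E[x^2] - E[x]^2, clipped at 0
    return mu, np.sqrt(var)

rng = np.random.default_rng(2)
T = rng.normal(size=100)
mu, sigma = rolling_mean_std(T, 10)     # 91 subsequences of length 10
```

This takes time linear in the length of the series, regardless of m. For data with a very large offset relative to its variance, the E[x^2] - E[x]^2 form can lose precision, so centering the series first is a reasonable precaution.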
- Given a query and a time series T, the distance between the query and all subsequences in T is determined. We call this a distance profile:
- Once we have the distance profile, we can find the closest neighbor of the query in time series T.
- Note that if the query is itself a subsequence of T starting at position i, the i-th value of the distance profile is zero, and the values immediately to the left and right of i are near zero. In the literature, such a match is considered a trivial match. We prevent such matches by ignoring an “exclusion zone” around position i; in practice, the following property is set:
- Therefore, the nearest neighbor of the query can be determined by evaluating the minimum of the distance profile.
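A minimal sketch of this step is given below. It takes an already computed distance profile, masks the exclusion zone around the query position, and returns the remaining minimum. The half-width `excl` of the zone is an assumed parameter (a fraction of the subsequence length is a common choice), and the function name `nearest_neighbor` and the example values are hypothetical.

```python
import numpy as np

def nearest_neighbor(profile, i, excl):
    """Index and distance of the best non-trivial match for the query at position i."""
    d = np.array(profile, dtype=float)    # work on a copy
    lo = max(0, i - excl)
    hi = min(len(d), i + excl + 1)
    d[lo:hi] = np.inf                     # ignore trivial matches around i
    j = int(np.argmin(d))
    return j, float(d[j])

# Hand-made profile for a query located at position 2: zero at 2, near zero next to it.
profile = [2.0, 0.3, 0.0, 0.4, 1.9, 0.6, 2.5, 3.1]
j, dist = nearest_neighbor(profile, i=2, excl=1)
print(j, dist)   # → 5 0.6
```

Without the mask, the minimum would be the query itself at position 2, followed by its immediate neighbors at positions 1 and 3.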
3.3. Distance Determination and Matrix Profile Index (DD&MPI)
Algorithm 2: Immediate Neighbor Detection Process
- The profile index captures the starting index j of this nearest neighbor. In the (theoretical) case of several minimizers j, the smallest minimizer will be selected:
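The selection rule can be sketched as follows, assuming a precomputed matrix `D` of pairwise subsequence distances (in a real implementation the rows would be produced one distance profile at a time rather than stored). `np.argmin` returns the first occurrence of the minimum, which matches the rule of taking the smallest minimizer. The function name and the example matrix are hypothetical.

```python
import numpy as np

def matrix_profile_from_distances(D, excl):
    """Matrix profile P and profile index I from a pairwise distance matrix D."""
    D = np.array(D, dtype=float)          # work on a copy
    idx = np.arange(D.shape[0])
    # Mask trivial matches: pairs whose start positions differ by at most excl.
    trivial = np.abs(idx[:, None] - idx[None, :]) <= excl
    D[trivial] = np.inf
    I = np.argmin(D, axis=1)              # smallest index j among tied minimizers
    P = D[idx, I]                         # distance to that nearest neighbor
    return P, I

D = [
    [0.0, 1.0, 1.0, 3.0],   # row 0 has two minimizers (j = 1 and j = 2)
    [1.0, 0.0, 2.0, 2.0],
    [1.0, 2.0, 0.0, 0.5],
    [3.0, 2.0, 0.5, 0.0],
]
P, I = matrix_profile_from_distances(D, excl=0)
print(I.tolist())   # → [1, 0, 3, 2]
print(P.tolist())   # → [1.0, 1.0, 0.5, 0.5]
```

For row 0, both j = 1 and j = 2 attain the minimum distance of 1.0, and the smaller index 1 is recorded in the profile index.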
4. Experimental Results
4.1. Experimental Setup
4.2. Performance Metrics
- Time required for the similarity search;
- Accuracy;
- Detection efficiency.
4.2.1. Time Required for Similarity Search
4.2.2. Accuracy
4.2.3. Detection Efficiency
5. Discussion of the Results and Limitations
6. Conclusions and Future Work
6.1. Conclusions
6.2. Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Definitions
Parameters | Description |
---|---|
Personal computer | x64 |
Operating system | Windows 7 |
Processor | Intel Core i7-2670QM |
RAM | 6 GB |
CPU clock speed | 2.20 GHz |
Metric | STAMP | STOMP | NMP |
---|---|---|---|
Time required for the similarity search with 18,000 time series (ms) | 118 | 115 | 104.1 |
Time required for the similarity search with 45,000 time series (ms) | 331.3 | 302.9 | 236.3 |
Time required for the similarity search with 90,000 time series (ms) | 520.7 | 631.8 | 668.2 |
Time required for the similarity search with 135,000 time series (ms) | 959.2 | 1040.6 | 1074.9 |
Accuracy with 135,000 time series | 98.62% | 98.09% | 99.76% |
Accuracy with 180,000 time series | 98.58% | 97.61% | 99.5% |
10 months of patient’s infection data | 87.4% | 87.1% | 97.2% |
10 months of patient’s recovered data | 85.6% | 83.6% | 98.4% |
10 months of patient’s death data | 85.3% | 81.7% | 94.1% |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Razaque, A.; Abenova, M.; Alotaibi, M.; Alotaibi, B.; Alshammari, H.; Hariri, S.; Alotaibi, A. Anomaly Detection Paradigm for Multivariate Time Series Data Mining for Healthcare. Appl. Sci. 2022, 12, 8902. https://doi.org/10.3390/app12178902