
Clustering-based gradual pattern mining

  • Original Article
  • Published in: International Journal of Machine Learning and Cybernetics

Abstract

Generally, the classical problem of gradual pattern mining involves generating pattern candidates and determining the number of concordant object pairs associated with them. Given a numeric data set with n objects and m features, each feature yields two gradual items. Gradual pattern candidates can be formed by combining different sets of gradual items. In fact, a gradual pattern is composed of gradual items with similar concordant object pairs. However, computing the object pairs for each item has a complexity that is approximately quadratic in terms of the number of objects. As the main contribution of this paper, we propose finding gradual patterns by clustering gradual items based on their similarity in object pairs. First, we project the object pairs of each gradual item onto an n-dimensional subspace, thus reducing the complexity of computing object pairs from a quadratic function to a linear function. Second, we group gradual items into r clusters based on the similarity of object pairs in the n-dimensional subspace. As part of our experiments, we evaluated our approach using a variety of clustering algorithms. We found that the best clustering algorithms (across all the data sets we used) achieved precision scores above 55%, recall scores close to 100%, and F1 scores above 71%.
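As a concrete illustration of this pipeline, the following minimal sketch computes rank-based net-win vectors as the n-dimensional projection and groups the resulting gradual items with scikit-learn's KMeans [27], one candidate among the clustering algorithms the paper evaluates. The helper names and toy data are illustrative, not the authors' exact implementation:

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.cluster import KMeans

def net_win_vectors(data):
    """Project every gradual item (feature, direction) onto an
    n-dimensional net-win vector.

    For the increasing item on feature f, object i's entry is the
    number of object pairs it "wins" minus the number it "loses"
    under the ordering of f. Via ranks this costs O(n log n) per
    feature instead of O(n^2) over explicit pairs; the decreasing
    item is simply the negated vector.
    """
    n, m = data.shape
    vectors, items = [], []
    for f in range(m):
        r = rankdata(data[:, f])       # ranks 1..n, ties averaged
        v = 2.0 * r - (n + 1)          # net wins per object
        vectors.append(v)
        items.append((f, "+"))         # increasing gradual item
        vectors.append(-v)
        items.append((f, "-"))         # decreasing gradual item
    return np.array(vectors), items

# Toy data: features 0 and 1 co-vary, so their increasing items
# should land in the same cluster (a candidate gradual pattern).
rng = np.random.default_rng(0)
base = rng.normal(size=50)
data = np.column_stack([
    base,
    base + 0.1 * rng.normal(size=50),
    rng.normal(size=50),
    rng.normal(size=50),
])

vectors, items = net_win_vectors(data)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)
for item, label in zip(items, labels):
    print(item, "-> cluster", label)
```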




Data availability

The data were derived from the following resources available in the public domain: https://oreme.org/observation/ltc/, https://doi.org/10.1016/j.snb.2007.09.060, https://doi.org/10.1007/978-3-319-46349-0_36, https://doi.org/10.1109/TSMC.2014.2347265, https://doi.org/10.1186/s12885-017-3877-1.

Notes

  1. https://meso-lr.umontpellier.fr

References

  1. Arthur D, Vassilvitskii S (2007) K-means++: The advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’07, pp. 1027–1035. Society for Industrial and Applied Mathematics, USA

  2. Balcan MF, Blum A, Vempala S (2008) A discriminative framework for clustering via similarity functions. In: Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, STOC ’08, pp. 671–680. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/1374376.1374474

  3. Berkhin P (2006) A survey of clustering data mining techniques. In: Grouping multidimensional data, pp. 25–71. Springer

  4. Bezdek JC, Ehrlich R, Full W (1984) FCM: The fuzzy c-means clustering algorithm. Comput Geosci 10(2):191–203. https://doi.org/10.1016/0098-3004(84)90020-7

  5. Bouchette F (2019) OREME: the coastline observation system. https://oreme.org/observation/ltc/

  6. Bradley RA, Terry ME (1952) Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39(3/4):324–345

  7. Clémentin TD, Cabrel TFL, Belise KE (2021) A novel algorithm for extracting frequent gradual patterns. Mach Learn Appl 5:100068. https://doi.org/10.1016/j.mlwa.2021.100068

  8. De Vito S, Massera E, Piga M, Martinotto L, Di Francia G (2008) On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario. Sensors Actuators B: Chem 129(2):750–757. https://doi.org/10.1016/j.snb.2007.09.060


  9. Di-Jorio L, Laurent A, Teisseire M (2009) Mining frequent gradual itemsets from large databases. In: Advances in Intelligent Data Analysis VIII, pp. 297–308. Springer-Verlag, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03915-7_26

  10. Dias MLD (2019) fuzzy-c-means: An implementation of Fuzzy C-means clustering algorithm. https://doi.org/10.5281/zenodo.3066222. https://git.io/fuzzy-c-means

  11. Dinari O, Freifeld O (2022) Revisiting dp-means: Fast scalable algorithms via parallelism and delayed cluster creation. In: The 38th Conference on Uncertainty in Artificial Intelligence

  12. Dubey A, Choubey A (2017) A systematic review on k-means clustering techniques. Int J Sci Res Eng Technol (IJSRET, ISSN 2278–0882) 6(6)

  13. Gondek C, Hafner D, Sampson OR (2016) Prediction of failures in the air pressure system of scania trucks using a random forest and feature engineering. In: Advances in Intelligent Data Analysis XV, pp. 398–402. Springer International Publishing, Cham

  14. Gulati H, Singh P, et al. (2015) Clustering techniques in data mining: A comparison. In: 2015 2nd international conference on computing for sustainable global development (INDIACom), pp. 410–415. IEEE

  15. Hunter DR (2004) MM algorithms for generalized Bradley-Terry models. Ann Stat 32(1):384–406. https://doi.org/10.1214/aos/1079120141

  16. Laurent A, Lesot MJ, Rifqi M (2009) Graank: Exploiting rank correlations for extracting gradual itemsets. In: Proceedings of the 8th International Conference on Flexible Query Answering Systems, FQAS ’09, pp. 382–393. Springer-Verlag, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04957-6_33

  17. van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9(86):2579–2605

  18. Madhulatha TS (2012) An overview on clustering methods. arXiv preprint arXiv:1205.1117

  19. Metzger A, Leitner P, Ivanović D, Schmieders E, Franklin R, Carro M, Dustdar S, Pohl K (2015) Comparing and combining predictive business process monitoring techniques. IEEE Trans Syst Man Cybern Syst 45(2):276–290. https://doi.org/10.1109/TSMC.2014.2347265

  20. Negrevergne B, Termier A, Rousset MC, Méhaut JF (2014) Paraminer: a generic pattern mining algorithm for multi-core architectures. Data Min Knowl Discov 28(3):593–633. https://doi.org/10.1007/s10618-013-0313-2

  21. Owuor D, Laurent A, Orero J (2019) Mining fuzzy-temporal gradual patterns. In: 2019 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–6. IEEE, New York, NY, USA. https://doi.org/10.1109/FUZZ-IEEE.2019.8858883

  22. Owuor DO, Laurent A (2021) Efficiently mining large gradual patterns using chunked storage layout. In: L. Bellatreche, M. Dumas, P. Karras, R. Matulevičius (eds.) Advances in Databases and Information Systems, pp. 30–42. Springer International Publishing, Cham. https://doi.org/10.1007/978-3-030-82472-3_4

  23. Owuor DO, Runkler T, Laurent A (2022) A metaheuristic approach for mining gradual patterns. Swarm and Evolutionary Computation 101205. https://doi.org/10.1016/j.swevo.2022.101205

  24. Owuor DO, Runkler T, Laurent A, Orero JO, Menya EO (2021) Ant colony optimization for mining gradual patterns. Int J Mach Learn Cybern. https://doi.org/10.1007/s13042-021-01390-w

  25. Pandit S, Gupta S et al (2011) A comparative study on distance measuring approaches for clustering. Int J Res Comput Sci 2(1):29–31


  26. Patrício M, Pereira J, Crisóstomo J, Matafome P, Gomes M, Seiça R, Caramelo F (2018) Using resistin, glucose, age and BMI to predict the presence of breast cancer. BMC Cancer 18(1):29. https://doi.org/10.1186/s12885-017-3877-1

  27. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12:2825–2830

  28. Satpathi S (2017) Perfect clustering from pairwise comparisons. Master’s thesis, Graduate College of the University of Illinois, Urbana, Illinois

  29. Scoccola L, Rolle A (2023) Persistable: persistent and stable clustering. J Open Source Softw 8(83):5022. https://doi.org/10.21105/joss.05022

  30. Shalini DVS, Shashi M, Sowjanya AM (2011) Mining frequent patterns of stock data using hybrid clustering. In: 2011 Annual IEEE India Conference, pp. 1–4. https://doi.org/10.1109/INDCON.2011.6139404

  31. Wu R, Xu J, Srikant R, Massoulie L, Lelarge M, Hajek B (2015) Clustering and inference from pairwise comparisons. In: Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’15, pp. 449–450. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/2745844.2745887

  32. Xu J, Wu R, Zhu K, Hajek B, Srikant R, Ying L (2014) Jointly clustering rows and columns of binary matrices: Algorithms and trade-offs. In: The 2014 ACM International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’14, pp. 29–41. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/2591971.2592005

  33. Yiu ML, Mamoulis N (2003) Frequent-pattern based iterative projected clustering. In: Third IEEE International Conference on Data Mining, pp. 689–692. https://doi.org/10.1109/ICDM.2003.1251009

  34. Zhang R, Peng H, Dou Y, Wu J, Sun Q, Li Y, Zhang J, Yu PS (2022) Automating DBSCAN via deep reinforcement learning. In: Proceedings of the 31st ACM International Conference on Information & Knowledge Management, CIKM ’22, pp. 2620–2630. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3511808.3557245

  35. Zimek A, Assent I, Vreeken J (2014) Frequent Pattern Mining Algorithms for Data Clustering, pp. 403–423. Springer International Publishing, Cham. https://doi.org/10.1007/978-3-319-07821-2_16


Acknowledgements

This work has been realized with the support of the High Performance Computing Platform: MESO@LR, financed by the Occitanie/Pyrénées-Méditerranée Region, Montpellier Mediterranean Metropole and Montpellier University.

Author information

Corresponding author

Correspondence to Dickson Odhiambo Owuor.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Theorems and Proofs

Theorem 1

Under the conditions stated in Sect. 3.4, assume that \(b \le \frac{4}{3}\) for any arbitrarily small constant or \(b > \frac{4}{3}\) for some large constant. Then, with high probability, there exists a permutation \(\pi\) such that

$$\begin{aligned} |{\mathcal {C}}_{k} ~\triangle ~ \widehat{{\mathcal {C}}}_{\pi (k)} | \le \frac{C'\cdot r~\text {max}\{n, 2m\}\log {n}\log {2m}}{(1 - \epsilon )n^{2}}, ~~ \forall k. \end{aligned}$$

In particular, when \((1 - \epsilon )Kn^{2} \ge C'\cdot r~\text {max}\{n, 2m\}\log {n}\log ^{2}{2m}\), then \({\frac{|{\mathcal {C}}_{k} ~\triangle ~ \widehat{{\mathcal {C}}}_{\pi (k)} | }{K} \le \frac{1}{\log {2m}}}\).

Theorem 1 implies that, if n and 2m are of the same order, each gradual item only needs to provide \(r^{2}\text {poly}(\log {n})\) object pairs to allow accurate clustering, except for \(K/\log {2m}\) gradual items.
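To unpack this claim, substitute \(\text {max}\{n, 2m\} \asymp n\) and \(\log {2m} \asymp \log {n}\) into the bound of Theorem 1; the following simplification is our own reading, with absolute constants absorbed into \(C'\):

$$\begin{aligned} |{\mathcal {C}}_{k} ~\triangle ~ \widehat{{\mathcal {C}}}_{\pi (k)} | \le \frac{C'\cdot r\,n\log ^{2}{n}}{(1 - \epsilon )n^{2}} = \frac{C'\cdot r\log ^{2}{n}}{(1 - \epsilon )n}, \end{aligned}$$

so the number of misclustered gradual items vanishes as n grows for fixed r and \(\epsilon\).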

Theorem 2

Assume the conditions stated in Sect. 3.4. Define

$$\begin{aligned} \eta _{1} = \frac{r~\text {max}\{n,2m\}\log {n}\log {2m}}{(1-\epsilon )Kn^{2}}, ~~ \eta _{2} = \sqrt{\frac{\log {n}}{(1-\epsilon )Kn}} \end{aligned}$$

Assume \(b \le \frac{4}{3}\) for any arbitrarily small constant. Then there exists a constant \(C > 0\) such that, with high probability,

$$\begin{aligned} \frac{||{\hat{\theta }} _{g} - \theta _{g} ||_{2}}{||\theta _{g} ||_{2}} \le \frac{C (e^{b} + 1)^2}{be^{b}}\text {max}\{\eta _{1}, \eta _{2} \} \end{aligned}$$

In particular, when \((1 - \epsilon )Kn^{2} \ge r\,\text {max}\{n, 2m\}\log {n}\log ^{2}{2m}\), then \({\frac{||{\hat{\theta }} _{g} - \theta _{g} ||_{2}}{||\theta _{g} ||_{2}} = O(\frac{1}{\log {2m}})}\) except for \(K/\log {2m}\) gradual items.

Theorem 2 demonstrates that the estimation error depends on the maximum of \(\eta _{1}\) and \(\eta _{2}\). It also shows that Algorithm 1 requires approximately \((1 - \epsilon )Kn^{2} = O(r\,\text {max}\{n, 2m\}\log {n}\log ^{2}{2m})\) object pairs per cluster.
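To see which term dominates, squaring the comparison \(\eta _{1} \le \eta _{2}\) and rearranging gives (our own computation under the same assumptions):

$$\begin{aligned} \eta _{1} \le \eta _{2} \iff r^{2}\,\text {max}^{2}\{n, 2m\}\log {n}\log ^{2}{2m} \le (1 - \epsilon )Kn^{3}, \end{aligned}$$

which for \(n \asymp 2m\) reduces to \((1 - \epsilon )Kn \ge r^{2}\log ^{3}{n}\) up to constants; beyond this point the estimation error is governed by \(\eta _{2}\). This is consistent with the \(r^{2}\text {poly}(\log {n})\) object pairs noted after Theorem 1.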

Proof of Theorem 1

Let \({\rho = \frac{(1-\epsilon )n}{\sqrt{\log {n}}}}\). We say that a gradual item g is a good gradual item if \(||{\widetilde{S}}_{g} - {\bar{S}}_{g} ||_{2} \le \frac{\rho }{2}\). Let \({\bar{S}}_{k}\) denote the common expected net-win vector for the gradual items in cluster k, where \(k = 1, 2,..., r\). Under the model assumptions in Sect. 3.4, for every good gradual item g and any \(k \ne k'\),

$$\begin{aligned} ||{\widetilde{S}}_{g} - {\bar{S}}_{g} ||_{2} \le \frac{\rho }{2} < ||{\bar{S}}_{k} - {\bar{S}}_{k'} ||_{2}. \end{aligned}$$


Following the proofs of Lemmas 3, 4, and 5 in [31], if we let \({\mathcal {G}}\) denote the set of good gradual items, then the number of bad gradual items satisfies

$$\begin{aligned} |{\mathcal {G}}^{c}| \le \frac{C'\cdot r~\text {max}\{n, 2m\}\log {n}\log {2m}}{(1 - \epsilon )n^{2}}. \end{aligned}$$

Following the proof of Proposition 1 in [32], we conclude that there exists a permutation \(\pi\) such that

$$\begin{aligned} |{\mathcal {C}}_{k} ~\triangle ~ \widehat{{\mathcal {C}}}_{\pi (k)} | \le |{\mathcal {G}}^{c}|. \end{aligned}$$

\(\square\)

Proof of Theorem 2

From Theorem 1, there exists a permutation \(\pi\) such that

$$\begin{aligned} |{\mathcal {C}}_{k} ~\triangle ~ \widehat{{\mathcal {C}}}_{\pi (k)} | \le \frac{C'\cdot r~\text {max}\{n, 2m\}\log {n}\log {2m}}{(1 - \epsilon )n^{2}}, ~~ \forall k. \end{aligned}$$


To obtain the result of Theorem 2, we apply Theorem 3 in [31]. To achieve \(\frac{||{\hat{\theta }} _{g} - \theta _{g} ||_{2}}{||\theta _{g} ||_{2}} = O(\frac{1}{\log {2m}})\) for the good gradual items, we need

$$\begin{aligned} \frac{r~\text {max}\{n,2m\}\log {n}\log {2m}}{(1-\epsilon )Kn^{2}} \le \frac{1}{\log {2m}}\\ \sqrt{\frac{\log {n}}{(1-\epsilon )Kn}} \le \frac{1}{\log {2m}} \end{aligned}$$

which requires that \((1 - \epsilon )Kn^{2} > r\,\text {max}\{n, 2m\}\log {n}\log ^{2}{2m}\) and \((1 - \epsilon )Kn > n\log {n}\log ^{2}{2m}\), respectively. Note that the former requirement is stricter than the latter; this implies that the clustering step of Algorithm 1 needs more object pairs than the score vector estimation step to achieve the same error rate.

\(\square\)

B Some Properties of the Bradley-Terry Model

The proofs in this section are adapted from [28] and [31]. First, we introduce some additional notation used in this section. Let I denote the identity matrix, let \({\textbf{1}}\) denote the all-one vector, and let \(J = \textbf{11}^{\top }\) denote the all-one matrix. This section aims to prove that if the parameters \(\theta\) of the Bradley-Terry model are drawn from a uniform distribution on a hypercube, then

$$\begin{aligned} \begin{aligned} ||{\bar{S}}_{u} - {\bar{S}}_{v} ||_{2} \ge C(1-\epsilon )n,\\ \text {where } u, v \text { belong to different clusters.} \end{aligned} \end{aligned}$$

We fix \(b \in \mathbb {R}\) and assume that, for each cluster \(k \in \{1, 2,..., r\}\) and each item \(i \in \{1, 2,..., n\}\), \(\theta _{k, i}\) is drawn from a uniform distribution on [0, b]. The proof is divided into two parts: (1) the case \(b \le \frac{4}{3}\) and (2) the case \(b > \frac{4}{3}\). Due to space constraints, these proofs appear in the Appendix of [28].

It should be noted that \(AA^{\top } = nI - J \triangleq L_{n}\), the Laplacian of the complete graph. It is easy to verify that the eigenvalues of \(L_{n}\) are 0 and n, and that the eigenvector corresponding to the zero eigenvalue is \(\frac{1}{\sqrt{n}}{\textbf{1}}\). Therefore, A has rank \(n - 1\) and all of its nonzero singular values equal \(\sqrt{n}\). Since the SVD of A is \(A = \sqrt{n}UV^{\top }\) and \(S_{g} = \frac{1}{\sqrt{n}}R_{g}A^{\top }\), we have that

$$\begin{aligned} {\bar{S}}_{g} = {\bar{R}}_{g}UV^{\top }. \end{aligned}$$
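These facts are straightforward to check numerically. A minimal sketch, assuming A is the \(n \times \binom{n}{2}\) pair-incidence matrix with one +1/−1 column per object pair:

```python
import itertools
import numpy as np

n = 6
pairs = list(itertools.combinations(range(n), 2))
A = np.zeros((n, len(pairs)))
for col, (i, j) in enumerate(pairs):
    A[i, col], A[j, col] = 1.0, -1.0   # pair (i, j): +1 for i, -1 for j

L = A @ A.T                            # Laplacian of the complete graph
assert np.allclose(L, n * np.eye(n) - np.ones((n, n)))   # L_n = nI - J

# Eigenvalues: 0 once (eigenvector ones/sqrt(n)), n with multiplicity n-1.
print(np.round(np.linalg.eigvalsh(L), 8))

# Hence rank(A) = n - 1 and every nonzero singular value equals sqrt(n).
print(np.round(np.linalg.svd(A, compute_uv=False), 8))
```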

Using the probability density function in Eq. 2, we compute \({\bar{R}}_{g}\):

$$\begin{aligned}&\mathbb {E}[R_{g, i, j}] = (1 - \epsilon )\frac{e^{\theta _{g,i}} - e^{\theta _{g, j}}}{e^{\theta _{g,i}} + e^{\theta _{g, j}}} \triangleq (1 - \epsilon )f(\theta _{g, i} - \theta _{g, j}),\\&\text {where } f(x) = \frac{e^{x} - 1}{e^{x} + 1}. \text { Then}\\&\mathbb {E}[R_{g}] = (1 - \epsilon )f(\theta _{g}A), \end{aligned}$$

where f is applied entrywise to the vector \(\theta _{g}A\).
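A small sketch of this expectation, assuming \(\theta _{g}\) holds the Bradley-Terry scores of the n objects and using the identity \(f(x) = \tanh (x/2)\); the \(1/\sqrt{n}\) normalization in the definition of \(S_{g}\) is omitted for readability:

```python
import itertools
import numpy as np

def expected_net_win(theta, eps=0.1):
    """Expected net-win vector: entry i sums the expected outcomes
    E[R_{g,i,j}] = (1 - eps) * f(theta_i - theta_j) over all pairs,
    where (1 - eps) is the probability that a pair is observed."""
    def f(x):
        return np.tanh(x / 2.0)        # equals (e^x - 1) / (e^x + 1)
    s = np.zeros(len(theta))
    for i, j in itertools.combinations(range(len(theta)), 2):
        e = (1.0 - eps) * f(theta[i] - theta[j])
        s[i] += e                      # i expects e net wins against j
        s[j] -= e
    return s

theta = np.array([0.9, 0.5, 0.1])      # higher score => more expected wins
print(expected_net_win(theta))         # decreasing entries, summing to zero
```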

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Owuor, D.O., Runkler, T., Laurent, A. et al. Clustering-based gradual pattern mining. Int. J. Mach. Learn. & Cyber. 15, 2263–2281 (2024). https://doi.org/10.1007/s13042-023-02027-w
