
Clustering-based gradual pattern mining

  • Original Article
  • Published in: International Journal of Machine Learning and Cybernetics

Abstract

Generally, the classical problem of gradual pattern mining involves generating pattern candidates and determining the number of concordant object pairs associated with them. Given a numeric data set with n objects and m features, each feature yields two gradual items. Gradual pattern candidates can be formed by combining different sets of gradual items. In fact, a gradual pattern is composed of gradual items with similar concordant object pairs. However, computing the object pairs for each item has a complexity that is approximately quadratic in terms of the number of objects. As the main contribution of this paper, we propose finding gradual patterns by clustering gradual items based on their similarity in object pairs. First, we project the object pairs of each gradual item onto an n-dimensional subspace, thus reducing the complexity of computing object pairs from a quadratic function to a linear function. Second, we group gradual items into r clusters based on the similarity of object pairs in the n-dimensional subspace. As part of our experiments, we evaluated our approach using a variety of clustering algorithms. We found that the best clustering algorithms (across all the data sets we used) achieved precision scores above 55%, recall scores close to 100%, and F1 scores above 71%.
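As a concrete illustration of this pipeline, the following minimal sketch computes rank-based net-win vectors as the n-dimensional projection and groups the resulting gradual items with scikit-learn's KMeans [27], one candidate among the clustering algorithms the paper evaluates. The helper names and toy data are illustrative, not the authors' exact implementation:

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.cluster import KMeans

def net_win_vectors(data):
    """Project every gradual item (feature, direction) onto an
    n-dimensional net-win vector.

    For the increasing item on feature f, object i's entry is the
    number of object pairs it "wins" minus the number it "loses"
    under the ordering of f. Via ranks this costs O(n log n) per
    feature instead of O(n^2) over explicit pairs; the decreasing
    item is simply the negated vector.
    """
    n, m = data.shape
    vectors, items = [], []
    for f in range(m):
        r = rankdata(data[:, f])       # ranks 1..n, ties averaged
        v = 2.0 * r - (n + 1)          # net wins per object
        vectors.append(v)
        items.append((f, "+"))         # increasing gradual item
        vectors.append(-v)
        items.append((f, "-"))         # decreasing gradual item
    return np.array(vectors), items

# Toy data: features 0 and 1 co-vary, so their increasing items
# should land in the same cluster (a candidate gradual pattern).
rng = np.random.default_rng(0)
base = rng.normal(size=50)
data = np.column_stack([
    base,
    base + 0.1 * rng.normal(size=50),
    rng.normal(size=50),
    rng.normal(size=50),
])

vectors, items = net_win_vectors(data)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)
for item, label in zip(items, labels):
    print(item, "-> cluster", label)
```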




Data availability

The data were derived from the following resources available in the public domain: https://oreme.org/observation/ltc/, https://doi.org/10.1016/j.snb.2007.09.060, https://doi.org/10.1007/978-3-319-46349-0_36, https://doi.org/10.1109/TSMC.2014.2347265, https://doi.org/10.1186/s12885-017-3877-1.

Notes

  1. https://meso-lr.umontpellier.fr

References

  1. Arthur D, Vassilvitskii S (2007) K-means++: The advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’07, pp. 1027–1035. Society for Industrial and Applied Mathematics, USA

  2. Balcan MF, Blum A, Vempala S (2008) A discriminative framework for clustering via similarity functions. In: Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, STOC ’08, pp. 671–680. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/1374376.1374474

  3. Berkhin P (2006) A survey of clustering data mining techniques. In: Grouping multidimensional data, pp. 25–71. Springer

  4. Bezdek JC, Ehrlich R, Full W (1984) FCM: The fuzzy c-means clustering algorithm. Comput Geosci 10(2):191–203. https://doi.org/10.1016/0098-3004(84)90020-7

  5. Bouchette F (2019) OREME: the coastline observation system. https://oreme.org/observation/ltc/

  6. Bradley RA, Terry ME (1952) Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39(3/4):324–345

  7. Clémentin TD, Cabrel TFL, Belise KE (2021) A novel algorithm for extracting frequent gradual patterns. Mach Learn Appl 5:100068. https://doi.org/10.1016/j.mlwa.2021.100068

  8. De Vito S, Massera E, Piga M, Martinotto L, Di Francia G (2008) On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario. Sensors Actuators B: Chem 129(2):750–757. https://doi.org/10.1016/j.snb.2007.09.060


  9. Di-Jorio L, Laurent A, Teisseire M (2009) Mining frequent gradual itemsets from large databases. In: Advances in Intelligent Data Analysis VIII, pp. 297–308. Springer-Verlag, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03915-7_26

  10. Dias MLD (2019) fuzzy-c-means: An implementation of Fuzzy C-means clustering algorithm. https://doi.org/10.5281/zenodo.3066222. https://git.io/fuzzy-c-means

  11. Dinari O, Freifeld O (2022) Revisiting dp-means: Fast scalable algorithms via parallelism and delayed cluster creation. In: The 38th Conference on Uncertainty in Artificial Intelligence

  12. Dubey A, Choubey A (2017) A systematic review on k-means clustering techniques. Int J Sci Res Eng Technol (IJSRET, ISSN 2278–0882) 6(6)

  13. Gondek C, Hafner D, Sampson OR (2016) Prediction of failures in the air pressure system of scania trucks using a random forest and feature engineering. In: Advances in Intelligent Data Analysis XV, pp. 398–402. Springer International Publishing, Cham

  14. Gulati H, Singh P, et al. (2015) Clustering techniques in data mining: A comparison. In: 2015 2nd international conference on computing for sustainable global development (INDIACom), pp. 410–415. IEEE

  15. Hunter DR (2004) MM algorithms for generalized Bradley-Terry models. Ann Stat 32(1):384–406. https://doi.org/10.1214/aos/1079120141

  16. Laurent A, Lesot MJ, Rifqi M (2009) Graank: Exploiting rank correlations for extracting gradual itemsets. In: Proceedings of the 8th International Conference on Flexible Query Answering Systems, FQAS ’09, pp. 382–393. Springer-Verlag, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04957-6_33

  17. van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9(86):2579–2605

  18. Madhulatha TS (2012) An overview on clustering methods. arXiv preprint arXiv:1205.1117

  19. Metzger A, Leitner P, Ivanović D, Schmieders E, Franklin R, Carro M, Dustdar S, Pohl K (2015) Comparing and combining predictive business process monitoring techniques. IEEE Trans Syst Man Cybern Syst 45(2):276–290. https://doi.org/10.1109/TSMC.2014.2347265

  20. Negrevergne B, Termier A, Rousset MC, Méhaut JF (2014) Paraminer: a generic pattern mining algorithm for multi-core architectures. Data Min Knowl Discov 28(3):593–633. https://doi.org/10.1007/s10618-013-0313-2

  21. Owuor D, Laurent A, Orero J (2019) Mining fuzzy-temporal gradual patterns. In: 2019 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–6. IEEE, New York, NY, USA. https://doi.org/10.1109/FUZZ-IEEE.2019.8858883

  22. Owuor DO, Laurent A (2021) Efficiently mining large gradual patterns using chunked storage layout. In: L. Bellatreche, M. Dumas, P. Karras, R. Matulevičius (eds.) Advances in Databases and Information Systems, pp. 30–42. Springer International Publishing, Cham. https://doi.org/10.1007/978-3-030-82472-3_4

  23. Owuor DO, Runkler T, Laurent A (2022) A metaheuristic approach for mining gradual patterns. Swarm and Evolutionary Computation 101205. https://doi.org/10.1016/j.swevo.2022.101205

  24. Owuor DO, Runkler T, Laurent A, Orero JO, Menya EO (2021) Ant colony optimization for mining gradual patterns. Int J Mach Learn Cybern. https://doi.org/10.1007/s13042-021-01390-w

  25. Pandit S, Gupta S et al (2011) A comparative study on distance measuring approaches for clustering. Int J Res Comput Sci 2(1):29–31


  26. Patrício M, Pereira J, Crisóstomo J, Matafome P, Gomes M, Seiça R, Caramelo F (2018) Using resistin, glucose, age and BMI to predict the presence of breast cancer. BMC Cancer 18(1):29. https://doi.org/10.1186/s12885-017-3877-1

  27. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12:2825–2830

  28. Satpathi S (2017) Perfect clustering from pairwise comparisons. Master’s thesis, Graduate College of the University of Illinois, Urbana, Illinois

  29. Scoccola L, Rolle A (2023) Persistable: persistent and stable clustering. J Open Source Softw 8(83):5022. https://doi.org/10.21105/joss.05022

  30. Shalini DVS, Shashi M, Sowjanya AM (2011) Mining frequent patterns of stock data using hybrid clustering. In: 2011 Annual IEEE India Conference, pp. 1–4. https://doi.org/10.1109/INDCON.2011.6139404

  31. Wu R, Xu J, Srikant R, Massoulie L, Lelarge M, Hajek B (2015) Clustering and inference from pairwise comparisons. In: Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’15, pp. 449–450. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/2745844.2745887

  32. Xu J, Wu R, Zhu K, Hajek B, Srikant R, Ying L (2014) Jointly clustering rows and columns of binary matrices: Algorithms and trade-offs. In: The 2014 ACM International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’14, pp. 29–41. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/2591971.2592005

  33. Yiu ML, Mamoulis N (2003) Frequent-pattern based iterative projected clustering. In: Third IEEE International Conference on Data Mining, pp. 689–692. https://doi.org/10.1109/ICDM.2003.1251009

  34. Zhang R, Peng H, Dou Y, Wu J, Sun Q, Li Y, Zhang J, Yu PS (2022) Automating DBSCAN via deep reinforcement learning. In: Proceedings of the 31st ACM International Conference on Information & Knowledge Management, CIKM ’22, pp. 2620–2630. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3511808.3557245

  35. Zimek A, Assent I, Vreeken J (2014) Frequent Pattern Mining Algorithms for Data Clustering, pp. 403–423. Springer International Publishing, Cham. https://doi.org/10.1007/978-3-319-07821-2_16


Acknowledgements

This work has been realized with the support of the High Performance Computing Platform: MESO@LR, financed by the Occitanie/Pyrénées-Méditerranée Region, Montpellier Mediterranean Metropole and Montpellier University.

Author information

Corresponding author

Correspondence to Dickson Odhiambo Owuor.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Theorems and Proofs

Theorem 1

Under the conditions stated in Sect. 3.4, assume that \(b \le \frac{4}{3}\) for any arbitrarily small constant or \(b > \frac{4}{3}\) for some large constant. Then, with high probability, there exists a permutation \(\pi\) such that

$$\begin{aligned} |{\mathcal {C}}_{k} ~\triangle ~ \widehat{{\mathcal {C}}}_{\pi (k)} | \le \frac{C'\cdot r~\text {max}\{n, 2m\}\log {n}\log {2m}}{(1 - \epsilon )n^{2}}, ~~ \forall k. \end{aligned}$$

In particular, when \((1 - \epsilon )Kn^{2} \ge C'\cdot r~\text {max}\{n, 2m\}\log {n}\log ^{2}{2m}\), then \({\frac{|{\mathcal {C}}_{k} ~\triangle ~ \widehat{{\mathcal {C}}}_{\pi (k)} | }{K} \le \frac{1}{\log {2m}}}\).

Theorem 1 implies that, if n and 2m are of the same order, each gradual item only needs to provide \(r^{2}\text {poly}(\log {n})\) object pairs to allow accurate clustering, except for \(K/\log {2m}\) gradual items.
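To unpack this claim, substitute \(\text {max}\{n, 2m\} \asymp n\) and \(\log {2m} \asymp \log {n}\) into the bound of Theorem 1; the following simplification is our own reading, with absolute constants absorbed into \(C'\):

$$\begin{aligned} |{\mathcal {C}}_{k} ~\triangle ~ \widehat{{\mathcal {C}}}_{\pi (k)} | \le \frac{C'\cdot r\,n\log ^{2}{n}}{(1 - \epsilon )n^{2}} = \frac{C'\cdot r\log ^{2}{n}}{(1 - \epsilon )n}, \end{aligned}$$

so the number of misclustered gradual items vanishes as n grows for fixed r and \(\epsilon\).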

Theorem 2

Assume the conditions stated in Sect. 3.4. Define

$$\begin{aligned} \eta _{1} = \frac{r~\text {max}\{n,2m\}\log {n}\log {2m}}{(1-\epsilon )Kn^{2}}, ~~ \eta _{2} = \sqrt{\frac{\log {n}}{(1-\epsilon )Kn}} \end{aligned}$$

Assume \(b \le \frac{4}{3}\) for any arbitrarily small constant. Then there exists a constant \(C > 0\) such that, with high probability,

$$\begin{aligned} \frac{||{\hat{\theta }} _{g} - \theta _{g} ||_{2}}{||\theta _{g} ||_{2}} \le \frac{C (e^{b} + 1)^2}{be^{b}}\text {max}\{\eta _{1}, \eta _{2} \} \end{aligned}$$

In particular, when \((1 - \epsilon )Kn^{2} \ge r\,\text {max}\{n, 2m\}\log {n}\log ^{2}{2m}\), then \({\frac{||{\hat{\theta }} _{g} - \theta _{g} ||_{2}}{||\theta _{g} ||_{2}} = O(\frac{1}{\log {2m}})}\) except for \(K/\log {2m}\) gradual items.

Theorem 2 demonstrates that the estimation error depends on the maximum of \(\eta _{1}\) and \(\eta _{2}\). It also shows that Algorithm 1 requires approximately \((1 - \epsilon )Kn^{2} = O(r\,\text {max}\{n, 2m\}\log {n}\log ^{2}{2m})\) object pairs per cluster.
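To see which term dominates, squaring the comparison \(\eta _{1} \le \eta _{2}\) and rearranging gives (our own computation under the same assumptions):

$$\begin{aligned} \eta _{1} \le \eta _{2} \iff r^{2}\,\text {max}^{2}\{n, 2m\}\log {n}\log ^{2}{2m} \le (1 - \epsilon )Kn^{3}, \end{aligned}$$

which for \(n \asymp 2m\) reduces to \((1 - \epsilon )Kn \ge r^{2}\log ^{3}{n}\) up to constants; beyond this point the estimation error is governed by \(\eta _{2}\). This is consistent with the \(r^{2}\text {poly}(\log {n})\) object pairs noted after Theorem 1.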

Proof of Theorem 1

Let \({\rho = \frac{(1-\epsilon )n}{\sqrt{\log {n}}}}\). We say that a gradual item g is a good gradual item if \(||{\widetilde{S}}_{g} - {\bar{S}}_{g} ||_{2} \le \frac{\rho }{2}\). Let \({\bar{S}}_{k}\) denote the common expected net-win vector for the gradual items in cluster k, where \(k = 1, 2,..., r\). Under the model assumptions in Sect. 3.4, for every good gradual item g and any \(k \ne k'\),

$$\begin{aligned} ||{\widetilde{S}}_{g} - {\bar{S}}_{g} ||_{2} \le \frac{\rho }{2} < ||{\bar{S}}_{k} - {\bar{S}}_{k'} ||_{2}. \end{aligned}$$


Following the proofs of Lemmas 3, 4, and 5 in [31], if we let \({\mathcal {G}}\) denote the set of good gradual items, then the number of bad gradual items satisfies

$$\begin{aligned} |{\mathcal {G}}^{c}| \le \frac{C'\cdot r~\text {max}\{n, 2m\}\log {n}\log {2m}}{(1 - \epsilon )n^{2}}. \end{aligned}$$

Following the proof of Proposition 1 in [32], we conclude that there exists a permutation \(\pi\) such that

$$\begin{aligned} |{\mathcal {C}}_{k} ~\triangle ~ \widehat{{\mathcal {C}}}_{\pi (k)} | \le |{\mathcal {G}}^{c}|. \end{aligned}$$

\(\square\)

Proof of Theorem 2

From Theorem 1, there exists a permutation \(\pi\) such that

$$\begin{aligned} |{\mathcal {C}}_{k} ~\triangle ~ \widehat{{\mathcal {C}}}_{\pi (k)} | \le \frac{C'\cdot r~\text {max}\{n, 2m\}\log {n}\log {2m}}{(1 - \epsilon )n^{2}}, ~~ \forall k. \end{aligned}$$


To obtain the result of Theorem 2, we apply Theorem 3 in [31]. To achieve \(\frac{||{\hat{\theta }} _{g} - \theta _{g} ||_{2}}{||\theta _{g} ||_{2}} = O(\frac{1}{\log {2m}})\) for the good gradual items, we need

$$\begin{aligned} \frac{r~\text {max}\{n,2m\}\log {n}\log {2m}}{(1-\epsilon )Kn^{2}} \le \frac{1}{\log {2m}}\\ \sqrt{\frac{\log {n}}{(1-\epsilon )Kn}} \le \frac{1}{\log {2m}} \end{aligned}$$

which requires that \((1 - \epsilon )Kn^{2} > r\,\text {max}\{n, 2m\}\log {n}\log ^{2}{2m}\) and \((1 - \epsilon )Kn > n\log {n}\log ^{2}{2m}\), respectively. Note that the former requirement is stricter than the latter; this implies that the clustering step of Algorithm 1 needs more object pairs than the score vector estimation step to achieve the same error rate.

\(\square\)

B Some Properties of the Bradley-Terry Model

The proofs in this section are adapted from [28] and [31]. First, we introduce some additional notation used in this section. Let I denote the identity matrix, let \({\textbf{1}}\) denote the all-one vector, and let \(J = \textbf{11}^{\top }\) denote the all-one matrix. This section aims to prove that if the parameters \(\theta\) of the Bradley-Terry model are drawn from a uniform distribution on a hypercube, then

$$\begin{aligned} \begin{aligned} ||{\bar{S}}_{u} - {\bar{S}}_{v} ||_{2} \ge C(1-\epsilon )n,\\ \text {where } u, v \text { belong to different clusters.} \end{aligned} \end{aligned}$$

We fix \(b \in \mathbb {R}\) and assume that, for each cluster \(k \in \{1, 2,..., r\}\) and each item \(i \in \{1, 2,..., n\}\), \(\theta _{k, i}\) is drawn from a uniform distribution on [0, b]. The proof is divided into two parts: (1) the case \(b \le \frac{4}{3}\) and (2) the case \(b > \frac{4}{3}\). Due to space constraints, these proofs appear in the Appendix of [28].

It should be noted that \(AA^{\top } = nI - J \triangleq L_{n}\), the Laplacian of the complete graph. It is easy to verify that the eigenvalues of \(L_{n}\) are 0 and n, and that the eigenvector corresponding to the zero eigenvalue is \(\frac{1}{\sqrt{n}}{\textbf{1}}\). Therefore, A has rank \(n - 1\) and all of its nonzero singular values equal \(\sqrt{n}\). Since the SVD of A is \(A = \sqrt{n}UV^{\top }\) and \(S_{g} = \frac{1}{\sqrt{n}}R_{g}A^{\top }\), we have that

$$\begin{aligned} {\bar{S}}_{g} = {\bar{R}}_{g}UV^{\top }. \end{aligned}$$
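These facts are straightforward to check numerically. A minimal sketch, assuming A is the \(n \times \binom{n}{2}\) pair-incidence matrix with one +1/−1 column per object pair:

```python
import itertools
import numpy as np

n = 6
pairs = list(itertools.combinations(range(n), 2))
A = np.zeros((n, len(pairs)))
for col, (i, j) in enumerate(pairs):
    A[i, col], A[j, col] = 1.0, -1.0   # pair (i, j): +1 for i, -1 for j

L = A @ A.T                            # Laplacian of the complete graph
assert np.allclose(L, n * np.eye(n) - np.ones((n, n)))   # L_n = nI - J

# Eigenvalues: 0 once (eigenvector ones/sqrt(n)), n with multiplicity n-1.
print(np.round(np.linalg.eigvalsh(L), 8))

# Hence rank(A) = n - 1 and every nonzero singular value equals sqrt(n).
print(np.round(np.linalg.svd(A, compute_uv=False), 8))
```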

Using the probability density function in Eq. 2, we compute \({\bar{R}}_{g}\):

$$\begin{aligned}&\mathbb {E}[R_{g, i, j}] = (1 - \epsilon )\frac{e^{\theta _{g,i}} - e^{\theta _{g, j}}}{e^{\theta _{g,i}} + e^{\theta _{g, j}}} \triangleq (1 - \epsilon )f(\theta _{g, i} - \theta _{g, j}),\\&\text {where } f(x) = \frac{e^{x} - 1}{e^{x} + 1}. \text { Then}\\&\mathbb {E}[R_{g}] = (1 - \epsilon )f(\theta _{g}A), \end{aligned}$$

where f is applied entrywise to the vector \(\theta _{g}A\).
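A small sketch of this expectation, assuming \(\theta _{g}\) holds the Bradley-Terry scores of the n objects and using the identity \(f(x) = \tanh (x/2)\); the \(1/\sqrt{n}\) normalization in the definition of \(S_{g}\) is omitted for readability:

```python
import itertools
import numpy as np

def expected_net_win(theta, eps=0.1):
    """Expected net-win vector: entry i sums the expected outcomes
    E[R_{g,i,j}] = (1 - eps) * f(theta_i - theta_j) over all pairs,
    where (1 - eps) is the probability that a pair is observed."""
    def f(x):
        return np.tanh(x / 2.0)        # equals (e^x - 1) / (e^x + 1)
    s = np.zeros(len(theta))
    for i, j in itertools.combinations(range(len(theta)), 2):
        e = (1.0 - eps) * f(theta[i] - theta[j])
        s[i] += e                      # i expects e net wins against j
        s[j] -= e
    return s

theta = np.array([0.9, 0.5, 0.1])      # higher score => more expected wins
print(expected_net_win(theta))         # decreasing entries, summing to zero
```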

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Owuor, D.O., Runkler, T., Laurent, A. et al. Clustering-based gradual pattern mining. Int. J. Mach. Learn. & Cyber. 15, 2263–2281 (2024). https://doi.org/10.1007/s13042-023-02027-w
