Abstract
We consider the problem of clustering a set of points so as to minimize the maximum intra-cluster dissimilarity, which is strongly NP-hard. Exact algorithms for this problem can handle datasets containing up to a few thousand observations, which falls far short of current needs. The most popular heuristic for this problem, the complete-linkage hierarchical algorithm, provides feasible solutions that are usually far from optimal. We introduce a sampling-based exact algorithm aimed at solving large datasets. The algorithm alternates between an exact procedure applied to a small sample of points and a heuristic procedure that attempts to prove the optimality of the current solution. Our computational experiments show that the algorithm can solve to optimality problems containing more than 500,000 observations within moderate time limits, two orders of magnitude more than previous exact methods could handle.
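The alternation described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes Euclidean distance as the dissimilarity, replaces the exact procedure with brute-force enumeration (usable only on tiny samples), and uses a simple greedy assignment as the heuristic. All function names (`diameter`, `solve_exact`, `try_extend`, `sampling_minimax`) are hypothetical. The sketch relies on one fact: the optimal value on a sample is a lower bound for the full dataset, so if the sample solution extends to every point without exceeding that value, the extended clustering is optimal.

```python
import math
from itertools import product


def diameter(points, labels):
    """Largest distance between two points that share a cluster label."""
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if labels[i] == labels[j]:
                worst = max(worst, math.dist(points[i], points[j]))
    return worst


def solve_exact(points, k):
    """Brute-force optimum over all k^n labelings (tiny inputs only)."""
    best_val, best_labels = math.inf, None
    for labels in product(range(k), repeat=len(points)):
        val = diameter(points, labels)
        if val < best_val:
            best_val, best_labels = val, list(labels)
    return best_val, best_labels


def try_extend(points, sample, sample_labels, bound, k):
    """Greedily place each remaining point in a cluster without exceeding bound.

    Returns (labels, None) on success, or (None, index) of the first point
    that fits in no cluster.
    """
    assigned = dict(zip(sample, sample_labels))
    for p in range(len(points)):
        if p in assigned:
            continue
        for c in range(k):
            if all(math.dist(points[p], points[q]) <= bound
                   for q, cq in assigned.items() if cq == c):
                assigned[p] = c
                break
        else:
            return None, p  # no cluster can take p within the bound
    return [assigned[i] for i in range(len(points))], None


def sampling_minimax(points, k):
    """Grow the sample until its optimum extends to the whole dataset."""
    sample = list(range(min(k + 1, len(points))))  # arbitrary initial sample
    while True:
        bound, sample_labels = solve_exact([points[i] for i in sample], k)
        labels, stuck = try_extend(points, sample, sample_labels, bound, k)
        if labels is not None:
            return bound, labels  # lower bound attained, hence optimal
        sample.append(stuck)  # the blocking point joins the sample


points = [(0, 0), (1, 0), (0, 1), (10, 0), (11, 0), (10, 0.5)]
best, labels = sampling_minimax(points, k=2)
```

The loop always terminates because each failed extension adds a point that was not yet in the sample; in the worst case the sample becomes the full dataset and the exact procedure solves the whole problem.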


Acknowledgements
This research was financed by the Fonds de recherche du Québec - Nature et technologies (FRQNT) under grant no. 181909 and by the Natural Sciences and Engineering Research Council of Canada (NSERC) under grants 435824-2013 and 2017-05617. This support is gratefully acknowledged.
Cite this article
Aloise, D., Contardo, C. A sampling-based exact algorithm for the solution of the minimax diameter clustering problem. J Glob Optim 71, 613–630 (2018). https://doi.org/10.1007/s10898-018-0634-1
DOI: https://doi.org/10.1007/s10898-018-0634-1