Abstract
In the domain of data preparation for supervised classification, filter methods for variable ranking are time-efficient. However, their intrinsic univariate limitation prevents them from detecting redundancies or constructive interactions between variables. This paper introduces a new method to automatically, rapidly and reliably extract the classificatory information of a pair of input variables. It is based on a simultaneous partitioning of the domains of both input variables, into intervals in the numerical case and into groups of categories in the categorical case. The resulting input data grid makes it possible to quantify the joint information between the two input variables and the output variable. The best joint partitioning is found by maximizing a Bayesian model selection criterion. Intensive experiments demonstrate the benefits of the approach, especially significant improvements in classification accuracy.
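To make the data grid idea concrete, here is a minimal Python sketch, not the paper's algorithm: it partitions two numeric inputs into equal-frequency intervals, quantifies the joint information between the resulting grid cell and the class label via mutual information, and selects a grid size with a crude BIC-style penalty standing in for the paper's exact Bayesian criterion. All names (equal_freq_bins, penalized_score) and the scoring choices are illustrative assumptions.

```python
import numpy as np

def equal_freq_bins(x, k):
    """Assign each value of a numeric variable to one of k equal-frequency intervals."""
    # Interior quantile cut points; ties in x may merge bins, acceptable for a sketch.
    edges = np.quantile(x, np.linspace(0, 1, k + 1)[1:-1])
    return np.searchsorted(edges, x, side="right")

def grid_mutual_information(x1, x2, y, k1, k2):
    """Mutual information (in nats) between the (k1 x k2) grid cell and the class label."""
    cell = equal_freq_bins(x1, k1) * k2 + equal_freq_bins(x2, k2)
    mi = 0.0
    for c in np.unique(cell):
        p_cell = np.mean(cell == c)
        for cls in np.unique(y):
            p_joint = np.mean((cell == c) & (y == cls))
            if p_joint > 0:
                mi += p_joint * np.log(p_joint / (p_cell * np.mean(y == cls)))
    return mi

def penalized_score(x1, x2, y, k1, k2):
    """Fit minus a crude BIC-style complexity penalty on the number of grid cells.

    This penalty is only a placeholder: the paper derives an exact Bayesian
    (MODL) model selection criterion rather than using a generic penalty."""
    n = len(y)
    return n * grid_mutual_information(x1, x2, y, k1, k2) \
           - 0.5 * (k1 * k2 - 1) * np.log(n)

# Toy usage: an XOR-style interaction, invisible to any univariate filter.
rng = np.random.default_rng(0)
x1, x2 = rng.uniform(size=1000), rng.uniform(size=1000)
y = ((x1 > 0.5) ^ (x2 > 0.5)).astype(int)
best = max(((k1, k2) for k1 in range(1, 6) for k2 in range(1, 6)),
           key=lambda g: penalized_score(x1, x2, y, *g))
print("selected grid:", best)   # typically (2, 2) for this toy data
```

A brute-force scan over grid sizes suffices for this toy example; on real data the model space is far larger, which is why a principled criterion and an efficient search over joint partitions matter.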
Cite this article
Boullé, M. Optimum simultaneous discretization with data grid models in supervised classification: a Bayesian model selection approach. Adv Data Anal Classif 3, 39–61 (2009). https://doi.org/10.1007/s11634-009-0038-7