Abstract
Ensemble learning is an effective technique for improving the performance and stability of single classifiers. This work proposes a selective ensemble classification strategy for missing-data classification, in which uncertain extreme learning machines with probability constraints serve as individual (base) classifiers. Three selective ensemble frameworks are then developed to optimize ensemble margin distributions and aggregate the individual classifiers. The first two are robust ensemble frameworks built on the proposed loss functions. The third is a sparse ensemble classification framework with zero-norm regularization, which automatically selects the required individual classifiers. The majority voting method is then applied to produce the ensemble classifier for missing-data classification. We establish several important properties of the proposed loss functions, including robustness, convexity, and Fisher consistency. To verify the validity of the proposed methods, numerical experiments are conducted on benchmark datasets with missing feature values: missing features are first imputed with the expectation-maximization (EM) algorithm, and the experiments are then run on the imputed datasets. Under different probability lower bounds on classification accuracy and different proportions of missing values, the experimental results show that the proposed ensemble methods achieve better or comparable generalization than traditional methods for classifying data with missing values.
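The final aggregation step described above can be sketched in a few lines; the function name `majority_vote` and the \(\pm 1\) label encoding below are illustrative assumptions, not interfaces taken from the paper.

```python
# Hedged sketch of the majority-voting aggregation step that combines the
# selected base classifiers' predictions into an ensemble decision.
# Labels are encoded as +1 / -1; ties are broken toward +1 (an assumption).

def majority_vote(predictions):
    """Combine per-classifier predictions into one ensemble label per sample.

    predictions: list of lists; predictions[t][i] is classifier t's label
    (+1 or -1) for sample i.
    """
    n_samples = len(predictions[0])
    ensemble = []
    for i in range(n_samples):
        vote_total = sum(p[i] for p in predictions)
        # Sign of the vote total decides the ensemble label.
        ensemble.append(1 if vote_total >= 0 else -1)
    return ensemble

# Three base classifiers voting on four samples:
votes = [[1, 1, -1, -1],
         [1, -1, -1, 1],
         [-1, 1, -1, -1]]
print(majority_vote(votes))  # -> [1, 1, -1, -1]
```

In the paper's pipeline this step is applied after the selective ensemble frameworks have chosen and weighted the base classifiers; the sketch shows only the unweighted voting rule.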
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos. 11471010, 11271367). The authors thank the referees and the editor for their constructive comments, which significantly improved the paper.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1
In this appendix, we prove that the following triangle inequality holds for \(\eta (u)=1-e^{-\alpha |u|}\ (\alpha >0,\ u\in R)\):
$$\eta (u_{1}+u_{2})\le \eta (u_{1})+\eta (u_{2}).$$
Indeed,
$$\eta (u_{1}+u_{2})=1-e^{-\alpha |u_{1}+u_{2}|}\le 1-e^{-\alpha (|u_{1}|+|u_{2}|)}=1-e^{-\alpha |u_{1}|}e^{-\alpha |u_{2}|}\le \big (1-e^{-\alpha |u_{1}|}\big )+\big (1-e^{-\alpha |u_{2}|}\big )=\eta (u_{1})+\eta (u_{2}).$$
The first inequality holds because \(|u_{1}+u_{2}|\le |u_{1}|+|u_{2}|\) and \(f(x)=e^{-x}\) is monotonically decreasing. The last inequality holds because \((1-e^{-\alpha |u_{1}|})(1-e^{-\alpha |u_{2}|})\ge 0\) for \(\alpha >0\).
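The triangle inequality above can also be checked numerically; the following sketch verifies it over an illustrative grid of \(\alpha\) and \((u_{1},u_{2})\) values (the grid is an assumption, chosen only for the check).

```python
# Numerical sanity check of the triangle inequality
# eta(u1 + u2) <= eta(u1) + eta(u2) for eta(u) = 1 - exp(-alpha * |u|),
# sampled over a grid of alpha and (u1, u2) values.
import math

def eta(u, alpha):
    return 1.0 - math.exp(-alpha * abs(u))

alphas = [0.1, 1.0, 5.0]
grid = [x / 2.0 for x in range(-10, 11)]  # -5.0 .. 5.0 in steps of 0.5
for alpha in alphas:
    for u1 in grid:
        for u2 in grid:
            # Small tolerance guards against floating-point round-off.
            assert eta(u1 + u2, alpha) <= eta(u1, alpha) + eta(u2, alpha) + 1e-12
print("triangle inequality holds on the sampled grid")
```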
Appendix 2
Generally speaking, a DC program takes the form
$$(P_{dc})\qquad \inf \{f(x)=g(x)-h(x):\ x\in R^{n}\},$$
where g and h are lower semicontinuous proper convex functions on \(R^{n}\). Such a function f is called a DC function, and g and h are the DC components of f. A function \(\pi (x)\) is said to be polyhedral convex if
$$\pi (x)=\max _{1\le i\le m}\{\varpi _{i}^{T}x+\sigma _{i}\}+\chi _{{\varOmega }}(x),$$
where \(\varpi _{i}\in R^{n},\sigma _{i}\in R\ (i=1,2,\ldots ,m)\). Here \(\chi _{{\varOmega }}(x)\) is the indicator function of the non-empty convex set \({\varOmega }\), defined as \(\chi _{{\varOmega }}(x)=0\) if \(x\in {\varOmega }\) and \(+\infty\) otherwise. A DC program is called a polyhedral DC program when either g or h is a polyhedral convex function.
A point \(x^{*}\) satisfying the following generalized Kuhn–Tucker condition is called a critical point of \((P_{dc})\):
$$\partial g(x^{*})\cap \partial h(x^{*})\ne \emptyset ,$$
where \(\partial h\) is the subdifferential of the convex function h. It follows that if h is polyhedral convex, then such a critical point for \((P_{dc})\) is almost always a local solution for \((P_{dc})\).
The necessary local optimality condition for \((P_{dc})\) is
$$\emptyset \ne \partial h(x^{*})\subseteq \partial g(x^{*}),$$
which is also sufficient for many important classes of DC programs, for example, polyhedral DC programs or programs in which f is locally convex at \(x^{*}\). We use \(g^{*}(y)=\sup \{x^{T}y-g(x):\ x\in R^{n}\}\) to denote the conjugate function of g. The Fenchel–Rockafellar dual of \((P_{dc})\) is defined as
$$(D_{dc})\qquad \inf \{h^{*}(y)-g^{*}(y):\ y\in R^{n}\}.$$
DCA is an iterative algorithm based on local optimality conditions and duality. The idea of DCA is simple: at each iteration, one replaces the second component h in the primal DC problem \((P_{dc})\) by its affine minorization, \(h(x^{k})+(x-x^{k})^{T}y^{k}\), to generate the convex program
$$x^{k+1}\in \arg \min \{g(x)-[h(x^{k})+(x-x^{k})^{T}y^{k}]:\ x\in R^{n}\},\qquad (46)$$
which is equivalent to determining \(x^{k+1} \in \partial g^{*}(y^{k})\). Likewise, the second DC component \(g^{*}\) of the dual DC program \((D_{dc})\) is replaced by its affine minorization, \(g^{*}(y^{k})+(y-y^{k})^{T}x^{k+1}\), to obtain a convex program that is equivalent to determining \(y^{k+1}\in \partial h(x^{k+1})\).
In practice, a simplified form of the DCA is used. Two sequences \(\{x^{k}\}\) and \(\{y^{k}\}\) satisfying \(y^{k}\in \partial h(x^{k})\) are constructed, and \(x^{k+1}\) is a solution to the convex program (46). The simplified DCA scheme is described as follows.
Initialization: Choose an initial point \(x^{0}\in R^{n}\) and set \(k=0\)
Repeat
Calculate \(y^{k}\in \partial h(x^k)\)
Solve convex program (46) to obtain \(x^{k+1}\)
Set k:=k+1
Until some stopping criterion is satisfied.
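The simplified scheme above can be illustrated on a toy DC program; the decomposition \(g(x)=x^{4}/4\), \(h(x)=x^{2}/2\) below is an illustrative choice (not from the paper) for which the convex subproblem has a closed-form solution.

```python
# Hedged sketch of the simplified DCA scheme on a toy DC program
# f(x) = g(x) - h(x) with g(x) = x**4 / 4 and h(x) = x**2 / 2 (both convex),
# so f(x) = x**4/4 - x**2/2 is nonconvex with local minima at x = -1 and x = 1.
# Since h is differentiable, y_k = h'(x_k) = x_k, and the convex subproblem
# min_x g(x) - x * y_k is solved by g'(x) = y_k, i.e. x = y_k ** (1/3).

def dca(x0, max_iter=100, tol=1e-10):
    x = x0
    for _ in range(max_iter):
        y = x  # y_k in the subdifferential of h at x_k (here: the gradient)
        # Closed-form solution of the convex subproblem (real cube root):
        x_new = (abs(y) ** (1.0 / 3.0)) * (1 if y >= 0 else -1)
        if abs(x_new - x) < tol:  # stopping criterion
            return x_new
        x = x_new
    return x

x_star = dca(2.0)
print(round(x_star, 6))  # -> 1.0 (a critical point of f)
```

Each iterate solves one convex subproblem, and the sequence descends toward a critical point of f, which is exactly the behavior the scheme and properties above describe.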
DCA is a descent algorithm without line search. The following properties are used in the next sections (for simplicity, we omit the dual parts of these properties):
(1) If \(g(x^{k+1})-h(x^{k+1}) = g(x^{k})-h(x^{k})\), then \(x^{k}\) is a critical point of \((P_{dc})\). In this case, DCA terminates at the k-th iteration.
(2) Let \(y^{*}\) be a local solution to the dual of \((P_{dc})\) and \(x^{*}\in \partial g^{*}(y^{*})\). If h is differentiable at \(x^{*}\), then \(x^{*}\) is a local solution to \((P_{dc})\).
(3) If the optimal value of problem \((P_{dc})\) is finite and the infinite sequence \(\{x^{k}\}\) is bounded, then every limit point \(x^{*}\) of the sequence \(\{x^{k}\}\) is a critical point of \((P_{dc})\).
(4) DCA converges linearly for general DC programs. In particular, for polyhedral DC programs, the sequence \(\{x^{k}\}\) contains finitely many elements, and the algorithm converges in a finite number of iterations to a critical point satisfying the necessary local optimality condition.
Moreover, if the second DC component h of \((P_{dc})\) is differentiable, then the subdifferential of h at the point \(x^{k}\) reduces to a singleton, \(\partial h(x^{k})=\{\nabla h(x^{k})\}\). In this case, \(x^{k+1}\) is a solution to the following convex program:
$$x^{k+1}\in \arg \min \{g(x)-x^{T}\nabla h(x^{k}):\ x\in R^{n}\}.$$
Cite this article
Jing, S., Wang, Y. & Yang, L. Selective ensemble of uncertain extreme learning machine for pattern classification with missing features. Artif Intell Rev 53, 5881–5905 (2020). https://doi.org/10.1007/s10462-020-09836-3