
Information Sciences 555 (2021) 357–385


A clusterwise nonlinear regression algorithm for interval-valued data
Francisco de A.T. de Carvalho a,*, Eufrásio de A. Lima Neto b, Kassio C.F. da Silva c

a Centro de Informatica, Universidade Federal de Pernambuco, Av. Jornalista Anibal Fernandes s/n – Cidade Universitaria, CEP 50740-560 Recife, PE, Brazil
b Universidade Federal da Paraíba, Departamento de Estatística, Cidade Universitária, 58051-900 João Pessoa, PB, Brazil
c Pró-Reitoria de Planejamento, Universidade Federal Rural do Semi-Árido, Av. Francisco Mota 572 – Bairro Costa e Silva, CEP 59625-900 Mossoró, RN, Brazil

ARTICLE INFO

Article history:
Received 11 September 2019
Received in revised form 13 October 2020
Accepted 16 October 2020
Available online 27 October 2020

Keywords:
Nonlinear regression
Clusterwise regression
Interval-valued data
Partitioning clustering

ABSTRACT

Interval-valued variables are required in data analysis since this type of data represents either the uncertainty existing in an error measurement or the natural variability of the data. Currently, methods and algorithms which aim to manage interval-valued data are very much required. Hence, this paper presents a center and range clusterwise nonlinear regression algorithm for interval-valued data. The proposed algorithm combines a k-means type algorithm with the center and range linear and nonlinear regression methods for interval-valued data, with the aim to identify both the partition of the data and the relevant regression models fitted on the center and range of the intervals simultaneously, one for each cluster. The proposed method is able to automatically select the best pair of center and range (linear and/or nonlinear) functions according to optimization criteria. A simulation study with synthetic data sets was undertaken with the purpose of assessing the parameter estimation and the prediction performance of the proposed algorithm. Finally, applications on real data sets were performed and the prediction accuracy of the proposed method was compared to the linear case. The results obtained showed that the proposed method performed well on both synthetic and real data sets.

© 2020 Elsevier Inc. All rights reserved.

1. Introduction

Information Technology advances have allowed for a huge amount of data to be collected and stored in many fields, such
as finance, commerce, biology, medicine, social networks, etc. The information collected in these big data sets is heteroge-
neous, imprecise, and time-varying. Currently, there is an important requirement to aggregate these huge data sets into
information granules, which simultaneously keep as much information as possible while also enabling the results to be
easily explained and supported with empirical evidence. These information granules are often referred to as symbolic or
granular data.
Granular data analysis has seen rapid growth in interest over the last two decades and is often carried out through the formal frameworks of interval analysis, fuzzy sets, random sets, rough sets, and shadowed sets [40]. Symbolic data analysis extends exploratory statistics and data mining methods to manage symbolic data and is often supported by statistical and multivariate approaches [3,18]. Recent special issues [20,28,46] testify to the growing interest in these fields.

* Corresponding author.
E-mail addresses: fatc@cin.ufpe.br (Francisco de A.T. de Carvalho), eufrasio@de.ufpb.br (Eufrásio de A. Lima Neto), kassio.silva@ufersa.edu.br (K.C.F. da Silva).

https://doi.org/10.1016/j.ins.2020.10.054
0020-0255/© 2020 Elsevier Inc. All rights reserved.

In classical data analysis, the variables (numerical or categorical) used to describe the objects are usually single-
valued, which means that, for a given object, a variable takes on a single quantitative or qualitative value. However,
in many real situations, the utilization of single-valued variables can be very restrictive, especially when analyzing a
group rather than an individual, where the variability inherent to the group must be taken into account. Thus, the
aggregation of single-valued observations allows the creation of new variable types, such as set-valued variables,
interval-valued variables, or even histogram-valued variables. Within the framework of statistics and multivariate data
analysis, Symbolic Data Analysis (SDA) deals with the specific kind of information granules referred to as symbolic data
[3,18].
This paper focuses on clusterwise regression methods for interval-valued data. This type of data can represent the impre-
cision and/or uncertainty existing in an error measurement, but it can also represent the natural variability of the data. This
paper will only consider the second case. Interval-valued data arise in practical situations, such as recording monthly interval
temperatures at meteorological stations, daily interval stock prices, etc. Another source of interval-valued data can be found
in the aggregation of huge databases into a reduced number of groups, with their properties described by interval-valued
variables. Therefore, tools for interval-valued data analysis are very much required.

1.1. Regression overview

Regression analysis is designed to estimate the form of a relationship between a dependent variable $Y$ and a set of independent variables $X_1, \ldots, X_p$ based on a mathematical function which depends on $Y, X_1, \ldots, X_p$ and a set of parameters $\beta_0, \beta_1, \ldots, \beta_p$ that need to be estimated. The least squares method is often used to estimate the vector of parameters $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$ and does not require any probabilistic assumption for the error of the model.
The linear regression model is the simplest form with which to represent a relationship between the variables.
However, in many real situations, the variables present a nonlinear relationship, making the linear regression unsuit-
able, i.e., the dependent variable $Y$ and its predictors are related through a nonlinear function. The algebraic process to
obtain the least squares estimates from nonlinear relationships is the same as in the linear case, but in several cases the
resulting normal equations demand the use of iterative procedures in order to find the estimates of the parameter vector
$\boldsymbol{\beta}$.
With regard to regression methods for granular data, Cerny et al. [9] present a possibilistic generalization of the least
squares estimator, defined as OLS-set for the interval model. The proposed models consider that both the input data and out-
put data are affected by loss of information caused by uncertainty, indeterminacy, rounding or censoring. Cimino et al. [11]
propose a multilayer perceptron to interval-valued regression that is trained using a genetic algorithm designed to fit data
with different levels of granularity. Boukezzoula et al. [5] propose the so-called gradual regression, which is supported by
gradual interval arithmetic and the notion of gradual intervals, to be an extension of the imprecise interval-based regression.
Zuo et al. [50] present three granular fuzzy regression methods to determine the estimated values for a regression target.
What makes these proposed methods original is that they take into account granularity in transfer learning. Chen and Miao
[10] present a set-based granular regression model where granules are constructed by introducing a distance metric on
single-atom features. The paper introduces a gradient descent method for the granular regression model that achieves an
optimal solution of granular regression. Finally, with regards to robust regression methods for granular data, Peters and Lacic
[41] suggest an approach to tackle outliers in granular box regression. The paper considers three methods of tackling outliers
in granular box regression and discusses their properties. Later, Mashinchi et al. [33] modified the granular box regression
approaches to deal with data sets with outliers by incorporating a three-stage procedure including granular box configura-
tion, outlier elimination, and linear regression analysis.
With regards to SDA, Lima Neto and de Carvalho [36] propose a linear regression that consists of fitting a model for the centers and the ranges, independently, and then combining them to obtain the predicted values of the interval-valued response variable $\hat{y} = [\hat{y}_L, \hat{y}_U]$. In order to guarantee that the lower bound is less than or equal to the upper bound, avoiding the inversion of interval bounds, a constrained half-range version of the linear regression model for interval-valued variables was introduced by Lima Neto and de Carvalho [37], using quadratic programming techniques to estimate the parameters of the model. Giordani [26] introduces a Lasso-based constrained regression model for interval-valued data that also guarantees mathematical coherence between the predicted values of the lower and upper boundaries of the intervals. A model that adopts overlapping constraints in order to improve prediction was introduced in Ref. [27]; it is built as a special case of the least squares with inequality (LSI) problem.
Some studies have addressed inference techniques in the regression model for interval-valued variables by considering a probabilistic support for the response interval-valued variable $Y = [Y_L, Y_U]$ [7,35,38]. Another advance in this field is the use of the expectation–maximization algorithm to estimate the parameters of the model [47] when the data are interval-valued.
Regarding nonlinear and non-parametric regression techniques, Fagundes et al. [23] introduce kernel regression for interval-valued data. Jeon et al. [29] propose to use a Gaussian kernel type estimator to approximate the distribution of the interval's hyper-rectangles in a nonparametric way. Lima Neto and de Carvalho [39] introduce a center and range nonlinear regression method for interval-valued variables that uses iterative procedures to find the estimates of the vector $\boldsymbol{\beta}$.

1.2. Clusterwise regression overview

Clusterwise regression is a useful technique when heterogeneity is present in the data. It aims to identify both the par-
tition of the data in a previously specified number of clusters and the corresponding regression models, one for each cluster.
From a model-based perspective, the cluster-specific regressions can be viewed as a mixture model whose parameters are estimated by maximum likelihood. DeSarbo and
Cron [16] present a maximum likelihood methodology for clusterwise linear regression where an EM algorithm is used to
estimate separate regression functions and membership in K clusters. Garcia-Escudero et al. [25] propose a robust cluster-
wise regression through trimming. This method allows different scatters for the regression errors together with different
group weightings. Qiang and Weixin [49] introduce a semi-parametric mixture of quantile regression models aiming to
improve robustness to skewness and outliers. The paper presents a kernel density based EM-type algorithm to estimate
the model parameters, and a stochastic version of the EM-type algorithm for the variance estimation. More recently, Mazza
and Punzo [34] introduce mixtures of multivariate contaminated normal regression models. The paper provides identifiability conditions and outlines an expectation-conditional maximization algorithm for parameter estimation.
Clusterwise regression has also been studied both from a fuzzy data analysis and from a mathematical programming perspective. With regards to fuzzy data analysis, D'Urso et al. [22] propose a class of fuzzy clusterwise regression models with an LR fuzzy response variable and numeric explanatory variables. Moreover, the paper introduces a set of goodness-of-fit indices and considers some cluster validity criteria that are useful in identifying the "optimal" number of clusters.
Shao-Tung et al. [44] present a possibilistic c-regression model algorithm by incorporating possibilistic c-means into switch-
ing regressions. Then, it proposes a schema for a nested stepwise procedure for possibilistic c-regressions which aims to
improve the robustness of the model to noise and outliers. More recently, Di Mari et al. [32] proposed a fully data-dependent soft constrained method for the maximum likelihood estimation of clusterwise linear regression; based on the homoscedastic variance and a cross-validated tuning parameter, the method imposes soft scale bounds.
Concerning the mathematical programming perspective, Kin-Nam et al. [31] present a nonlinear programming procedure
with linear constraints to estimate the cluster memberships and the regression coefficients simultaneously. Furthermore, a
clusterwise discriminant model is developed to incorporate parameter heterogeneity into the traditional discriminant anal-
ysis. Carbonneau et al. [8] present a column generation based approach for solving the clusterwise regression problem. The
paper shows that the proposed strategy outperforms the best-known alternative when the number of clusters is greater than
three. It also demonstrates the application of the new paradigm of using incrementally larger ending subsets to strengthen
the lower bounds of a branch and bound search.
From an exploratory data analysis perspective, clusterwise regression can be understood as a combination of clustering
and regression analysis based on the minimization of a suitable loss function. Ref. [45] proposes a clusterwise linear regres-
sion method that provides a partition of the observations in K clusters and K linear regression models simultaneously such
that the overall sum-of-squared errors within those clusters becomes a minimum. Preda and Saporta [43] propose cluster-
wise PLS regression on a stochastic process aiming to solve multicollinearity problems for regression and also when the
number of observations is smaller than the number of predictor variables. Vicari and Vichi [48] introduce a general regres-
sion model to account for both the between-cluster and the within-cluster regression variation. Furthermore, the paper pro-
vides the derivation of the least-squares estimation of the parameters with an appropriate coordinate descent algorithm and
some decompositions of the variance of the responses. Bagirov et al. [1] present an incremental algorithm, which uses pow-
erful smooth optimization techniques, to solve the CLR problem. The paper also shows that the proposed algorithm can find
the global or near-global solutions if the data sets are sufficiently dense. Beauregard et al. [4] present a clusterwise PLS for
multiblock component methods, which is suitable for items that consist of several blocks of variables and aims to analyze
the relationships between these blocks. Finally, it was from an exploratory data analysis perspective that
Ref. [14] used the center and range approach to adapt clusterwise linear regression to interval-valued data.

1.3. Paper proposal

This paper proposes a center and range clusterwise nonlinear regression method for interval-valued variables, hereafter
named iCNLR (interval center and range clusterwise nonlinear regression). Its main contributions are as follows:

• The proposed method combines the center and range linear (CRM) [36] and/or nonlinear (NLM) [39] regression methods for interval-valued data with a dynamic clustering algorithm [19]. This is a clustering algorithm related to the k-means algorithm, which aims to identify both the partition of the data and the relevant regression models fitted on the centers and ranges of the intervals simultaneously, one for each cluster;
• In comparison with the interval center and range clusterwise linear regression method (iCLR) proposed by de Carvalho et al. [14], the new iCNLR method is able to provide the partition of the interval-valued data into a fixed number of clusters. It can also select, for each cluster, the best-fitting pair of linear and/or nonlinear functions on the center and range of the intervals from a set of nonlinear models (including the linear model).

The remainder of the paper is structured as follows: Sections 2.1 and 2.2 present the linear (CRM) and nonlinear (NLM)
center and range methods for interval-valued data, respectively. Section 2.3 describes the linear clusterwise regression
method for interval-valued data (iCLR). Section 2.4 presents the proposed method for clusterwise nonlinear regression for

interval-valued data. It describes the corresponding algorithm that starts from an initial solution and alternates two steps,
where the first step updates the partition in a predefined number of clusters, and the second step fits linear/nonlinear regres-
sion models on the center and ranges of the intervals, one for each cluster. Furthermore, Section 2.4 also presents the proof of
the convergence of the algorithm. Section 3 provides a discussion on experiments performed with synthetic interval-valued
data that seek to measure the estimation and prediction capabilities of the proposed method for different scenarios. Section 4
provides the new method’s prediction performance on real interval-valued data sets. Finally, Section 5 presents the conclud-
ing remarks.

2. Clusterwise nonlinear regression for interval-valued data

This section reviews the CRM [36] and the NLM [39] methods for interval-valued variables as well as the interval center
and range clusterwise linear regression method (iCLR) [14]. The section concludes with the proposed interval center and
range clusterwise nonlinear regression method for interval-valued data (iCNLR).
Let $E = \{e_1, \ldots, e_n\}$ be a set of examples described by $p+1$ interval-valued variables $Y, X_1, \ldots, X_p$. Each example $e_i \in E$ $(1 \le i \le n)$ is represented by an interval-valued feature vector $\mathbf{z}_i = (\mathbf{x}_i, y_i)$, with $\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})$, where $x_{ij} = [a_{ij}, b_{ij}] \in \mathfrak{I} = \{[a, b] : a, b \in \mathbb{R},\ a \le b\}$ $(1 \le j \le p)$ and $y_i = [y_{Li}, y_{Ui}]$ are the observed values of $X_j$ and $Y$, respectively.
Let the centers of these interval-valued variables be represented by the real-valued variables $X^c_j$ $(j = 1, 2, \ldots, p)$ and $Y^c$, while the half-ranges are represented by $X^r_j$ $(j = 1, 2, \ldots, p)$ and $Y^r$. In this case, the example $e_i \in E$ $(1 \le i \le n)$ is represented by $\mathbf{t}_i = (\mathbf{w}_i, \mathbf{r}_i)$, with $\mathbf{w}_i = (\mathbf{x}^c_i, y^c_i)$ and $\mathbf{r}_i = (\mathbf{x}^r_i, y^r_i)$, where $\mathbf{x}^c_i = (x^c_{i1}, \ldots, x^c_{ip})$ and $\mathbf{x}^r_i = (x^r_{i1}, \ldots, x^r_{ip})$ with $x^c_{ij} = (a_{ij} + b_{ij})/2$, $x^r_{ij} = (b_{ij} - a_{ij})/2$, $y^c_i = (y_{Li} + y_{Ui})/2$ and $y^r_i = (y_{Ui} - y_{Li})/2$. Let $D = \{\mathbf{t}_1, \ldots, \mathbf{t}_n\}$ be the observed data set.
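As a tiny illustration of this representation, the conversion from interval bounds to the center and half-range coordinates can be sketched in Python as follows (the function name is an illustrative choice, not from the paper):

```python
import numpy as np

def to_center_range(lower, upper):
    """Map interval bounds [a, b] to (center, half-range):
    c = (a + b) / 2 and r = (b - a) / 2, applied elementwise."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    return (lower + upper) / 2.0, (upper - lower) / 2.0

# e.g. the interval [2, 6] becomes center 4.0 and half-range 2.0
```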

2.1. The CRM method

This subsection offers a brief overview of the CRM method, in which two linear regression models are independently applied to the center and the range of the intervals. In the CRM method, the dependent variables $Y^c, Y^r$ and the independent variables $X^c_j, X^r_j$ $(1 \le j \le p)$ are related by the following linear regression relationships:

$$y^c_i = \beta^c_0 + \sum_{j=1}^{p} \beta^c_j x^c_{ij} + \epsilon^c_i \qquad (1)$$

$$y^r_i = \beta^r_0 + \sum_{j=1}^{p} \beta^r_j x^r_{ij} + \epsilon^r_i \qquad (2)$$

where $\epsilon^c_i$ and $\epsilon^r_i$ are the random errors for the centers and half-ranges, respectively, with $E(\epsilon^c_i) = E(\epsilon^r_i) = 0$, $\mathrm{Var}(\epsilon^c_i) = \sigma^2_c$, $\mathrm{Var}(\epsilon^r_i) = \sigma^2_r$, $\mathrm{Cor}(\epsilon^c_i, \epsilon^c_l) = 0$ and $\mathrm{Cor}(\epsilon^r_i, \epsilon^r_l) = 0$, $\forall i \ne l$. From Eqs. (1) and (2), one can obtain the sum of the squares of deviations for the CRM method, given by:

$$S_{CRM} = \sum_{i=1}^{n} \left[ (\epsilon^c_i)^2 + (\epsilon^r_i)^2 \right] = \sum_{i=1}^{n} \left[ y^c_i - \left( \beta^c_0 + \sum_{j=1}^{p} \beta^c_j x^c_{ij} \right) \right]^2 + \sum_{i=1}^{n} \left[ y^r_i - \left( \beta^r_0 + \sum_{j=1}^{p} \beta^r_j x^r_{ij} \right) \right]^2 = (\mathbf{y}^c - \mathbf{X}^c \boldsymbol{\beta}^c)^\top (\mathbf{y}^c - \mathbf{X}^c \boldsymbol{\beta}^c) + (\mathbf{y}^r - \mathbf{X}^r \boldsymbol{\beta}^r)^\top (\mathbf{y}^r - \mathbf{X}^r \boldsymbol{\beta}^r) \qquad (3)$$

where:

$$\mathbf{X}^c = \begin{pmatrix} 1 & x^c_{11} & \cdots & x^c_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x^c_{n1} & \cdots & x^c_{np} \end{pmatrix}_{(n \times (p+1))}, \qquad \boldsymbol{\beta}^c = \begin{pmatrix} \beta^c_0 \\ \vdots \\ \beta^c_p \end{pmatrix}_{((p+1) \times 1)}, \qquad \mathbf{y}^c = \begin{pmatrix} y^c_1 \\ \vdots \\ y^c_n \end{pmatrix}_{(n \times 1)};$$

$$\mathbf{X}^r = \begin{pmatrix} 1 & x^r_{11} & \cdots & x^r_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x^r_{n1} & \cdots & x^r_{np} \end{pmatrix}_{(n \times (p+1))}, \qquad \boldsymbol{\beta}^r = \begin{pmatrix} \beta^r_0 \\ \vdots \\ \beta^r_p \end{pmatrix}_{((p+1) \times 1)}, \qquad \mathbf{y}^r = \begin{pmatrix} y^r_1 \\ \vdots \\ y^r_n \end{pmatrix}_{(n \times 1)}.$$

In order to obtain the vectors of parameters $\boldsymbol{\beta}^c$ and $\boldsymbol{\beta}^r$ that minimize $S_{CRM}$, Eq. (3) is differentiated with respect to these parameter vectors and the results are set equal to zero. The least squares estimators of $\boldsymbol{\beta}^c$ and $\boldsymbol{\beta}^r$ are the solutions of two independent systems, each with $(p+1)$ normal equations, given by:

$$\hat{\boldsymbol{\beta}}^c = \left( (\mathbf{X}^c)^\top \mathbf{X}^c \right)^{-1} (\mathbf{X}^c)^\top \mathbf{y}^c, \qquad \hat{\boldsymbol{\beta}}^r = \left( (\mathbf{X}^r)^\top \mathbf{X}^r \right)^{-1} (\mathbf{X}^r)^\top \mathbf{y}^r. \qquad (4)$$

Finally, the predicted value for a new example, described by $\mathbf{z} = (\mathbf{x}, y)$, is given by $\hat{y}_L = \hat{y}^c - \hat{y}^r$ and $\hat{y}_U = \hat{y}^c + \hat{y}^r$, where $\hat{y}^c = (\tilde{\mathbf{x}}^c)^\top \hat{\boldsymbol{\beta}}^c$, $\hat{y}^r = (\tilde{\mathbf{x}}^r)^\top \hat{\boldsymbol{\beta}}^r$, $(\tilde{\mathbf{x}}^c)^\top = (1, x^c_1, \ldots, x^c_p)$, $(\tilde{\mathbf{x}}^r)^\top = (1, x^r_1, \ldots, x^r_p)$, $\hat{\boldsymbol{\beta}}^c = (\hat{\beta}^c_0, \hat{\beta}^c_1, \ldots, \hat{\beta}^c_p)^\top$ and $\hat{\boldsymbol{\beta}}^r = (\hat{\beta}^r_0, \hat{\beta}^r_1, \ldots, \hat{\beta}^r_p)^\top$.
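To make the CRM fitting and prediction steps concrete, here is a minimal sketch in Python (NumPy only); the function names are illustrative assumptions, not the authors' code. It fits the two independent least squares models of Eq. (4) and reconstructs the predicted interval bounds.

```python
import numpy as np

def fit_crm(Xc, yc, Xr, yr):
    """Fit the two independent OLS models of the CRM method.

    Xc, Xr: (n, p) matrices of center / half-range covariates.
    yc, yr: (n,) vectors of center / half-range responses.
    Returns the estimated coefficient vectors (beta_c, beta_r),
    each of length p + 1, with the intercept first.
    """
    def ols(X, y):
        Xt = np.column_stack([np.ones(len(X)), X])  # add intercept column
        # least squares solution of the normal equations, Eq. (4)
        beta, *_ = np.linalg.lstsq(Xt, y, rcond=None)
        return beta
    return ols(Xc, yc), ols(Xr, yr)

def predict_crm(beta_c, beta_r, xc, xr):
    """Predict the interval [y_L, y_U] for a new example."""
    yc_hat = beta_c[0] + xc @ beta_c[1:]
    yr_hat = beta_r[0] + xr @ beta_r[1:]
    return yc_hat - yr_hat, yc_hat + yr_hat  # (y_L, y_U)
```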

2.2. The NLM method

The NLM method consists of fitting two nonlinear regression models over the center and half-range of the intervals. The regression relationship between the real-valued variables $Y^c, Y^r$ and $X^c_j, X^r_j$ $(1 \le j \le p)$ is given by

$$y^c_i = f^c(\mathbf{x}^c_i; \boldsymbol{\beta}^c) + \epsilon^c_i \qquad (5)$$

$$y^r_i = f^r(\mathbf{x}^r_i; \boldsymbol{\beta}^r) + \epsilon^r_i, \qquad (6)$$

where $\boldsymbol{\beta}^c$ and $\boldsymbol{\beta}^r$ are $q^c \times 1$ and $q^r \times 1$ vectors of parameters, $f^c$ and $f^r$ are nonlinear functions, and $\mathbf{x}^c_i = (x^c_{i1}, \ldots, x^c_{ip})$ and $\mathbf{x}^r_i = (x^r_{i1}, \ldots, x^r_{ip})$ are two $p \times 1$ vectors representing the values of the $p$ explanatory variables on the center and the half-range of the intervals, respectively.
In contrast to the linear case, the NLM method allows different nonlinear relationships for the center and the ranges of the intervals. Thus, one can have $f^c \ne f^r$ and, consequently, $q^c \ne q^r$; the number of parameters is not directly tied to the number of explanatory variables $X^c_1, \ldots, X^c_p$ and $X^r_1, \ldots, X^r_p$. The number of independent variables is, however, the same in both Eqs. (5) and (6).
The sum of the squared errors for the NLM method is given by

$$S_{NLM} = \sum_{i=1}^{n} (\epsilon^c_i)^2 + \sum_{i=1}^{n} (\epsilon^r_i)^2 = \sum_{i=1}^{n} \left[ y^c_i - f^c(\mathbf{x}^c_i; \boldsymbol{\beta}^c) \right]^2 + \sum_{i=1}^{n} \left[ y^r_i - f^r(\mathbf{x}^r_i; \boldsymbol{\beta}^r) \right]^2. \qquad (7)$$

In nonlinear regression, the normal equation system often does not have a closed-form solution. In order to find the values of $\boldsymbol{\beta}^c$ and $\boldsymbol{\beta}^r$ that minimize $S_{NLM}$, optimization methods such as Broyden-Fletcher-Goldfarb-Shanno (BFGS), Conjugate Gradient (CG) or Simulated Annealing (SANN) [30] can be used. BFGS is a quasi-Newton method that approximates the Hessian matrix using updates specified by gradient evaluations; it converges only if the objective function has a quadratic Taylor expansion near the optimum. The CG method does not require Hessian matrix evaluations or matrix storage and inversion. SANN is a heuristic that performs a probabilistic local search inspired by thermodynamics: it substitutes the current solution with another one in its neighborhood, based on the objective function and the value of a temperature variable $T$. As the algorithm progresses, the value of $T$ decreases and the method converges to a local solution.
Finally, the predicted value for a new example, described by $\mathbf{z} = (\mathbf{x}, y)$, is given by $\hat{y}_L = \hat{y}^c - \hat{y}^r$ and $\hat{y}_U = \hat{y}^c + \hat{y}^r$, where $\hat{y}^c = f^c(\mathbf{x}^c; \hat{\boldsymbol{\beta}}^c)$ and $\hat{y}^r = f^r(\mathbf{x}^r; \hat{\boldsymbol{\beta}}^r)$, with $\mathbf{x}^c = (x^c_1, \ldots, x^c_p)$, $\mathbf{x}^r = (x^r_1, \ldots, x^r_p)$, $\hat{\boldsymbol{\beta}}^c = (\hat{\beta}^c_1, \ldots, \hat{\beta}^c_{q^c})$ and $\hat{\boldsymbol{\beta}}^r = (\hat{\beta}^r_1, \ldots, \hat{\beta}^r_{q^r})$.
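A minimal sketch of the NLM fitting step follows, in Python, using SciPy's general-purpose optimizer as a stand-in for the BFGS/CG heuristics discussed above (SANN has no counterpart in scipy.optimize.minimize). The model function is an illustrative gamma-type curve with the same shape as Eq. (23), used later in the experiments; all names are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def f_nonlinear(x, beta):
    """Illustrative nonlinear model, f(x; b0, b1) = x**(b0-1) * exp(-x/b1)."""
    b0, b1 = beta
    return x ** (b0 - 1.0) * np.exp(-x / b1)

def fit_nlm_component(x, y, beta_init, method="BFGS"):
    """Least squares fit of one NLM component (center or half-range).

    Minimizes the corresponding term of S_NLM in Eq. (7) numerically;
    method can be "BFGS", "CG", etc. Returns estimates and a
    convergence flag (used later by the fallback scheme of Section 3.3).
    """
    sse = lambda beta: np.sum((y - f_nonlinear(x, beta)) ** 2)
    res = minimize(sse, beta_init, method=method)
    return res.x, res.success
```

In practice the routine is applied twice, once to the centers and once to the half-ranges, so that $f^c$ and $f^r$ can differ.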

2.3. The iCLR method

The interval Center and Range Clusterwise Linear Regression method (iCLR) proposed by de Carvalho et al. [14] combines the dynamic clustering algorithm [19] with the center and range linear regression method for interval-valued data [36]. It provides a partition of a set of observations $E$ into a fixed number of clusters $P_1, \ldots, P_K$, and $K$ center and range linear regression models, one for each cluster.
More precisely, in the iCLR method, for each cluster $P_k$ $(1 \le k \le K)$, let $n_k = |P_k|$ and let $e_{i_l} \in P_k$ be described by two feature vectors $\mathbf{w}_{i_l} = (\mathbf{x}^c_{i_l}, y^c_{i_l})$ and $\mathbf{r}_{i_l} = (\mathbf{x}^r_{i_l}, y^r_{i_l})$, where $\mathbf{x}^c_{i_l} = (x^c_{i_l 1}, \ldots, x^c_{i_l p})$ and $\mathbf{x}^r_{i_l} = (x^r_{i_l 1}, \ldots, x^r_{i_l p})$ with $x^c_{i_l j} = (a_{i_l j} + b_{i_l j})/2$, $x^r_{i_l j} = (b_{i_l j} - a_{i_l j})/2$, $y^c_{i_l} = (y_{L i_l} + y_{U i_l})/2$ and $y^r_{i_l} = (y_{U i_l} - y_{L i_l})/2$ $(1 \le l \le n_k,\ 1 \le j \le p)$.
For each cluster $k$ $(1 \le k \le K)$, it is assumed that the dependent variables $Y^c, Y^r$ and the independent variables $X^c_j, X^r_j$ $(1 \le j \le p)$ are related by the following linear regression relationships:

$$y^c_{i_l (k)} = \beta^c_{0(k)} + \sum_{j=1}^{p} \beta^c_{j(k)} x^c_{i_l j} + \epsilon^c_{i_l (k)} \qquad (8)$$

$$y^r_{i_l (k)} = \beta^r_{0(k)} + \sum_{j=1}^{p} \beta^r_{j(k)} x^r_{i_l j} + \epsilon^r_{i_l (k)}. \qquad (9)$$


The $K$ clusters and the corresponding $K$ center and $K$ range linear regression models are obtained by an algorithm that locally optimizes a suitable objective function through an iterative optimization technique. The objective function computes the total within-cluster sum-of-squares deviation, given by:

$$S_{iCLR} = \sum_{k=1}^{K} \sum_{e_{i_l} \in P_k} \left[ (\epsilon^c_{i_l (k)})^2 + (\epsilon^r_{i_l (k)})^2 \right] = \sum_{k=1}^{K} \sum_{e_{i_l} \in P_k} \left[ y^c_{i_l (k)} - \left( \beta^c_{0(k)} + \sum_{j=1}^{p} \beta^c_{j(k)} x^c_{i_l j} \right) \right]^2 + \sum_{k=1}^{K} \sum_{e_{i_l} \in P_k} \left[ y^r_{i_l (k)} - \left( \beta^r_{0(k)} + \sum_{j=1}^{p} \beta^r_{j(k)} x^r_{i_l j} \right) \right]^2 = \sum_{k=1}^{K} \left( \mathbf{y}^c_{(k)} - \mathbf{X}^c_{(k)} \boldsymbol{\beta}^c_{(k)} \right)^\top \left( \mathbf{y}^c_{(k)} - \mathbf{X}^c_{(k)} \boldsymbol{\beta}^c_{(k)} \right) + \sum_{k=1}^{K} \left( \mathbf{y}^r_{(k)} - \mathbf{X}^r_{(k)} \boldsymbol{\beta}^r_{(k)} \right)^\top \left( \mathbf{y}^r_{(k)} - \mathbf{X}^r_{(k)} \boldsymbol{\beta}^r_{(k)} \right), \qquad (10)$$

where

$$\mathbf{X}^c_{(k)} = \begin{pmatrix} 1 & x^c_{i_1 1} & \cdots & x^c_{i_1 p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x^c_{i_{n_k} 1} & \cdots & x^c_{i_{n_k} p} \end{pmatrix}_{(n_k \times (p+1))}, \qquad \boldsymbol{\beta}^c_{(k)} = \begin{pmatrix} \beta^c_{0(k)} \\ \vdots \\ \beta^c_{p(k)} \end{pmatrix}_{((p+1) \times 1)}, \qquad \mathbf{y}^c_{(k)} = \begin{pmatrix} y^c_{i_1} \\ \vdots \\ y^c_{i_{n_k}} \end{pmatrix}_{(n_k \times 1)};$$

$$\mathbf{X}^r_{(k)} = \begin{pmatrix} 1 & x^r_{i_1 1} & \cdots & x^r_{i_1 p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x^r_{i_{n_k} 1} & \cdots & x^r_{i_{n_k} p} \end{pmatrix}_{(n_k \times (p+1))}, \qquad \boldsymbol{\beta}^r_{(k)} = \begin{pmatrix} \beta^r_{0(k)} \\ \vdots \\ \beta^r_{p(k)} \end{pmatrix}_{((p+1) \times 1)}, \qquad \mathbf{y}^r_{(k)} = \begin{pmatrix} y^r_{i_1} \\ \vdots \\ y^r_{i_{n_k}} \end{pmatrix}_{(n_k \times 1)}.$$

The algorithm alternates between two steps, the fitting step that provides the center and range linear regression models, and the assignment step that provides the clusters, until convergence, when there are no more assignment changes of objects between clusters. These steps are described below.

2.3.1. Step 1: best fitting

In this step, the partition of $E$ into $K$ clusters is fixed. With the aim of estimating the parameter vectors $\boldsymbol{\beta}^c_{(k)}$ and $\boldsymbol{\beta}^r_{(k)}$ $(1 \le k \le K)$ that minimize $S_{iCLR}$, Eq. (10) is differentiated with respect to these parameter vectors and the results are set equal to zero. The least squares estimators of $\boldsymbol{\beta}^c_{(k)}$ and $\boldsymbol{\beta}^r_{(k)}$ are the solutions of two independent systems, each with $(p+1)$ normal equations, given by:

$$\hat{\boldsymbol{\beta}}^c_{(k)} = \left( (\mathbf{X}^c_{(k)})^\top \mathbf{X}^c_{(k)} \right)^{-1} (\mathbf{X}^c_{(k)})^\top \mathbf{y}^c_{(k)}, \qquad \hat{\boldsymbol{\beta}}^r_{(k)} = \left( (\mathbf{X}^r_{(k)})^\top \mathbf{X}^r_{(k)} \right)^{-1} (\mathbf{X}^r_{(k)})^\top \mathbf{y}^r_{(k)}. \qquad (11)$$

2.3.2. Step 2: best assignment

In this step, the parameter estimate vectors $\hat{\boldsymbol{\beta}}^c_{(k)}$ and $\hat{\boldsymbol{\beta}}^r_{(k)}$ $(1 \le k \le K)$ are kept fixed. The optimal clusters $P_k$, which minimize the criterion $S_{iCLR}$, are obtained according to the following assignment rule:

$$P_k = \left\{ e_i \in E : (\epsilon^c_{i(k)})^2 + (\epsilon^r_{i(k)})^2 = \min_{h=1,\ldots,K} \left[ (\epsilon^c_{i(h)})^2 + (\epsilon^r_{i(h)})^2 \right] \right\}. \qquad (12)$$

Thus, the observation $e_i$ is assigned to cluster $P_k$ if the sum of squared errors in the center plus that in the range is minimal for this cluster, when compared with the sums of squared errors for $e_i$ evaluated by the linear models of the other $K-1$ clusters. In other words, the observation $e_i$ $(i = 1, \ldots, n)$ is assigned to the cluster $P_k$ with the minimal sum of squared errors.

Finally, the predicted value provided by the $k$-th center and the $k$-th range linear regression models $(1 \le k \le K)$ for a new example, described by $\mathbf{z} = (\mathbf{x}, y)$, is given by $\hat{y}_L = \hat{y}^c - \hat{y}^r$ and $\hat{y}_U = \hat{y}^c + \hat{y}^r$, where $\hat{y}^c = (\tilde{\mathbf{x}}^c)^\top \hat{\boldsymbol{\beta}}^c_{(k)}$, $\hat{y}^r = (\tilde{\mathbf{x}}^r)^\top \hat{\boldsymbol{\beta}}^r_{(k)}$, $(\tilde{\mathbf{x}}^c)^\top = (1, x^c_1, \ldots, x^c_p)$, $(\tilde{\mathbf{x}}^r)^\top = (1, x^r_1, \ldots, x^r_p)$, $\hat{\boldsymbol{\beta}}^c_{(k)} = (\hat{\beta}^c_{0(k)}, \hat{\beta}^c_{1(k)}, \ldots, \hat{\beta}^c_{p(k)})^\top$ and $\hat{\boldsymbol{\beta}}^r_{(k)} = (\hat{\beta}^r_{0(k)}, \hat{\beta}^r_{1(k)}, \ldots, \hat{\beta}^r_{p(k)})^\top$.
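For concreteness, a compact sketch of the iCLR alternation follows, in Python with NumPy; it is an illustrative reconstruction under the paper's definitions (Eqs. (11) and (12)), not the authors' implementation, and for brevity it does not handle clusters that become empty.

```python
import numpy as np

def ols(X, y):
    """OLS with intercept; the per-cluster least squares fit of Eq. (11)."""
    Xt = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xt, y, rcond=None)
    return beta

def iclr(Xc, yc, Xr, yr, K, n_iter=100, seed=0):
    """Alternate best fitting (Eq. (11)) and best assignment (Eq. (12))."""
    rng = np.random.default_rng(seed)
    n = len(yc)
    labels = rng.integers(K, size=n)            # random initial partition
    for _ in range(n_iter):
        betas = [(ols(Xc[labels == k], yc[labels == k]),
                  ols(Xr[labels == k], yr[labels == k])) for k in range(K)]
        # squared error of every object under every cluster's models
        err = np.empty((n, K))
        for k, (bc, br) in enumerate(betas):
            ec = yc - (bc[0] + Xc @ bc[1:])
            er = yr - (br[0] + Xr @ br[1:])
            err[:, k] = ec ** 2 + er ** 2
        new_labels = err.argmin(axis=1)          # assignment rule, Eq. (12)
        if np.array_equal(new_labels, labels):   # no reassignment: converged
            break
        labels = new_labels
    return labels, betas
```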

2.4. The iCNLR method

This section presents the interval Center and Range Clusterwise Nonlinear Regression method for interval-valued data (iCNLR). It is based on the dynamic clustering algorithm [19] and the center and range nonlinear regression model for interval-valued data [39].
Let $\mathcal{H} = \{f_1, \ldots, f_H\}$ be a set of nonlinear and differentiable functions and let $\mathcal{O} = \{o_1, \ldots, o_O\}$ be a set of optimization methods (such as BFGS, SANN, ...) required to estimate the parameters of the nonlinear functions belonging to $\mathcal{H}$.
Consider a partition $P = (P_1, \ldots, P_K)$ of $E$ into $K$ clusters and, for each cluster $P_k$ $(1 \le k \le K)$, let $n_k = |P_k|$ and let $e_{i_l} \in P_k$ be described by two feature vectors $\mathbf{w}_{i_l} = (\mathbf{x}^c_{i_l}, y^c_{i_l})$ and $\mathbf{r}_{i_l} = (\mathbf{x}^r_{i_l}, y^r_{i_l})$, where $\mathbf{x}^c_{i_l} = (x^c_{i_l 1}, \ldots, x^c_{i_l p})$ and $\mathbf{x}^r_{i_l} = (x^r_{i_l 1}, \ldots, x^r_{i_l p})$ with $x^c_{i_l j} = (a_{i_l j} + b_{i_l j})/2$, $x^r_{i_l j} = (b_{i_l j} - a_{i_l j})/2$, $y^c_{i_l} = (y_{L i_l} + y_{U i_l})/2$ and $y^r_{i_l} = (y_{U i_l} - y_{L i_l})/2$ $(1 \le l \le n_k,\ 1 \le j \le p)$.
For each cluster $k$ $(1 \le k \le K)$, it is assumed that the dependent variables $Y^c, Y^r$ and the independent variables $X^c_j, X^r_j$ $(1 \le j \le p)$ are related by the following nonlinear regression relationships, respectively:

$$y^c_{i_l (k)} = f^c_{(k)}\left( \mathbf{x}^c_{i_l}; \boldsymbol{\beta}^c_{(k)} \right) + \epsilon^c_{i_l (k)} \qquad (13)$$

$$y^r_{i_l (k)} = f^r_{(k)}\left( \mathbf{x}^r_{i_l}; \boldsymbol{\beta}^r_{(k)} \right) + \epsilon^r_{i_l (k)}, \qquad (14)$$

where $f^c_{(k)}, f^r_{(k)} \in \mathcal{H}$, and $\boldsymbol{\beta}^c_{(k)}$ and $\boldsymbol{\beta}^r_{(k)}$ are the parameter vectors of $f^c_{(k)}$ and $f^r_{(k)}$, respectively.
The $K$ clusters and the corresponding $K$ center and $K$ range nonlinear regression models are obtained by an algorithm that locally optimizes a suitable objective function through an iterative optimization technique. The objective function again computes the total within-cluster sum-of-squares deviation, given by:

$$S_{iCNLR} = \sum_{k=1}^{K} \sum_{e_{i_l} \in P_k} \left[ (\epsilon^c_{i_l (k)})^2 + (\epsilon^r_{i_l (k)})^2 \right] = \sum_{k=1}^{K} \sum_{e_{i_l} \in P_k} \left\{ \left[ y^c_{i_l (k)} - f^c_{(k)}\left( \mathbf{x}^c_{i_l}; \boldsymbol{\beta}^c_{(k)} \right) \right]^2 + \left[ y^r_{i_l (k)} - f^r_{(k)}\left( \mathbf{x}^r_{i_l}; \boldsymbol{\beta}^r_{(k)} \right) \right]^2 \right\}. \qquad (15)$$

Starting from an initial solution, the algorithm alternates between two steps, the fitting step that provides the best pair of center and range nonlinear regression models, and the assignment step that provides the clusters, until convergence, when there are no more assignment changes of objects between clusters. These steps are described below.

2.4.1. Step 1: best fitting

In this step, the partition of $E$ into $K$ clusters is fixed. With the aim of estimating the parameter vectors $\boldsymbol{\beta}^c_{(k)}$ and $\boldsymbol{\beta}^r_{(k)}$ $(1 \le k \le K)$ of the nonlinear functions $f^c_{(k)}$ and $f^r_{(k)}$ that minimize $S_{iCNLR}$, the normal equations for the center and the range nonlinear models can be obtained as

$$\sum_{k=1}^{K} \sum_{e_{i_l} \in P_k} \left[ y^c_{i_l (k)} - f^c_{(k)}\left( \mathbf{x}^c_{i_l}; \hat{\boldsymbol{\beta}}^c_{(k)} \right) \right] \left[ \frac{\partial f^c_{(k)}\left( \mathbf{x}^c_{i_l}; \boldsymbol{\beta}^c_{(k)} \right)}{\partial \beta^c_{j(k)}} \right]_{\boldsymbol{\beta}^c = \hat{\boldsymbol{\beta}}^c} = 0 \qquad (16)$$

$$\sum_{k=1}^{K} \sum_{e_{i_l} \in P_k} \left[ y^r_{i_l (k)} - f^r_{(k)}\left( \mathbf{x}^r_{i_l}; \hat{\boldsymbol{\beta}}^r_{(k)} \right) \right] \left[ \frac{\partial f^r_{(k)}\left( \mathbf{x}^r_{i_l}; \boldsymbol{\beta}^r_{(k)} \right)}{\partial \beta^r_{j(k)}} \right]_{\boldsymbol{\beta}^r = \hat{\boldsymbol{\beta}}^r} = 0. \qquad (17)$$

The solutions of the normal Eqs. (16) and (17) can be difficult to obtain analytically and, very often, there is no closed algebraic solution for the estimators. To deal with these difficulties, some numerical optimization methods are considered: BFGS, Conjugate Gradient (CG) and Simulated Annealing (SANN).

2.4.2. Step 2: best assignment

In this step, the parameter estimate vectors $\hat{\boldsymbol{\beta}}^c_{(k)}$ and $\hat{\boldsymbol{\beta}}^r_{(k)}$ of the nonlinear functions $f^c_{(k)}$ and $f^r_{(k)}$ $(1 \le k \le K)$ are kept fixed. The optimal clusters $P_k$, which minimize the criterion $S_{iCNLR}$, are obtained according to the following assignment rule:

$$P_k = \left\{ e_i \in E : (\epsilon^c_{i(k)})^2 + (\epsilon^r_{i(k)})^2 = \min_{h=1,\ldots,K} \left[ (\epsilon^c_{i(h)})^2 + (\epsilon^r_{i(h)})^2 \right] \right\}. \qquad (18)$$


Therefore, the observation $e_i$ will be assigned to the cluster $P_k$ if the sum of squared errors in the center plus that in the range is minimal for this cluster, when compared with the sums of squared errors for $e_i$ computed by the nonlinear models of the other $K-1$ clusters. In other words, the observation $e_i$ $(1 \le i \le n)$ will be assigned to the cluster $P_k$ with the minimal sum of squared errors.
Finally, the predicted value provided by the $k$-th interval center and range nonlinear regression models $(1 \le k \le K)$ for a new example, described by $\mathbf{z} = (\mathbf{x}, y)$, is given by $\hat{y}_L = \hat{y}^c - \hat{y}^r$ and $\hat{y}_U = \hat{y}^c + \hat{y}^r$, where $\hat{y}^c = f^c_{(k)}\left( \tilde{\mathbf{x}}^c; \hat{\boldsymbol{\beta}}^c_{(k)} \right)$ and $\hat{y}^r = f^r_{(k)}\left( \tilde{\mathbf{x}}^r; \hat{\boldsymbol{\beta}}^r_{(k)} \right)$, with $(\tilde{\mathbf{x}}^c)^\top = (1, x^c_1, \ldots, x^c_p)$, $(\tilde{\mathbf{x}}^r)^\top = (1, x^r_1, \ldots, x^r_p)$, $\hat{\boldsymbol{\beta}}^c_{(k)} = (\hat{\beta}^c_{0(k)}, \hat{\beta}^c_{1(k)}, \ldots, \hat{\beta}^c_{q^c(k)})^\top$ and $\hat{\boldsymbol{\beta}}^r_{(k)} = (\hat{\beta}^r_{0(k)}, \hat{\beta}^r_{1(k)}, \ldots, \hat{\beta}^r_{q^r(k)})^\top$.

2.4.3. iCNLR algorithm

Algorithm 1 summarizes the interval center and range clusterwise nonlinear regression algorithm for interval-valued data. From an initial solution, it alternates two steps until convergence. The first step (model fitting) computes the parameter estimate vectors of the nonlinear functions associated with each cluster, on both the center and the range. The second step (assignment) updates the clusters. Note that in the first step, for each cluster, if one heuristic does not converge, another one is triggered, and so on. Moreover, if for a given cluster none of the heuristics converges for a parameter estimate, the model of the previous iteration is maintained and the next assignment step is performed, without harming the minimization of the criterion $S_{iCNLR}$.

Algorithm 1 iCNLR Algorithm

Require: the data set $D = (\mathbf{t}_1, \ldots, \mathbf{t}_n)$; the number of clusters $K$; the set of nonlinear and differentiable continuous functions $\mathcal{H} = \{f_1, \ldots, f_H\}$; the set of optimization methods $\mathcal{O} = \{o_1, \ldots, o_O\}$.
Ensure: the parameter estimate vectors $\hat{\boldsymbol{\beta}}^c_{(k)}$ and $\hat{\boldsymbol{\beta}}^r_{(k)}$ of the nonlinear functions $f^c_{(k)}, f^r_{(k)} \in \mathcal{H}$ and the partition $P = (P_1, \ldots, P_K)$.

Initialization:
  Set $t \leftarrow 0$.
  Randomly assign the objects $e_i$ $(1 \le i \le n)$ to the clusters $P_k$ $(1 \le k \le K)$ to form the initial partition $P^{(0)} = (P^{(0)}_1, \ldots, P^{(0)}_K)$.
  Randomly select $f^{c,(0)}_{(k)}, f^{r,(0)}_{(k)} \in \mathcal{H}$ $(1 \le k \le K)$.
  Randomly initialize the parameter vectors $\hat{\boldsymbol{\beta}}^{c,(0)}_{(k)}$ and $\hat{\boldsymbol{\beta}}^{r,(0)}_{(k)}$ of the nonlinear functions $f^{c,(0)}_{(k)}$ and $f^{r,(0)}_{(k)}$ $(1 \le k \le K)$.

Step 1: Model fitting
  Set $t \leftarrow t + 1$.
  for $1 \le k \le K$ do
    Set $f^{c,(t)}_{(k)} = f^{c,(t-1)}_{(k)}$ and $f^{r,(t)}_{(k)} = f^{r,(t-1)}_{(k)}$; set $\hat{\boldsymbol{\beta}}^{c,(t)}_{(k)} = \hat{\boldsymbol{\beta}}^{c,(t-1)}_{(k)}$ and $\hat{\boldsymbol{\beta}}^{r,(t)}_{(k)} = \hat{\boldsymbol{\beta}}^{r,(t-1)}_{(k)}$, the corresponding parameter vectors of $f^{c,(t)}_{(k)}$ and $f^{r,(t)}_{(k)}$.
  end for
  for $1 \le k \le K$ do
    for $1 \le h \le |\mathcal{H}|$ do
      for $1 \le o \le |\mathcal{O}|$ do
        Apply the optimization heuristic $o \in \mathcal{O}$ to compute the parameter estimate vector $\hat{\boldsymbol{\beta}}^{c,(t)}_{ho(k)}$ of the nonlinear function $f^{c,(t)}_{ho(k)}$ that minimizes Eq. (16).
        if heuristic $o$ converged then store $\hat{\boldsymbol{\beta}}^{c,(t)}_{ho(k)}$ of the nonlinear function $f^{c,(t)}_{ho(k)}$ and break.
      end for
      if none of the $o \in \mathcal{O}$ converged then set $f^{c,(t)}_{h(k)} = f^{c,(t)}_{(k)}$ and $\hat{\boldsymbol{\beta}}^{c,(t)}_{h(k)} = \hat{\boldsymbol{\beta}}^{c,(t)}_{(k)}$; else set $f^{c,(t)}_{h(k)} = f^{c,(t)}_{ho(k)}$ and $\hat{\boldsymbol{\beta}}^{c,(t)}_{h(k)} = \hat{\boldsymbol{\beta}}^{c,(t)}_{ho(k)}$.
    end for
    for $1 \le h \le |\mathcal{H}|$ do
      for $1 \le o \le |\mathcal{O}|$ do
        Apply the optimization heuristic $o \in \mathcal{O}$ to compute the parameter estimate vector $\hat{\boldsymbol{\beta}}^{r,(t)}_{ho(k)}$ of the nonlinear function $f^{r,(t)}_{ho(k)}$ that minimizes Eq. (17).
        if heuristic $o$ converged then store $\hat{\boldsymbol{\beta}}^{r,(t)}_{ho(k)}$ of the nonlinear function $f^{r,(t)}_{ho(k)}$ and break.
      end for
      if none of the $o \in \mathcal{O}$ converged then set $f^{r,(t)}_{h(k)} = f^{r,(t)}_{(k)}$ and $\hat{\boldsymbol{\beta}}^{r,(t)}_{h(k)} = \hat{\boldsymbol{\beta}}^{r,(t)}_{(k)}$; else set $f^{r,(t)}_{h(k)} = f^{r,(t)}_{ho(k)}$ and $\hat{\boldsymbol{\beta}}^{r,(t)}_{h(k)} = \hat{\boldsymbol{\beta}}^{r,(t)}_{ho(k)}$.
    end for
    Set $f^{c,(t)}_{(k)} = f^{c,(t)}_{h^c(k)}$, $f^{r,(t)}_{(k)} = f^{r,(t)}_{h^r(k)}$, $\hat{\boldsymbol{\beta}}^{c,(t)}_{(k)} = \hat{\boldsymbol{\beta}}^{c,(t)}_{h^c(k)}$ and $\hat{\boldsymbol{\beta}}^{r,(t)}_{(k)} = \hat{\boldsymbol{\beta}}^{r,(t)}_{h^r(k)}$, where $h^c = \arg\min_{1 \le s \le H} \sum_{e_i \in P_k} \left[ y^c_i - f^{c,(t)}_{s(k)}(\mathbf{x}^c_i; \boldsymbol{\beta}^{c,(t)}_{s(k)}) \right]^2$ and $h^r = \arg\min_{1 \le s \le H} \sum_{e_i \in P_k} \left[ y^r_i - f^{r,(t)}_{s(k)}(\mathbf{x}^r_i; \boldsymbol{\beta}^{r,(t)}_{s(k)}) \right]^2$.
  end for

Step 2: Assignment
  Set $P^{(t)} = P^{(t-1)}$ and test $\leftarrow 0$.
  for $1 \le i \le n$ do
    Find $m$ such that $e_i \in P^{(t)}_m$.
    Find the winner cluster $P_k$ such that $k = \arg\min_{1 \le s \le K} \left[ (\hat{\epsilon}^{c,(t)}_{i(s)})^2 + (\hat{\epsilon}^{r,(t)}_{i(s)})^2 \right]$.
    if $k \ne m$ then test $\leftarrow 1$; $P^{(t)}_k = P^{(t)}_k \cup \{e_i\}$; $P^{(t)}_m = P^{(t)}_m \setminus \{e_i\}$.
  end for

Stopping criterion:
  if test $= 0$ then stop; else go to Step 1.
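To complement the pseudocode, the following is a minimal Python sketch of the inner selection performed by the model-fitting step for one cluster on the centers (the range fit is symmetric): for each candidate function, the optimizers are tried in order and the first converged fit is kept, then the function with the smallest sum of squared errors wins. All names are illustrative, and SciPy's minimize stands in for the set $\mathcal{O}$.

```python
import numpy as np
from scipy.optimize import minimize

def fit_best_pair(x, y, functions, optimizers, beta_inits, fallback):
    """For one cluster, pick the function in `functions` whose converged
    least squares fit yields the smallest SSE; within each function, try
    the `optimizers` in order and keep the first that converges.

    functions: list of callables f(x, beta); beta_inits: matching starts.
    fallback: (f, beta) from the previous iteration, returned if nothing
    converges, as in Algorithm 1.
    """
    best = None
    for f, beta0 in zip(functions, beta_inits):
        sse = lambda beta, f=f: np.sum((y - f(x, beta)) ** 2)
        for method in optimizers:             # e.g. ("BFGS", "CG")
            res = minimize(sse, beta0, method=method)
            if res.success:
                if best is None or res.fun < best[2]:
                    best = (f, res.x, res.fun)
                break  # first converged optimizer wins for this function
    return fallback if best is None else (best[0], best[1])
```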

2.4.4. Convergence properties of the iCNLR algorithm

The iCNLR algorithm looks for a partition $P^* = (P^*_1, \ldots, P^*_K)$ of $E$ into $K$ non-empty clusters and a $K$-dimensional vector of representatives $G^* = (\mathbf{g}^*_1, \ldots, \mathbf{g}^*_K)$, with

$$\mathbf{g}_{(k)} = \left( \left( f^c_{(k)}, \boldsymbol{\beta}^c_{(k)} \right), \left( f^r_{(k)}, \boldsymbol{\beta}^r_{(k)} \right) \right),$$

where $\boldsymbol{\beta}^c_{(k)}$ and $\boldsymbol{\beta}^r_{(k)}$ are the parameter vectors associated with the continuous nonlinear functions $f^c_{(k)}$ and $f^r_{(k)}$ belonging to $\mathcal{H}$. Thus, $G^*$ and $P^*$ represent the values that minimize the criterion $S_{iCNLR}$:

$$S_{iCNLR}(G^*, P^*) = \min \left\{ S_{iCNLR}(G, P) : G \in \mathbb{L}^K,\ P \in \mathbb{P}_K \right\}, \qquad (19)$$

where $\mathbb{P}_K$ is the set of all partitions of $E$ into $K$ non-empty clusters, with $P_k \in (\wp(E) - \emptyset)$ and $\wp(E)$ the power set of $E$; $\mathbb{L}$ represents the space of prototypes, such that $\mathbf{g}_{(k)} \in \mathbb{L}$ $(1 \le k \le K)$ and $G \in \mathbb{L}^K$. In this paper, $\mathbb{L} = \mathcal{H} \times \mathbb{R}^{p+1} \times \mathcal{H} \times \mathbb{R}^{p+1}$.
The convergence properties of this kind of algorithm can be established by studying the two series [19] $v_t = (G^{(t)}, P^{(t)}) \in \mathbb{L}^K \times \mathbb{P}_K$ and $u_t = S_{iCNLR}(v_t) = S_{iCNLR}(G^{(t)}, P^{(t)})$, $t = 0, 1, \ldots$. Starting from the initial term $v_0 = (G^{(0)}, P^{(0)})$, the algorithm computes the terms of the series until convergence, when the criterion $S_{iCNLR}$ assumes a stationary value.

Proposition 1. The series $u_t = S_{iCNLR}(v_t)$ decreases at each iteration and converges.

Proof. First, we show that the following inequalities hold, i.e., the series decreases at each iteration:

$$S_{iCNLR}(G^{(t)}, P^{(t)}) \ge S_{iCNLR}(G^{(t+1)}, P^{(t)}) \ge S_{iCNLR}(G^{(t+1)}, P^{(t+1)}). \qquad (20)$$

The left-hand side holds because, for a fixed partition $P^{(t)}$, $\forall h \in \mathcal{H}$, either

• $f^{c,(t+1)}_{h(k)} = f^{c,(t)}_{(k)}$, $f^{r,(t+1)}_{h(k)} = f^{r,(t)}_{(k)}$, $\hat{\boldsymbol{\beta}}^{c,(t+1)}_{h(k)} = \hat{\boldsymbol{\beta}}^{c,(t)}_{(k)}$ and $\hat{\boldsymbol{\beta}}^{r,(t+1)}_{h(k)} = \hat{\boldsymbol{\beta}}^{r,(t)}_{(k)}$, i.e., the nonlinear functions $f^{c,(t+1)}_{h(k)}, f^{r,(t+1)}_{h(k)}$ and the corresponding parameter estimate vectors $\hat{\boldsymbol{\beta}}^{c,(t+1)}_{h(k)}, \hat{\boldsymbol{\beta}}^{r,(t+1)}_{h(k)}$ are equal to the best ones computed in the previous step $(t)$; or
• $f^{c,(t+1)}_{h(k)} = f^{c,(t+1)}_{ho(k)}$, $f^{r,(t+1)}_{h(k)} = f^{r,(t+1)}_{ho(k)}$, $\hat{\boldsymbol{\beta}}^{c,(t+1)}_{h(k)} = \hat{\boldsymbol{\beta}}^{c,(t+1)}_{ho(k)}$ and $\hat{\boldsymbol{\beta}}^{r,(t+1)}_{h(k)} = \hat{\boldsymbol{\beta}}^{r,(t+1)}_{ho(k)}$, i.e., the nonlinear functions $f^{c,(t+1)}_{h(k)}, f^{r,(t+1)}_{h(k)}$ and the corresponding parameter estimate vectors $\hat{\boldsymbol{\beta}}^{c,(t+1)}_{h(k)}, \hat{\boldsymbol{\beta}}^{r,(t+1)}_{h(k)}$ minimize Eqs. (16) and (17), respectively.

Moreover, for $(1 \le k \le K,\ 1 \le h \le H)$, $f^{c,(t+1)}_{(k)} = f^{c,(t+1)}_{h^c(k)}$, $f^{r,(t+1)}_{(k)} = f^{r,(t+1)}_{h^r(k)}$, $\hat{\boldsymbol{\beta}}^{c,(t+1)}_{(k)} = \hat{\boldsymbol{\beta}}^{c,(t+1)}_{h^c(k)}$ and $\hat{\boldsymbol{\beta}}^{r,(t+1)}_{(k)} = \hat{\boldsymbol{\beta}}^{r,(t+1)}_{h^r(k)}$, where $h^c = \arg\min_{1 \le s \le H} \sum_{e_i \in P_k} \left[ y^c_i - f^{c,(t+1)}_{s(k)}(\mathbf{x}^c_i; \boldsymbol{\beta}^{c,(t+1)}_{s(k)}) \right]^2$ and $h^r = \arg\min_{1 \le s \le H} \sum_{e_i \in P_k} \left[ y^r_i - f^{r,(t+1)}_{s(k)}(\mathbf{x}^r_i; \boldsymbol{\beta}^{r,(t+1)}_{s(k)}) \right]^2$, i.e., the selected functions and the corresponding parameter estimate vectors minimize the sum of squared errors.

Consequently, at iteration $t+1$, the value of the objective function is reduced with regard to the previous iteration. Thus, because both the selected functions and the least squares estimates minimize the sum of squared errors, the left-hand side of inequality (20),

$$S_{iCNLR}(G^{(t)}, P^{(t)}) \ge S_{iCNLR}(G^{(t+1)}, P^{(t)}),$$

holds.
The right-hand side of the inequality in Eq. (20) holds because

$$P^{(t+1)} = \arg\min_{P = (P_1, \ldots, P_K) \in \mathbb{P}_K} \sum_{k=1}^{K} \sum_{e_i \in P_k} \left\{ \left[ y^c_i - f^{c,(t+1)}_{(k)}\left( \mathbf{x}^c_i; \hat{\boldsymbol{\beta}}^{c,(t+1)}_{(k)} \right) \right]^2 + \left[ y^r_i - f^{r,(t+1)}_{(k)}\left( \mathbf{x}^r_i; \hat{\boldsymbol{\beta}}^{r,(t+1)}_{(k)} \right) \right]^2 \right\}. \qquad (21)$$

There is no other partition $P$ that makes $S_{iCNLR}$ decrease more than $P^{(t+1)}$, which is unique. Finally, it can be concluded that the series $u_t$ decreases and is bounded ($S_{iCNLR}(v_t) \ge 0$); therefore, it converges. □

 
Proposition 2. The series $v_t = (G^{(t)}, P^{(t)})$ converges.

Proof. Assume that stationarity of the series $u_t$ is reached at iteration $t = T$. Then $u_T = u_{T+1}$, and thus $S_{iCNLR}(v_T) = S_{iCNLR}(v_{T+1})$, i.e., $S_{iCNLR}(G^{(T)}, P^{(T)}) = S_{iCNLR}(G^{(T+1)}, P^{(T+1)})$. From Proposition 1, this equality can be written as

$$S_{iCNLR}\left( G^{(T)}, P^{(T)} \right) = S_{iCNLR}\left( G^{(T+1)}, P^{(T)} \right) = S_{iCNLR}\left( G^{(T+1)}, P^{(T+1)} \right). \qquad (22)$$

From the left-hand equality, $G^{(T+1)} = G^{(T)}$, because $G$ is unique in minimizing $S_{iCNLR}$ when the partition $P^{(T)}$ is fixed. From the right-hand equality, $P^{(T)} = P^{(T+1)}$, because $P$ is unique in minimizing $S_{iCNLR}$ when the vector of prototypes $G^{(T+1)}$ is fixed (if the minimum is not unique, $e_i$ is assigned to the cluster having the smallest index). Therefore, $v_T = v_{T+1}$. Finally, this conclusion holds $\forall t \ge T$, and hence $v_t = v_T,\ \forall t \ge T$. It follows that the series $v_t$ $(t = 0, 1, \ldots)$ converges. □

3. Experimental analysis

This section evaluates the performance of the iCNLR algorithm with respect to parameter estimation and the prediction of new observations, taking into account a wide range of scenarios. These scenarios consider the following important features of clusterwise problems: the relative position of the clusters (named here the configuration); the type of function used to generate the data (linear or nonlinear); and the number of a priori classes, $K \in \{2, 3\}$. With regard to the configurations, three types of cluster relationships were considered: disjoint clusters (D), intersecting or partially overlapping clusters (I), and overlapping clusters (U). Moreover, as the iCNLR algorithm considers a set of nonlinear regression models, a parameter estimation scheme will be briefly presented in order to avoid poor parameter estimates due to the use of a single fixed optimization method (Simulated Annealing, Conjugate Gradient or BFGS).
The performance of the parameter estimation algorithm will be evaluated in terms of the bias and mean square error of the parameter estimates. Another important task in the clusterwise regression problem is the predictive performance of the algorithm for new observations, which are not included in the parameter estimation process. In this case, we evaluate the performance of the proposed algorithm based on two assignment methods, the k-nearest neighbors (KNN) and the random assignment (Random), and on the ensemble method Stacked Regression (SR).
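As an illustration of routing new observations to clusters before prediction, a small sketch of a KNN assignment follows (scikit-learn assumed available); this is one plausible reading of the KNN rule, with illustrative names, and the stacked regression ensemble is not shown.

```python
from sklearn.neighbors import KNeighborsClassifier

def knn_assign(train_features, train_labels, new_features, n_neighbors=5):
    """Assign new observations to clusters by majority vote among the
    n_neighbors nearest training observations.

    train_features: (n, d) array, e.g. columns [x_center, x_half_range].
    train_labels:   (n,) cluster labels from the fitted clusterwise partition.
    """
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(train_features, train_labels)
    return knn.predict(new_features)  # one cluster index per new observation
```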

3.1. Synthetic data sets

This section illustrates the procedure used to build interval-valued data sets with a clusterwise structure, as well as the scenarios considered.
In all scenarios, a sample size of 50 observations per a priori class (cluster) was used in the experiments. The scenarios were built according to the following steps:

(i) choose a function $f$ to represent the relationship between the dependent and a set of independent interval-valued variables (linear or nonlinear);
(ii) choose values for the independent variables based on a uniform distribution, e.g., $X \sim U(a, b)$, with $a < b$;
(iii) choose values for the vector of parameters $\boldsymbol{\beta}$ based on a uniform distribution, e.g., $\beta \sim U(c, d)$, with $c < d$;
(iv) compute the values of the dependent variable $Y$ based on the function selected in step (i), e.g., $y_i = f(x_i; \boldsymbol{\beta})$, $i = 1, \ldots, n$;
(v) add a random noise $\epsilon_i \sim U(e_1, e_2)$, with $e_1 < e_2$, to the observations $y_i$ of step (iv) to obtain the final observed values of the dependent variable $Y$.

The scenarios presented in this work considered $\epsilon \sim U(0.01, 0.1)$. This small variation was added to the data irrespective of the scale on the y-axis, so that the inherent error present in the data is similar for all scenarios. Thus, a better idea of the error committed by the model can be obtained, without any confusion due to this perturbation being added to the data. The procedure above was used to generate the centers and ranges of the interval-valued variables; a code sketch of this procedure is given after Eq. (23).
For the sake of simplicity, we consider the case of a single interval explanatory variable $X$ in this experimental section. Regarding the nonlinear function used in the experimental setting, the following relationship was considered:

$$y_i = f(x_i; \beta_0, \beta_1) = x_i^{\beta_0 - 1} \exp\left( -\frac{x_i}{\beta_1} \right) \quad (1 \le i \le n). \qquad (23)$$

Thus, for a nonlinear relationship between the center variables $Y^c$ and $X^c$, $y^c_i = f^c(x^c_i; \beta^c_0, \beta^c_1) + \epsilon^c_i$ can be considered. Similarly, a nonlinear relationship between the half-range variables $Y^r$ and $X^r$ is considered, with $y^r_i = f^r(x^r_i; \beta^r_0, \beta^r_1) + \epsilon^r_i$.
This function is justified by its flexibility: depending on the choice of the parameters $\beta_0$ and $\beta_1$, as illustrated in Fig. 1, it generates different nonlinear relationships, which is useful for building scenarios with overlapping, disjoint or intersecting nonlinear clusters.
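A minimal sketch, in Python, of how one cluster of synthetic center data can be generated under steps (i)-(v) with the nonlinear function of Eq. (23) follows; the same routine applies to the half-ranges, and the names and seed are illustrative choices.

```python
import numpy as np

def gen_cluster(n, beta0, beta1, a, b, e1=0.01, e2=0.1, rng=None):
    """Generate n observations of one cluster following steps (i)-(v):
    X ~ U(a, b); y = f(x; beta0, beta1) + eps, with eps ~ U(e1, e2) and
    f the nonlinear function of Eq. (23)."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.uniform(a, b, size=n)                 # step (ii)
    y = x ** (beta0 - 1.0) * np.exp(-x / beta1)   # steps (i) and (iv)
    y += rng.uniform(e1, e2, size=n)              # step (v)
    return x, y

# Example: scenario-24-like centers (see Table 4), three overlapping
# clusters on X ~ U(0, 3) with 50 observations each
rng = np.random.default_rng(42)
params = [(4, 1), (2, 2), (1, 3)]   # (beta0, beta1) per cluster
data = [gen_cluster(50, b0, b1, 0, 3, rng=rng) for b0, b1 in params]
```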
Finally, the scenarios considered in this experimental section can be categorized into four groups, according to the presence of nonlinearity in the centers and in the ranges. For each group, scenarios with disjoint (D), intersecting (I) and overlapping (U) configurations were generated, according to the relative position of the classes on the independent variable $X$. The first group of scenarios presents linearity in the centers and ranges; the second presents linearity in the centers and nonlinearity in the ranges; the third, nonlinearity in the centers and linearity in the ranges; and the last group presents nonlinearity in both the centers and ranges of the intervals. Details about the parameter setups for these scenarios are presented below.

3.2. Scenario setups

Tables 1–4 present the parameter setup for each scenario according to: (i) the configuration of the cluster position for the
center and for the range of the intervals; (ii) the type of relationship between the dependent and independent variables and
(iii) the parameters used to generate the interval-valued observations in terms of center and range (dependent and indepen-
dent variables) for each cluster.
The scenarios 1 to 6 are shown in Table 1. They consider that both center and range independent and dependent variables
are related by linear functions. The aim is to evaluate the performance of the new clusterwise algorithm (iCNLR) in terms of
parameter estimates and assignment of new observations, with regards to the types of cluster relationships and in the pres-
ence of linearity on both centers and ranges.
Table 2 presents the setup for scenarios 7 to 12, where the center independent and dependent variables are related by a
linear function whereas the range independent and dependent variables are related by a nonlinear function.
Similarly, Table 3 presents the parameters for scenarios 13 to 18, where the center independent and dependent variables
are related by a nonlinear function and the range independent and dependent variables are related by a linear function.

Fig. 1. Several shapes assumed by expression (23) according to different values of the parameters $\beta_0$ and $\beta_1$.

Finally, Table 4 presents scenarios 19 to 24, where both the centers and ranges of the independent and dependent variables are related by nonlinear functions.
Fig. 2 illustrates some of these scenarios. The left and center columns of Fig. 2 provide the scatter plots of the real-valued variables representing the centers $(X^c, Y^c)$ and half-ranges $(X^r, Y^r)$ of the intervals, respectively. The right column of this figure is the scatter plot of the interval-valued variables, drawn as follows. Let $x = [x_L, x_U]$ and $y = [y_L, y_U]$ be two observed intervals of the variables $X$ and $Y$, respectively, and let the object $o$ be described by the bi-dimensional vector of intervals $\mathbf{z} = (x, y) = ([x_L, x_U], [y_L, y_U])$. The rectangle formed by these intervals is drawn in the scatter plot by connecting the four points $(x_L, y_L)$, $(x_U, y_L)$, $(x_L, y_U)$ and $(x_U, y_U)$.
The first row of this figure represents a clusterwise structure with 3 clusters, a linear relationship between $X$ and $Y$ (in the centers and in the ranges) and disjoint classes, corresponding to scenario 2. The second row represents a clusterwise structure with a linear relationship in the centers and a nonlinear relationship in the ranges, with partially overlapping clusters, corresponding to scenario 10. The third row represents scenario 24 and illustrates three overlapping clusters with nonlinearity in both the centers and ranges of the intervals. Overall, the proposed scenarios cover a wide range of clusterwise structures, allowing a thorough evaluation of the new iCNLR algorithm in terms of parameter estimation and the assignment of new observations to clusters.


Table 1
Clusterwise scenarios with the presence of a linear relationship in the center and in the ranges of the interval-valued variables.

Scenario | Configuration | Function | Parameters ($\beta_0$, $\beta_1$) | $X \sim U(a, b)$ | Classes
         | Center  Range | Center  Range | Center  Range | Center  Range |
1 D D Linear Linear (3,1) (3,1) (0,3) (0,3) 2
(0.5,1) (5,0.75) (4,8) (4,8)
2 D D Linear Linear (4,1) (6,2) (0,2) (0,2) 3
(2,2) (6,1) (3,5) (3,5)
(1,3) (1,3) (6,9) (6,9)
3 I I Linear Linear (3,1) (3,1) (0,3) (0,3) 2
(0.5,1) (5,0.5) (2,5) (2,5)
4 I I Linear Linear (4,1) (6,2) (0,4) (0,4) 3
(2,2) (6,1) (3,6) (3,6)
(1,3) (1,3) (5,8) (5,8)
5 U U Linear Linear (3,1) (3,1) (0,3) (0,3) 2
(0.5,1) (5,0.5) (0,3) (0,3)
6 U U Linear Linear (4,1) (6,2) (0,3) (0,3) 3
(2,2) (6,1) (0,3) (0,3)
(1,3) (1,3) (0,3) (0,3)

Table 2
Clusterwise scenarios with the presence of a linear relationship in the center and a nonlinear relationship in the ranges of the interval-valued variables.

Scenario | Configuration | Function | Parameters ($\beta_0$, $\beta_1$) | $X \sim U(a, b)$ | Classes
         | Center  Range | Center  Range | Center  Range | Center  Range |
7 D D Linear Nonlinear (3,1) (0.5,2) (0,3) (0,3) 2
(0.5,1) (1,3) (4,8) (4,8)
8 D D Linear Nonlinear (4,1) (0.5,1) (0,2) (0,6) 3
(2,2) (0.75,4) (3,5) (7,12)
(1,3) (0.75,6) (6,9) (14,20)
9 I I Linear Nonlinear (3,1) (0.5,2) (0,3) (0,4) 2
(0.5,1) (1,3) (2,5) (2,5)
10 I I Linear Nonlinear (4,1) (5,1) (0,4) (0,4) 3
(2,2) (0.75,4) (3,6) (2,8)
(1,3) (0.75,6) (5,8) (5,10)
11 U U Linear Nonlinear (3,1) (0.5,2) (0,3) (0,4) 2
(0.5,1) (1,3) (0,3) (0,4)
12 U U Linear Nonlinear (4,1) (0.5,1) (0,3) (0,10) 3
(2,2) (0.75,4) (0,3) (0,10)
(1,3) (0.75,6) (0,3) (0,10)

Table 3
Clusterwise scenarios with the presence of a nonlinear relationship in the center and a linear relationship in the range of the interval-valued variables.

Scenario | Configuration | Function | Parameters ($\beta_0$, $\beta_1$) | $X \sim U(a, b)$ | Classes
         | Center  Range | Center  Range | Center  Range | Center  Range |
13 D D Nonlinear Linear (0.5,2) (3,1) (0,3) (0,3) 2
(1,3) (0.5,1) (4,8) (4,8)
14 D D Nonlinear Linear (0.5,1) (4,1) (0,6) (0,2) 3
(0.75,4) (2,2) (7,12) (3,5)
(0.75,6) (1,3) (14,20) (6,9)
15 I I Nonlinear Linear (0.5,2) (3,1) (0,4) (0,5) 2
(1,3) (0.5,1) (2,5) (2,6)
16 I I Nonlinear Linear (0.5,1) (4,1) (0,10) (0,4) 3
(0.75,4) (2,2) (5,15) (3,6)
(0.75,6) (1,3) (10,20) (5,8)
17 U U Nonlinear Linear (0.5,2) (3,1) (0,4) (0,3) 2
(1,3) (0.5,1) (0,4) (0,3)
18 U U Nonlinear Linear (0.5,1) (4,1) (0,10) (0,3) 3
(0.75,4) (2,2) (0,10) (0,3)
(0.75,6) (1,3) (0,10) (0,3)


Table 4
Clusterwise scenarios with the presence of a nonlinear relationship in the center and in the range of the interval-valued variables.

Scenario | Configuration | Function | Parameters ($\beta_0$, $\beta_1$) | $X \sim U(a, b)$ | Classes
         | Center  Range | Center  Range | Center  Range | Center  Range |
19 D D Nonlinear Nonlinear (0.5,1) (0.5,2) (0,3) (0,3) 2
(0.75,3) (1,3) (4,8) (4,8)
20 D D Nonlinear Nonlinear (0.5,1) (0.5,1) (0,6) (0,6) 3
(0.5,3) (0.75,4) (7,12) (7,12)
(0.75,4) (0.75,6) (14,20) (14,20)
21 I I Nonlinear Nonlinear (0.5,1) (0.5,2) (0,4) (0,4) 2
(0.75,3) (1,3) (2,5) (2,5)
22 I I Nonlinear Nonlinear (0.5,1) (0.5,1) (0,4) (0,4) 3
(0.5,3) (0.75,4) (2,8) (2,8)
(0.75,4) (0.75,6) (5,10) (5,10)
23 U U Nonlinear Nonlinear (0.5,1) (0.5,2) (0,4) (0,4) 2
(0.75,3) (1,3) (0,4) (0,4)
24 U U Nonlinear Nonlinear (4,1) (0.5,1) (0,3) (0,10) 3
(2,2) (0.75,4) (0,3) (0,10)
(1,3) (0.75,6) (0,3) (0,10)

3.3. Parameter estimation schema

The non-convergence of the optimization method may yield poor parameter estimates for the iCNLR algorithm, leading the objective function $S_{iCNLR}$ to a poor local minimum. In order to avoid this problem, the proposed clusterwise algorithm considers a scheme that discards any heuristic whose parameter estimates come from a non-converged run of the optimization methods (SANN, CG and BFGS).
As pointed out in Section 2.4.3, the scheme consists of iterating over these heuristics. Thus, for each cluster, if one heuristic does not converge, another one is triggered, and so on. Moreover, if for a given cluster none of the heuristics converges for a parameter estimate, the model of the previous iteration is maintained and the next assignment step is performed, without loss to the minimization of the criterion $S_{iCNLR}$. The estimation scheme can be summarized as follows (see the sketch after this list):

(i) get the data and clusters of the previous iteration and use the first heuristic to obtain the center and range parameter estimates;
(ii) if the heuristic converges, use its estimates and proceed; otherwise, discard the current heuristic and use the next one;
(iii) repeat step (ii) until some heuristic converges;
(iv) if none of the heuristics converges, keep the models and estimates of the previous iteration and proceed.
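A minimal sketch of this fallback scheme follows, reusing the fit_nlm_component helper sketched in Section 2.2; the function name, the method list and the return convention are illustrative assumptions (SANN has no counterpart in SciPy's minimize, so Nelder-Mead stands in here as the last-resort heuristic).

```python
def fit_with_fallback(x, y, beta_init, methods=("BFGS", "CG", "Nelder-Mead"),
                      previous=None):
    """Try each optimization heuristic in turn (steps (i)-(iii)); if none
    converges, keep the previous iteration's estimates (step (iv)).

    previous: (beta, method) kept from the last iteration, or None.
    """
    for method in methods:
        beta, converged = fit_nlm_component(x, y, beta_init, method=method)
        if converged:
            return beta, method
    return previous  # step (iv): no heuristic converged
```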

3.3.1. Estimation
This section evaluates the performance of the iCNLR algorithm for the scenarios presented in Tables 1–4, taking into account the precision of the parameter estimates.
As the final solution provided by the iCNLR depends on the initial solution, the algorithm is run 100 times and the parameter estimates corresponding to the smallest value of $S_{iCNLR}$ are stored for each Monte Carlo replication. The metrics presented below were obtained from 1,000 Monte Carlo replicates for each configuration.
The results of Tables 5–7 are based on the following metrics: the average of the parameter estimates (APE) and the root mean square error (RMSE) for the center parameters $(\beta^c_0, \beta^c_1)$ and half-range parameters $(\beta^r_0, \beta^r_1)$ of their respective nonlinear functions $f^c(x^c; \beta^c_0, \beta^c_1)$ and $f^r(x^r; \beta^r_0, \beta^r_1)$. The nonlinear function of Eq. (23) is kept fixed, since it is intended to measure the quality of the estimation for a known function. The RMSE expressions for the parameter estimates are given by:

$$\mathrm{RMSE}_{\beta_0} = \sqrt{\frac{\sum_{i=1}^{N} \left( \hat{\beta}_0^{(i)} - \bar{\beta}_0 \right)^2}{N}}, \qquad \mathrm{RMSE}_{\beta_1} = \sqrt{\frac{\sum_{i=1}^{N} \left( \hat{\beta}_1^{(i)} - \bar{\beta}_1 \right)^2}{N}}, \qquad (24)$$

where $\hat{\beta}_0^{(i)}$ and $\hat{\beta}_1^{(i)}$ are the parameter estimates of $\beta_0$ and $\beta_1$ in the $i$-th replicate, respectively, $\bar{\beta}_0$ and $\bar{\beta}_1$ are the APEs of $\beta_0$ and $\beta_1$, respectively, and $N$ is the number of Monte Carlo replicates.
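As a small worked illustration, the APE and the RMSE of Eq. (24) can be computed from the Monte Carlo estimates of a single parameter as follows (illustrative sketch, NumPy assumed):

```python
import numpy as np

def ape_rmse(estimates):
    """APE and RMSE of Eq. (24) for one parameter.

    estimates: (N,) array of Monte Carlo estimates of a single parameter,
    e.g. beta_0 over N replicates.
    """
    ape = estimates.mean()                        # average of the estimates
    rmse = np.sqrt(np.mean((estimates - ape) ** 2))
    return ape, rmse
```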
The results (see Tables 5–8) demonstrate that the values of the APE are very close to the true parameters, which indicates that the proposed algorithm produces approximately unbiased estimates. In addition, the RMSE of the parameter estimates is small in all scenarios, even in those with nonlinearity and/or overlapping clusters. Thus, these results suggest that the proposed iCNLR algorithm presents a satisfactory performance in all scenarios considered.

Fig. 2. Clusterwise structures for synthetic interval-valued data sets: scenarios 2, 10 and 24, respectively.


Table 5
APE and RMSE of the parameter estimates of iCNLR algorithm. Scenarios 1 to 6: linear center and linear range.

Scenario | Clusters | APE: Center (βc0, βc1), Range (βr0, βr1) | RMSE: Center (βc0, βc1), Range (βr0, βr1)
1 2 3.0003 1.0003 3.0000 1.0002 0.0145 0.0085 0.0162 0.0089
0.5016 1.0002 5.0028 0.7496 0.0325 0.0055 0.0379 0.0063
2 3 3.9961 1.0087 5.9987 2.0012 0.0185 0.0213 0.0177 0.0156
2.0123 1.9970 5.9450 0.9866 0.0599 0.0149 0.0991 0.0245
1.0184 2.9975 1.0238 2.9972 0.0743 0.0099 0.0907 0.0117
3 2 3.0003 1.0001 2.9991 1.0006 0.0149 0.0090 0.0159 0.0092
0.4999 1.0002 5.0025 0.5007 0.0295 0.0082 0.0323 0.0090
4 3 3.9984 1.0008 6.0003 2.0000 0.0155 0.0068 0.0166 0.0073
1.9992 2.0002 5.9828 0.9963 0.0418 0.0092 0.0456 0.0099
1.0105 2.9984 1.0012 2.9998 0.0572 0.0089 0.0571 0.0088
5 2 2.9999 1.0001 3.0002 1.0001 0.0149 0.0087 0.0170 0.0101
0.4995 0.9999 5.0004 0.5004 0.0152 0.0088 0.0166 0.0095
6 3 4.0006 0.9999 5.9997 2.0002 0.0143 0.0083 0.0169 0.0097
2.0003 2.0000 5.9990 0.9999 0.0138 0.0079 0.0170 0.0096
1.0000 2.9997 0.9998 3.0003 0.0140 0.0081 0.0162 0.0095

Table 6
APE and RMSE of the parameter estimates of iCNLR algorithm. Scenarios 7 to 12: linear center and nonlinear range.

Scenario | Clusters | APE: Center (βc0, βc1), Range (βr0, βr1) | RMSE: Center (βc0, βc1), Range (βr0, βr1)
7 2 3.0009 1.0006 0.4998 2.0033 0.0145 0.0083 0.0055 0.0510
0.5013 0.9999 0.9978 2.9960 0.0144 0.0027 0.0057 0.0129
8 3 4.0004 0.9992 0.5153 0.7238 0.0142 0.0121 0.0362 0.3551
1.9953 2.0011 0.7444 3.9863 0.0764 0.0187 0.0147 0.0496
1.0388 2.9950 0.7443 5.9728 0.0983 0.0131 0.0087 0.0447
9 2 2.9995 0.9998 0.5003 2.0125 0.0144 0.0083 0.0081 0.0580
0.4996 1.0001 0.9972 2.9932 0.0266 0.0073 0.0093 0.0261
10 3 3.9999 1.0000 0.5056 1.0176 0.0146 0.0064 0.0128 0.0351
1.9891 2.0022 0.7732 4.1711 0.0486 0.0106 0.0381 0.2287
1.0724 2.9890 0.7474 5.9861 0.0939 0.0147 0.0108 0.0854
11 2 3.0000 0.9996 0.5017 2.0145 0.0143 0.0082 0.0087 0.0572
0.5006 0.9997 0.9979 2.9962 0.0149 0.0085 0.0129 0.0415
12 3 3.9992 1.0006 0.5266 0.8959 0.0155 0.0091 0.0507 0.2060
2.0011 1.9992 0.7521 4.0236 0.0173 0.0098 0.0164 0.0752
0.9995 3.0003 0.7399 5.9337 0.0150 0.0089 0.0193 0.1574

Table 7
APE and RMSE of the parameter estimates of iCNLR algorithm. Scenarios 13 to 18: nonlinear center and linear range.

Scenario | Clusters | APE: Center (βc0, βc1), Range (βr0, βr1) | RMSE: Center (βc0, βc1), Range (βr0, βr1)
13 2 0.5002 2.0007 2.999 0.9993 0.0057 0.0500 0.0145 0.0082
0.9958 2.9924 0.503 0.9995 0.0073 0.0155 0.0157 0.0029
14 3 0.5218 1.0264 4.0007 0.9992 0.0354 0.0477 0.0153 0.0134
0.7485 3.9978 2.0021 1.9997 0.0099 0.0346 0.0725 0.0179
0.7472 5.9860 1.0300 2.9959 0.0060 0.0322 0.0837 0.0111
15 2 0.5002 2.0105 3.0009 1.0004 0.0083 0.0633 0.0155 0.0087
0.9975 2.9956 0.5011 0.9998 0.0108 0.0308 0.0280 0.0079
16 3 0.5051 1.0202 4.0005 0.9998 0.0130 0.0394 0.0142 0.0064
0.7711 4.1510 1.9856 2.0030 0.0325 0.1971 0.0485 0.0105
0.7477 5.9917 1.0549 2.9917 0.0111 0.0891 0.0752 0.0118
17 2 0.5017 2.0146 3.0007 1.0003 0.0086 0.0608 0.0155 0.0089
0.9989 2.9978 0.5005 0.9999 0.0119 0.0378 0.0158 0.0089
18 3 0.5253 1.0367 4.0023 0.9990 0.0466 0.0654 0.0148 0.0087
0.7484 4.0017 2.0005 1.9997 0.0138 0.0577 0.0149 0.0088
0.7444 5.9616 1.0001 2.9998 0.0146 0.1246 0.0150 0.0086


Table 8
APE and RMSE of the parameter estimates of iCNLR algorithm. Scenarios 19 to 24: nonlinear center and nonlinear range.

Scenario | Clusters | APE: Center (βc0, βc1), Range (βr0, βr1) | RMSE: Center (βc0, βc1), Range (βr0, βr1)
19 2 0.4998 1.0009 0.5003 1.9986 0.0045 0.0245 0.0055 0.0529
0.7490 2.9981 0.9979 2.9964 0.0072 0.0179 0.0064 0.0147
20 3 0.4959 1.0013 0.5125 1.0270 0.0175 0.0374 0.0262 0.0484
0.5068 3.0166 0.7533 4.0152 0.0110 0.0249 0.0088 0.0338
0.7495 3.9988 0.7474 5.9880 0.0014 0.0034 0.0055 0.0287
21 2 0.5013 0.9998 0.5003 2.0140 0.0069 0.0281 0.0075 0.0623
0.7461 2.9979 0.9971 2.9939 0.0326 0.0991 0.0097 0.0257
22 3 0.4986 0.9970 0.5070 1.0212 0.0072 0.0282 0.0137 0.0378
0.5156 3.0512 0.7705 4.1425 0.0273 0.0797 0.0352 0.2047
0.7412 3.9706 0.7462 5.9813 0.0164 0.0608 0.0151 0.1256
23 2 0.5001 1.0015 0.5005 2.0076 0.0072 0.0278 0.0083 0.0602
0.7496 2.9993 0.9983 2.9967 0.0068 0.0252 0.0127 0.0407
24 3 4.0035 0.9998 0.5328 1.0381 0.0744 0.0264 0.0611 0.0688
1.9948 2.0110 0.7430 3.9790 0.0438 0.0628 0.0163 0.0628
1.0015 2.9984 0.7392 5.9237 0.0103 0.0790 0.0186 0.1546

In conclusion, the iCNLR algorithm is able to obtain good parameter estimates for the different scenarios and functions
(linear or nonlinear) that were considered.

3.3.2. Prediction
Another important task in clusterwise regression for interval-valued variables is to predict new interval-valued observations using the model fitted on the training samples. Given a new observation $x_0 = (x_{01}, \ldots, x_{0p})$, the prediction of the value $y_0 = [y_{L0}, y_{U0}]$ of the interval-valued response variable based on the values of the interval-valued explanatory variables can be achieved either by selecting the "best" regression model for interval-valued variables among the K fitted models (the assignment of the unknown observation to a given cluster and the selection of the associated regression model) or by a suitable ensemble method that uses the K fitted models.
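Once the unknown observation has been assigned to a cluster, the interval prediction is assembled from that cluster's fitted center and half-range models. The sketch below illustrates this step under the assumption that the bounds are recovered as the center minus/plus the half-range; f_center and f_range are hypothetical fitted functions.

import numpy as np

# Hypothetical fitted models for the selected cluster (forms from Table 9;
# the coefficients are made up).
f_center = lambda xc: 3.0 + 1.0 * xc            # linear center model
f_range  = lambda xr: 1.0 - np.exp(-0.5 * xr)   # nonlinear half-range model

def predict_interval(x_low, x_up):
    # Predict [yL0, yU0] from the bounds of the explanatory interval.
    xc, xr = (x_low + x_up) / 2.0, (x_up - x_low) / 2.0
    yc, yr = f_center(xc), max(f_range(xr), 0.0)  # clip the half-range at 0
    return yc - yr, yc + yr

print(predict_interval(2.0, 4.0))   # -> (yL0, yU0)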
In this section, we evaluate the performance of the iCNLR algorithm for this task considering two assignment methods for unknown observations, namely the k-nearest neighbors (KNN) for interval data with the Hausdorff distance and the random assignment (Random), and an ensemble method, the Stacked Regression (SR) [6]; a sketch of the SR combination follows this paragraph. A 10 times 10-fold repeated cross-validation scheme is performed to investigate which method presents the lowest assignment error in the test data sets for each of the 24 scenarios.
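A minimal sketch of the SR combination is given below; it assumes, in the spirit of Breiman [6], that the ensemble weights are obtained by non-negative least squares on the cluster models' predictions, with synthetic values standing in for the real predictions.

import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
# Illustrative data: `preds` holds the (ideally cross-validated) predictions
# of the K = 3 cluster models for n = 50 responses `y`.
y = rng.normal(size=50)
preds = y[:, None] + rng.normal(scale=[0.2, 0.5, 1.0], size=(50, 3))

# Non-negative stacking weights: regress y on the model predictions, w >= 0.
weights, _ = nnls(preds, y)
y_stack = preds @ weights            # ensemble prediction
print("stacking weights:", np.round(weights, 3))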
For each scenario, the iCNLR method is run 100 times on the observations of the 9 training folds. Then, the parameter estimates corresponding to the smallest criterion S_iCNLR are stored. The observations of the test fold are treated as unknown observations and are used to evaluate the predictive performance of the new clusterwise algorithm. The measures used to evaluate the predictive performance in the test fold are the root mean square errors for the lower bound (RMSE_{L_f}), for the upper bound (RMSE_{U_f}) and overall (RMSE_{O_f}), where the index f indicates that the RMSE is based on the predicted values for the f-th fold, f = 1, ..., 10. These measures are defined below:
$$\mathrm{RMSE}_{L_f} = \sqrt{\frac{\sum_{i=1}^{n_f}\left(y_{L_i} - \hat{y}_{L_i}\right)^2}{n_f}}, \tag{25}$$

$$\mathrm{RMSE}_{U_f} = \sqrt{\frac{\sum_{i=1}^{n_f}\left(y_{U_i} - \hat{y}_{U_i}\right)^2}{n_f}}, \tag{26}$$

$$\mathrm{RMSE}_{O_f} = \sqrt{\frac{\sum_{i=1}^{n_f}\left(y_{L_i} - \hat{y}_{L_i}\right)^2 + \sum_{i=1}^{n_f}\left(y_{U_i} - \hat{y}_{U_i}\right)^2}{n_f}}, \tag{27}$$

where $y_i = [y_{L_i}, y_{U_i}]$ is the observed interval of the dependent variable for the $i$-th observation in the test fold, $\hat{y}_i = [\hat{y}_{L_i}, \hat{y}_{U_i}]$ is the corresponding fitted interval, and $n_f$ is the number of observations in the test fold. These measures depend on the scale of the data and should therefore only be used to compare the assignment methods within the same scenario. Finally, the evaluation of the assignment methods is carried out for $K \in \{2, 3\}$.
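For concreteness, the sketch below computes the three measures of Eqs. (25)-(27) for one test fold, using made-up intervals.

import numpy as np

# Illustrative test fold: rows of `y` and `y_hat` hold the observed and
# fitted [lower, upper] bounds of the response interval.
y     = np.array([[1.0, 2.0], [2.5, 4.0], [0.5, 1.5]])
y_hat = np.array([[1.1, 2.1], [2.4, 3.8], [0.6, 1.4]])

sq = (y - y_hat) ** 2
rmse_L = np.sqrt(sq[:, 0].mean())        # Eq. (25)
rmse_U = np.sqrt(sq[:, 1].mean())        # Eq. (26)
rmse_O = np.sqrt(sq.sum() / len(y))      # Eq. (27)
print(rmse_L, rmse_U, rmse_O)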
In particular, in the k-nearest neighbors (KNN) assignment method, the number of neighbors g that minimizes the RMSE was selected from the predefined values g ∈ {1, 3, 5, 7, 9}. Moreover, the Hausdorff distance was chosen because of its wide use in cluster analysis for interval-valued variables [13]; other dissimilarity functions for interval-valued data, such as the Euclidean [12] and City-Block [15] distances, are also suitable. A sketch of this assignment step follows.
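The sketch below implements the KNN assignment with the per-variable Hausdorff distance between intervals, d([a1, b1], [a2, b2]) = max(|a1 - a2|, |b1 - b2|); aggregating the variable-wise distances by a sum is an assumption of this sketch.

import numpy as np

def hausdorff(u, v):
    # u, v: arrays of shape (p, 2) with [lower, upper] bounds per variable.
    return np.sum(np.maximum(np.abs(u[:, 0] - v[:, 0]),
                             np.abs(u[:, 1] - v[:, 1])))

def knn_assign(x0, X_train, labels, g=3):
    # Assign the new observation x0 to the cluster that is most frequent
    # among its g nearest training observations.
    d = np.array([hausdorff(x0, x) for x in X_train])
    nearest = labels[np.argsort(d)[:g]]
    return np.bincount(nearest).argmax()   # majority vote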
Table 9 shows the list of linear and nonlinear functions provided to the iCNLR algorithm, which were used in this paper as candidate linear/nonlinear relationships between the centers and between the ranges of the intervals.

Table 9
List of linear and nonlinear functions provided for the iCNLR algorithm.

Label | $y_i = f(x_i; \beta)$
1 | $y_i = x_i^{(\beta_0 - 1)} e^{-x_i/\beta_1}$
2 | $y_i = \beta_0 - \frac{\beta_1}{\beta_2 + x_i}$
3 | $y_i = \frac{\beta_1 x_i}{\beta_0 + x_i}$
4 | $y_i = \beta_0 + \beta_1 x_i$
5 | $y_i = \beta_2 + \frac{1 - \beta_2}{1 + e^{\beta_0 - \beta_1 \log x_i}}$
6 | $y_i = \frac{1}{1 + e^{\beta_0 - \beta_1 \log x_i}}$
7 | $y_i = \beta_2 + \frac{1 - \beta_2}{1 + e^{\beta_0 - \beta_1 x_i}}$
8 | $y_i = \frac{1}{1 + e^{\beta_0 - \beta_1 x_i}}$
9 | $y_i = \beta_0 + (1 - \beta_0)(1 - e^{-\beta_1 x_i})$
10 | $y_i = 1 - e^{-\beta_0 x_i}$
11 | $y_i = \beta_0 + (1 - \beta_0)(1 - e^{-\beta_1 x_i - \beta_2 x_i^2})$
12 | $y_i = 1 - e^{-\beta_0 x_i - \beta_1 x_i^2}$
13 | $y_i = 1 - e^{-\beta_0 x_i - \beta_1 x_i^2 - \beta_2 x_i^3}$

The proposed algorithm evaluates the pair of functions that presents the best fit for the center $f^c(x_i^c; \beta_0^c, \beta_1^c)$ and for the half-range $f^r(x_i^r; \beta_0^r, \beta_1^r)$ of the synthetic interval-valued data sets, taking into account the list of functions in this table. In practice, however, the number of functions supplied to the algorithm depends on the researcher's/practitioner's prior knowledge of the problem. A sketch of this selection step is given below.
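The selection can be sketched as an exhaustive fit-and-compare loop over the candidate list; the sketch below does this for one cluster's centers (the half-ranges are handled analogously), with scipy's curve_fit as a hypothetical stand-in for the paper's optimization heuristics, and only three of the candidate forms of Table 9.

import numpy as np
from scipy.optimize import curve_fit

# A few candidate forms in the spirit of Table 9 (labels 4, 8 and 10).
candidates = {
    "linear":   lambda x, b0, b1: b0 + b1 * x,
    "logistic": lambda x, b0, b1: 1.0 / (1.0 + np.exp(b0 - b1 * x)),
    "growth":   lambda x, b0: 1.0 - np.exp(-b0 * x),
}

def best_fit(x, y):
    # Fit every candidate and keep the one with the smallest SSE;
    # non-converged fits are simply discarded.
    best = (None, None, np.inf)
    for name, f in candidates.items():
        try:
            popt, _ = curve_fit(f, x, y, maxfev=5000)
        except RuntimeError:
            continue
        sse = float(np.sum((y - f(x, *popt)) ** 2))
        if sse < best[2]:
            best = (name, popt, sse)
    return best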

3.3.3. Overall error (RMSE_O)

This section presents a comparison study between the two assignment methods (KNN, Random) and the ensemble method (SR) based on the RMSE_O measure. The results for the RMSE_L and RMSE_U measures are quite similar and are presented in the supplementary material in order to reduce the length of the manuscript.
Table 10 presents the mean and standard deviation (s.d.) of the RMSE_O error in the test folds for scenarios 1 to 6, which consider a linear relationship between the centers and between the ranges of the interval-valued variables. The KNN method presented the best performance in scenarios 1 to 4, which consider disjoint (D) and intersecting (I) clusters, showing the lowest mean RMSE_O in the test data sets together with a small standard deviation. In scenario 5, which considers overlapping clusters, the SR method showed the best allocation performance for the iCNLR algorithm.
Table 11 presents the results for scenarios 7 to 12, which consider a nonlinear relationship between the ranges and a linear relationship between the centers of the interval-valued variables. The KNN method presented the best performance in scenarios 7 to 10. In scenario 11, which considers overlapping clusters, KNN and SR showed a similar performance. Apart from scenario 12, where the three methods performed similarly, the Random assignment exhibited the worst performance in the majority of the scenarios considered.
Table 12 presents the results for scenarios 13 to 18, which consider a nonlinear relationship between the centers and a linear relationship between the ranges of the interval-valued variables. Again, the KNN method outperformed the others in the majority of the configurations (disjoint and intersecting clusters), while the SR method presented the best performance in scenarios 17 and 18, which considered overlapping clusters.

Table 10
Comparative performance of the assignment methods based on RMSEO overall error. Mean and standard deviation in test folds of the cross-validation scheme.
Scenarios 1 to 6.

Scenario | Statistic | 2 Clusters: KNN, SR, Random | 3 Clusters: KNN, SR, Random
1 Mean 0.099 0.126 0.155 0.099 0.141 0.167
s.d. 0.023 0.034 0.043 0.021 0.038 0.061
2 Mean 0.224 2.013 2.302 0.083 2.856 1975.606
s.d. 0.044 0.501 0.515 0.016 1.098 6752.586
3 Mean 0.116 0.965 1.361 0.098 0.951 1.352
s.d. 0.075 0.124 0.362 0.029 0.143 0.339
4 Mean 0.787 1.929 2.404 0.431 2.355 2.987
s.d. 0.539 0.362 0.779 0.532 0.354 0.700
5 Mean 1.006 0.835 1.113 1.029 0.878 1.220
s.d. 0.262 0.078 0.274 0.300 0.105 0.244
6 Mean 1.673 1.429 2.031 1.798 1.441 2.014
s.d. 0.337 0.113 0.372 0.371 0.103 0.351


Table 11
Comparative performance of the assignment methods based on RMSEO overall error. Mean and standard deviation in test folds of the cross-validation scheme.
Scenarios 7 to 12.

Scenario | Statistic | 2 Clusters: KNN, SR, Random | 3 Clusters: KNN, SR, Random
7 Mean 0.632 2.804 3.368 0.105 8.172 6.725
s.d. 0.346 1.242 2.435 0.037 2.160 4.374
8 Mean 0.983 3.055 3.079 0.509 5.475 10.533
s.d. 0.160 1.046 0.757 0.414 2.178 9.418
9 Mean 0.864 2.092 2.971 0.934 2.241 2.815
s.d. 0.916 0.354 1.026 0.958 0.446 0.882
10 Mean 0.919 1.301 1.772 0.676 1.932 2.787
s.d. 0.281 0.209 0.432 0.480 0.494 0.481
11 Mean 1.805 1.741 2.313 1.855 1.857 2.532
s.d. 0.861 0.308 0.878 0.866 0.442 7.851
12 Mean 2.314 2.174 2.599 2.464 2.059 2.654
s.d. 0.644 0.492 0.844 0.592 0.465 7.741

Table 12
Comparative performance of the assignment methods based on RMSEO overall error. Mean and standard deviation in test folds of the cross-validation scheme.
Scenarios 13 to 18.

Scenario | Statistic | 2 Clusters: KNN, SR, Random | 3 Clusters: KNN, SR, Random
13 Mean 0.116 3.157 6.540 0.096 3.478 5.682
s.d. 0.045 1.322 2.612 0.027 1.451 3.383
14 Mean 1.474 8.912 11.171 0.161 11.392 30.198
s.d. 0.302 7.309 9.111 0.086 6.225 13.598
15 Mean 1.164 2.273 3.449 1.183 2.759 3.432
s.d. 1.089 0.273 0.963 1.117 0.788 0.933
16 Mean 1.239 1.616 2.109 0.926 1.762 3.362
s.d. 0.312 0.345 0.527 0.497 0.340 1.088
17 Mean 2.186 1.880 2.468 2.212 2.151 2.517
s.d. 0.679 0.228 0.735 0.650 0.440 0.691
18 Mean 2.206 2.087 2.696 2.397 2.069 2.799
s.d. 0.632 0.514 0.742 0.694 0.371 0.882

Table 13
Comparative performance of the assignment methods based on RMSEO overall error. Mean and standard deviation in test folds of the cross-validation scheme.
Scenarios 19 to 24.

Scenario | Statistic | 2 Clusters: KNN, SR, Random | 3 Clusters: KNN, SR, Random
19 Mean 0.411 2.764 4.905 0.093 4.018 5.291
s.d. 0.327 1.231 3.649 0.028 1.881 3.256
20 Mean 1.055 2.791 3.204 0.595 4.947 8.064
s.d. 0.212 0.801 0.980 0.390 1.762 6.740
21 Mean 1.685 2.190 3.253 1.709 2.304 3.011
s.d. 1.166 0.258 0.710 1.232 0.279 8.046
22 Mean 3.028 6.420 9.656 2.720 6.648 12.574
s.d. 1.994 2.076 5.591 2.254 1.704 6.880
23 Mean 1.749 1.557 1.999 1.795 1.684 2.098
s.d. 0.746 0.242 0.786 0.688 0.383 0.672
24 Mean 2.262 2.102 2.743 2.610 2.032 2.824
s.d. 0.583 0.388 0.695 0.739 0.333 6.574

Table 13 presents the results for scenarios 19 to 24. In scenarios 19 and 20, where the clusters do not overlap, the KNN method had the lowest average RMSE; the same occurs in scenarios 21 and 22, where the overlap is partial. In scenarios 23 and 24, the Stacked Regression method is the best method for the iCNLR algorithm.
Finally, the overall conclusion is that KNN was the best method for the iCNLR algorithm in the scenarios with disjoint (D) and partially overlapping (I) clusters. On the other hand, in the majority of the scenarios with overlapping (U) clusters, SR was the best approach.


Fig. 3. Dispersion plots for the center, range and intervals and fitted models for iCNLR algorithm with three clusters in Cardio interval-valued data set.

4. Applications to real interval-valued data sets

In this section, a comparison study between the new iCNLR algorithm and its linear counterpart, the iCLR algorithm, is performed. The independent run of the K-means and CRM [36] algorithms on the data sets is also considered. This is referred to hereafter as the "Two-step Approach": K-means is run on each data set to obtain a partition into K clusters, and then a linear regression model is fitted on each cluster using the CRM algorithm. A sketch of this baseline is given below.
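The sketch below gives one compact rendering of this baseline; clustering on the interval midpoints and half-ranges, and using ordinary least squares for the CRM-style fits, are assumptions of the sketch.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

def two_step(X, Y, K=3):
    # X, Y: (n, 2) arrays with the [lower, upper] bounds of the explanatory
    # and response intervals.
    Xc, Xr = X.mean(axis=1), (X[:, 1] - X[:, 0]) / 2.0   # centers, half-ranges
    Yc, Yr = Y.mean(axis=1), (Y[:, 1] - Y[:, 0]) / 2.0
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(
        np.column_stack([Xc, Xr]))
    models = []
    for k in range(K):
        m = labels == k
        # CRM-style step: independent linear fits for centers and half-ranges.
        models.append((LinearRegression().fit(Xc[m, None], Yc[m]),
                       LinearRegression().fit(Xr[m, None], Yr[m])))
    return labels, models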
The iCNLR and iCLR algorithms, as well as the "Two-step Approach", are compared on the real interval-valued data sets in terms of the value of their respective objective function S for K ∈ {1, 2, 3}. Furthermore, the advantages/drawbacks of the clusterwise methods (iCNLR and iCLR with K > 1) are highlighted in comparison with the fitting of a single regression model on the data sets (iCNLR and iCLR methods with K = 1).
Later in this section (Section 4.7), they are also compared in terms of their predictive performance on unknown observations, measured by the root mean square error (RMSE) in test data sets within a 10 times 10-fold cross-validation scheme. Finally, a Mann-Whitney statistical test is performed to evaluate the difference between the iCLR and iCNLR methods in predicting unknown observations, according to the two assignment methods (KNN and Random) and the ensemble method SR; a sketch of this test is given below.
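The test itself amounts to comparing the two sets of per-fold RMSE values; a sketch with synthetic values follows. Whether a one- or two-sided alternative was used is not stated in the text, so the one-sided choice below is an assumption.

import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(2)
# Synthetic per-fold RMSE values from a 10 x 10-fold cross-validation
# (100 values per method), standing in for the real results.
rmse_iclr  = rng.normal(1.6, 0.1, size=100)
rmse_icnlr = rng.normal(1.5, 0.1, size=100)

# One-sided alternative: iCNLR errors tend to be smaller than iCLR errors.
stat, p = mannwhitneyu(rmse_icnlr, rmse_iclr, alternative="less")
print(f"U = {stat:.1f}, p-value = {p:.4f}")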

4.1. Cardio data set

This interval-valued data set [2] provides the relationship between systolic (X) and diastolic (Y) blood pressures for 59
patients. Measurements were taken throughout the day and the maximum and the minimum pressure values were com-
puted. Thus, a clusterwise regression model is used to identify and model different patient groups. Fig. 3 illustrates the cen-
ter, range, and interval dispersion plots for the Cardio interval-valued data set as well as the fitted models for the iCNLR
algorithm with three clusters.
For each method, Table 14 provides the functions that best fit the Cardio data set for each cluster, the parameter estimates for each model, as well as the objective function S, for K = 1, 2, 3.
This table also presents the nonlinear function that best fits the data of each cluster, according to the list of nonlinear functions presented in Table 9, as well as the parameter estimates for each model. With respect to the objective function S (except for K = 1), it can be seen that its value is smaller with the new iCNLR clusterwise method than with the iCLR method. Moreover, the value of S is smaller for the iCNLR and iCLR methods than for the "Two-step Approach". This means that the clusterwise regression methods presented a better fit to this data set, even in the linear case (iCLR method), in comparison with the "Two-step Approach". Finally, it should be noted that the new iCNLR clusterwise algorithm selected a linear model in some clusters, which confirms that the new approach is more flexible than the previous linear algorithm (iCLR).

4.2. Tree data set

The data set Tree [24] contains the following interval-valued variables: trunk volume (Y) and height (X) from 60 clusters
of Eucalyptus of the region of Araripina, Brazil. Fig. 4 shows the fitted models by the iCNLR method, considering three
clusters.
Table 15 presents, for each method, the functions that best fit the Tree data set for each cluster, the parameter estimates
for each model, as well as the objective function S, for K = 1, 2, 3. With respect to the objective function S, it can be noted that its value is smaller for the iCNLR and iCLR methods than for the "Two-step Approach". Again, the use of a clusterwise

Table 14
Comparison between the clusterwise algorithms according to the objective function S and the number of clusters for the Cardio interval-valued data set.

Method | Clusters | S | Selected models: Center, Range | Center estimates: β0, β1, β2 | Range estimates: β0, β1, β2
Two-step Approach (K-means + Linear 1 65.069 – – 1.685 0.453 – 1.584 0.257 –
Regression) 2 60.960 – – 6.335 0.102 – 2.118 0.074 –
– – 0.096 0.566 – 1.509 0.283 –
3 46.07 – – 10.946 0.172 – 1.905 0.178 –
– – 5.888 0.138 – 2.116 0.080 –
– – 1.296 0.523 – 0.397 0.593 –
iCLR 1 65.066 – – 1.685 0.453 – 1.584 0.257 –
2 25.227 – – 1.404 0.523 – 1.385 0.391 –
– – 4.031 0.242 – 1.501 0.212 –
3 16.126 – – 4.997 0.261 – 2.449 0.095 –
– – 1.992 0.352 – 1.269 0.248 –
– – 1.346 0.687 – 0.265 0.612 –
iCNLR 1 63.195 9 7 3.901 0.062 1.613 1.865 0.238 6.608
2 24.753 9 2 4.264 0.061 0.431 0.501 28.055 12.217
9 Linear 4.720 0.038 1.810 1.501 0.212 –
3 14.205 7 Linear 7.974 0.592 11.911 0.050 0.662 –
2 11 12.974 62.366 0.611 1.571 0.674 0.097
7 3 15.007 1.514 7.631 2.952 4.162 –

Fig. 4. Dispersion plots for the center, range and intervals and fitted models for iCNLR algorithm with three clusters in Tree interval-valued data set.

procedure provides a better fit for the regression models on this data set, even in the linear case (iCLR algorithm), in comparison with the "Two-step Approach".

4.3. Unemployment data set

This interval-valued data set reports on unemployment in Portugal [17] based on the logarithm of unemployment time
(X) and the time that people have worked previously (Y), for 58 classes of individuals grouped according to gender, region,
age and education. In this case, the aim is to predict the time of work experience (Y) from the time which the person takes to
secure employment (X). For each class of individuals, the variables record the minimum and maximum values observed for a
set of individuals.
Fig. 5 shows the dispersion plot for the values of the center, of the ranges and of the interval-valued variables as well as
the fitted models for the iCNLR algorithm with three clusters. For this algorithm, the best fit combined linear and nonlinear models.
For each method, Table 16 provides the functions that best fit the Unemployment data set for each cluster, the parameter
estimates for each model, and the objective function S, for K = 1, 2, 3. An important difference can be seen in the objective function S between the clusterwise algorithms and the "Two-step Approach" for K = 2, 3. This result highlights the gain

Table 15
Comparison between the clusterwise algorithms based on the objective function S and the number of clusters for the Tree interval-valued data set.

Method | Clusters | S | Selected models: Center, Range | Center estimates: β0, β1, β2 | Range estimates: β0, β1, β2
Two-step Approach (K-means + Linear Regression) 1 0.278 – – 0.361 0.031 – 0.044 0.016 –
2 0.237 – – 0.067 0.011 – 0.021 0.015 –
– – 0.426 0.035 – 0.077 0.011 –
3 0.225 – – 0.679 0.049 – 0.081 0.007 –
– – 0.516 0.020 – 0.096 0.005 –
– – 0.067 0.011 – 0.021 0.015 –
iCLR 1 0.278 – – 0.361 0.031 – 0.044 0.016 –
2 0.086 – – 0.128 0.015 – 0.012 0.019 –
– – 0.266 0.029 – 0.083 0.037 –
3 0.046 – – 0.337 0.033 – 0.086 0.039 –
– – 0.129 0.014 – 0.014 0.013 –
– – 0.082 0.014 – 0.058 0.013 –
iCNLR 1 0.275 7 11 10.884 0.507 0.065 0.039 0.023 0.001
2 0.084 8 12 4.539 0.152 – 0.031 0.002 –
Linear 13 0.334 0.032 – 0.333 0.223 0.045
3 0.036 6 2 9.706 2.913 – 0.100 0.004 0.710
6 13 10.450 2.990 – 0.03 0.001 0.001
2 13 0.884 10.733 0.021 0.384 0.228 0.041

Fig. 5. Dispersion plots for the center, range, intervals and fitted models for iCNLR algorithm with three clusters in Unemployment interval-valued data set.

obtained in terms of a better fit for the regression models on the data set due to the use of a clusterwise technique. Moreover,
by looking at the value of the objective function S, it can be seen that the iCNLR method provides a better fit for the regres-
sion models on this data set than the iCLR method, for K = 2, 3.

4.4. Mushroom data set

The Mushroom data set [38] presents 23 species of the Amanita mushroom family. The experiment consists of predicting the stipe thickness (Y) by means of the pileus (cap) size (X).
Fig. 6 presents the dispersion plots for the center, for the range, and for the interval-valued variables as well as the fitted
models for the iCNLR algorithm with three clusters.
Table 17 presents, for each method, the functions that best fit the Mushroom data set for each cluster, the parameter esti-
mates for each model, as well as the objective function S, for K = 1, 2, 3. Again, from the value of S for K = 2, 3, it can be seen that the clusterwise methods provide a better fit for the regression models in this data set than the "Two-step Approach".

4.5. Soccer data set

The interval-valued data set Soccer [38] presents the variables weight (Y) and height (X) of 531 French football players
grouped in 20 teams. The grouping was undertaken by taking the minimum and maximum values of each variable per team.

Table 16
Comparison between the clusterwise algorithms based on the objective function S and the number of clusters for the Unemployment interval-valued data set.

Method | Clusters | S | Selected models: Center, Range | Center estimates: β0, β1, β2 | Range estimates: β0, β1, β2
Two-step Approach (K-means + Linear 1 6833.88 – – 9.428 0.137 – 5.101 0.094 –
Regression) 2 6564.21 – – 7.769 0.223 – 4.091 0.129 –
– – 16.181 0.111 – 9.735 0.051 –
3 6487.63 – – 4.808 0.439 – 4.697 0.050 –
– – 11.132 0.147 – 16.230 0.007 –
– – 11.153 0.166 – 7.570 0.078 –
iCLR 1 6833.88 – – 9.428 0.137 – 5.101 0.094 –
2 1955.70 – – 2.379 0.242 – 4.551 0.102 –
– – 33.159 0.029 – 6.592 0.079 –
3 1041.80 – – 2.437 0.193 – 2.993 0.14 –
– – 12.570 0.204 – 10.255 0.044 –
– – 36.018 0.086 – 7.051 0.034 –
iCNLR 1 6477.53 5 7 3.366 0.859 41.358 1.354 0.038 17.098
2 1250.05 2 Linear 36.583 63.908 7.917 6.946 0.077 –
7 7 2.484 0.075 18.046 2.469 0.078 14.996
3 759.41 2 2 40.710 172.879 1.978 17.689 306.317 18.552
7 Linear 4.870 0.097 30.945 0.626 0.135 –
7 2 16.096 1.597 15.256 15.936 163.263 11.244

Fig. 6. Dispersion plots for the center, range, intervals and fitted models for iCNLR algorithm with three clusters in Mushroom interval-valued data set.

Fig. 7 presents the center, the range and the interval-valued variables as well as the models fitted by the algorithm iCNLR
with K = 3.
Table 18 presents, for each method, the functions that best fit the Soccer data set for each cluster, the parameter estimates
for each model, as well as the objective function S, for K = 1, 2, 3. From the value of S for K = 2, 3, it can be seen that the clusterwise methods provide a better fit for the regression models in this data set than the "Two-step Approach". Note that
the iCNLR suggests nonlinear models for cluster 1, linear models for cluster 3 and both models (linear and nonlinear) for clus-
ter 2. This is an advantage in comparison with the iCLR algorithm.

4.6. Scientific production data set

This interval-valued data set records information about the scientific production of 430 universities and research insti-
tutes in Brazil. The original database contains information about 141,260 Brazilian researchers, each one described by a con-
tinuous numerical variable representing the average production, computed over three years (2006, 2007 and 2008), for each
researcher.
Ref. [42] grouped the original database, according to institute and sub-area of knowledge, resulting in an interval-valued
data set. After a pre-processing step to remove inconsistent data, the resulting interval-valued data set has 2,656 items and
describes the scientific production of the institutes according to the subject area of knowledge. The explanatory variables
were chosen using prior expert knowledge and they are: "Number of concluded PhD" (X1) and "Number of concluded MSc"

Table 17
Comparison between the clusterwise algorithms based on the objective function S and the number of clusters for the Mushroom interval-valued data set.

Method | Clusters | S | Selected models: Center, Range | Center estimates: β0, β1, β2 | Range estimates: β0, β1, β2
Two-step Approach (K-means + linear 1 349.634 – – 1.264 4.651 – 2.300 3.225 –
regression) 2 280.728 – – 4.047 2.409 – 1.436 2.793 –
– – 0.014 5.392 – 9.360 3.525 –
3 31.086 – – 4.047 2.409 – 1.436 2.793 –
– – 10.498 0.256 – 2.886 2.252 –
– – 82.500 24.000 – 21.833 10.666 –
iCLR 1 349.634 – – 1.264 4.651 – 2.300 3.225 –
2 49.346 – – 82.5 24 – 21.833 1.167 –
– – 4.221 2.586 – 10.666 3.853 –
3 18.710 – – 4.271 2.299 – 0.086 5.043 –
– – 7.5 12.0 – 7.5 48.0 –
– – 0.589 5.783 – 0.462 5.931 –
iCNLR 1 331.555 11 3 5.662 0.018 0.152 0.317 6.982 –
2 43.274 2 Linear 6.198 2.481 2.998 0.184 5.231 –
11 11 6.264 0.203 0.306 1.186 14.613 11.548
3 14.318 2 2 9.703 0.410 0.571 5.980 13.650 4.824
7 7 1.895 10.655 3.478 3.966 4.616 8.251
11 11 5.753 1.742 18.385 7.042 0.096 0.279

Fig. 7. Dispersion plots for the center, range, intervals and fitted models for iCNLR algorithm with three clusters in Soccer interval-valued data set.

(X2). These independent variables were considered to explain the response variable ‘‘Number of published papers” (Y). Fig. 8
illustrates the relationship between each explanatory variable (X1 and X2) and the response variable (Y) in terms of center,
range and interval-valued observations. The presence of a nonlinear relationship seems reasonable.
Some nonlinear functions with two independent variables are presented in Table 19. These functions are considered in
the iCNLR algorithm.
For each method, Table 20 provides the functions that best fit the Scientific Production data set for each cluster, the parameter estimates for each model, as well as the objective function S, for K = 1, 2, 3. From the value of S for K = 2, 3, note that, once again, the clusterwise methods provide a better fit for the regression models in this data set than the "Two-step Approach".

4.7. Predictive performance on the real interval-valued data-sets

This section compares the predictive performance on unknown observations of the clusterwise methods and of the "Two-step Approach" on the real interval-valued data sets. The RMSE, computed under a 10 times 10-fold cross-validation scheme, is used to evaluate each approach, taking into account the two assignment methods (KNN and Random) and the ensemble method SR. Note that the iCNLR algorithm considers the nonlinear functions of Table 9 for the data sets with one independent variable, and those of Table 19 for the data set with two independent variables.

Table 18
Comparison between the clusterwise algorithms based on the objective function S and the number of clusters for the Soccer interval-valued data set.

Method | Clusters | S | Selected models: Center, Range | Center estimates: β0, β1, β2 | Range estimates: β0, β1, β2
Two-step Approach (K-means + linear 1 116.808 – – 19.270 0.519 – 3.675 0.665 –
regression) 2 100.427 – – 62.373 0.762 – 6.018 0.432 –
– – 18.338 0.308 – 2.435 1.165 –
3 95.056 – – 96.434 0.943 – 9.736 0.077 –
– – 14.333 0.333 – 12.983 0.336 –
– – 9.909 0.466 – 11.285 1.785 –
iCLR 1 116.808 – – 19.270 0.519 – 3.675 0.665 –
2 53.300 – – 11.906 0.341 – 2.105 0.786 –
– – 58.332 0.752 – 6.460 0.467 –
3 23.513 – – 12.083 0.501 – 5.415 0.680 –
– – 55.694 0.106 – 0.734 0.825 –
– – 30.187 0.235 – 3.530 0.698 –
iCNLR 1 108.258 Linear 11 0.101 0.411 – 12.832 0.105 0.007
2 47.658 2 11 153.187 6478.557 95.844 14.282 0.109 0.007
2 9 114.724 2918.823 110.688 4.626 0.088 5.687
3 22.302 2 9 117.911 3051.230 103.657 7.349 0.055 4.531
Linear 9 30.187 0.235 – 5.821 0.067 1.103
Linear Linear 55.694 0.106 – 0.734 0.825 –

Fig. 8. Dispersion plots for the center, range and intervals for Scientific Production interval-valued data set.

In accordance with the results presented in Table 21, the iCNLR method presented a better predictive performance than the iCLR method, and KNN was the best assignment method. Another important remark is that the iCNLR method outperformed the "Two-step Approach", which means that the use of a clusterwise procedure provides better results than the independent run of the K-means and CRM algorithms.

Table 19
List of linear and nonlinear functions used in the iCNLR algorithm.

Label | $y_i = f(x_{i1}, x_{i2}; \beta)$
14 | $y_i = \exp\left(\beta_1 x_{i1} \exp\left(\beta_0 \left(\frac{1}{x_{i2}} - \frac{1}{620}\right)\right)\right)$
15 | $y_i = \frac{\beta_0}{1 + \beta_1 (x_{i1} + \beta_2 x_{i2})}$
16 | $y_i = \beta_0 + \beta_1 \log\left(\frac{x_{i1}}{\beta_2 + x_{i2}}\right)$
17 | $y_i = \frac{\beta_1 x_{i1}}{x_{i2} + \beta_2}$
18 | $y_i = \beta_0 + \beta_1 \log\left(x_{i1} + \beta_2 x_{i2} + \beta_3 (\beta_2 x_{i1} x_{i2})^{1/2}\right)$
19 | $y_i = \beta_0 + \beta_1 \log(x_{i1} + \beta_2 x_{i2})$
20 | $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$

Table 20
Comparison between the clusterwise algorithms based on the objective function S and the number of clusters for the Scientific Production interval-valued data set.

Method | Clusters | S | Selected models: Center, Range | Center estimates: β0, β1, β2, β3 | Range estimates: β0, β1, β2, β3
Two-step Approach (K- 1 1035875 – – 12.22 1.87 0.95 – 11.14 2.00 1.03 –
means + Linear Regression) 2 1002185 – – 12.16 1.49 2.08 – 10.89 1.58 2.52 –
– – 13.20 1.37 1.63 – 12.05 1.46 1.98 –
3 993342 – – 11.25 3.08 2.81 – 9.81 3.36 3.35 –
– – 11.34 2.89 2.59 – 9.93 3.18 3.12 –
– – 11.89 2.89 2.25 – 10.56 3.18 2.73 –
iCLR 1 1035875 – – 12.22 1.87 0.94 – 11.14 2.00 1.03 –
2 456872 – – 9.19 6.91 0.18 – 8.00 7.28 0.24 –
– – 100.94 9.90 2.63 – 99.39 9.78 2.69 –
3 254889 – – 8.48 3.40 0.10 – 7.39 3.73 0.14 –
– – 153.39 60.36 0.75 – 150.50 59.31 0.50 –
– – 25.92 1.77 1.28 – 24.44 1.72 1.53 –
iCNLR 1 990360 18 18 7.79 6.02 0.26 6.72 6.17 6.55 0.43 4.93
2 410231 15 15 175.32 1.29 0.004 – 173.81 1.31 0.004 –
18 18 12.00 4.82 0.01 9.97 9.85 5.29 0.02 12.74
3 241341 15 15 21.92 0.04 1.92 – 20.89 0.04 1.74 –
15 15 187.10 1.37 0.04 – 184.00 1.36 0.05 –
18 18 2.84 2.79 0.05 93.96 0.29 3.24 0.06 101.34

Table 22 synthesizes the overall performance of the methods in the real interval-valued data sets, according to the num-
ber of clusters, the two assignment methods (KNN and Random) and the ensemble method SR. For each real data set, the
method with the lowest RMSE was ranked 1, and so on. The new iCNLR algorithm combined with the KNN assignment method achieved the lowest average rank, while the two-step approach presented the worst performance. In conclusion, the new clusterwise algorithm iCNLR presented the best average predictive performance for these data sets.
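The average-rank summary of Table 22 can be reproduced mechanically, as in the sketch below with illustrative RMSE values (loosely based on the 2-cluster rows of Table 21) for the nine method/allocation combinations on two data sets.

import numpy as np
from scipy.stats import rankdata

rmse = np.array([[1.93, 2.54, 2.06, 1.51, 2.79, 1.87, 1.47, 1.57, 1.99],
                 [0.15, 0.18, 0.19, 0.03, 0.22, 0.05, 0.10, 0.10, 0.14]])

ranks = np.vstack([rankdata(row) for row in rmse])  # rank 1 = lowest RMSE
print("average ranks:", ranks.mean(axis=0))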
Finally, Table 23 provides a comparison between the iCNLR and iCLR algorithms taking into account the values of the RMSE in a 10 times 10-fold cross-validation scheme, based on the Mann-Whitney test. For a fixed number of clusters, the two assignment methods (KNN and Random) and the ensemble method SR, the iCNLR outperformed the linear algorithm at a significance level α = 5% for the majority of the real interval-valued data sets, the exceptions being the Tree and Scientific Production data sets. Moreover, the SR and KNN methods presented the best performance among the assignment and ensemble methods.

4.8. Future work

For future work, advances in clusterwise regression methods for interval-valued data considering other regression methods as well as other clustering methods could be developed. Moreover, the extension of clusterwise regression methods to other kinds of data, such as histogram-valued data, would be a challenging research direction. In addition, the assignment methods used with the clusterwise regression methods can be improved, for example by evaluating other distances (Euclidean, City-Block) for interval-valued data in the KNN method. Finally, the use of other regression models (nonlinear, robust regression, etc.) to adjust the weights in the Stacked Regression prediction, or the use of other ensemble methods [21] with the clusterwise regression method, are also potential improvements.

Table 21
Predictive performance of the clusterwise methods and the "Two-step Approach" in the real interval-valued data sets, according to the two assignment methods, the ensemble method, and the number of clusters.

Data set | Method | RMSE statistic | 1 Cluster | 2 Clusters: KNN, SR, Random | 3 Clusters: KNN, SR, Random
Cardio Two-step approach (K-means + linear Mean – 1.930 2.541 2.064 3.605 4.004 3.726
regression) s.d – 0.763 0.831 1.235 1.584 1.032 1.235
iCLR Mean 1.516 1.512 2.788 1.873 1.605 2.418 2.06
s.d 0.026 0.068 0.197 0.16 0.096 0.292 0.139
iCNLR Mean 1.508 1.469 1.573 1.986 1.634 2.003 2.308
s.d. 0.023 0.072 0.062 0.134 0.059 0.767 0.588
Tree Two-step Approach (K-means + linear Mean – 0.151 0.178 0.191 0.178 2.390 0.245
regression) s.d. – 0.033 0.048 0.052 0.024 0.499 0.074
iCLR Mean 0.036 0.032 0.217 0.05 0.034 0.19 0.053
s.d. 0.001 0.002 0.008 0.003 0.001 0.012 0.003
iCNLR Mean 0.097 0.098 0.103 0.135 0.096 0.708 0.147
s.d. 0.002 0.007 0.006 0.012 0.009 1.538 0.011
Unemployment Two-step approach (K-means + linear Mean – 18.571 31.935 32.447 18.762 23.486 25.644
regression) s.d. – 0.752 2.666 3.839 4.687 4.950 4.485
iCLR Mean 15.567 14.111 24.348 22.519 14.017 21.938 22.358
s.d. 0.207 0.519 1.772 1.718 0.656 0.759 0.775
iCNLR Mean 15.94 13.032 25.607 33.776 13.55 21.561 27.437
s.d. 0.429 0.707 16.509 27.308 0.775 5.566 8.573
Mushroom Two-step approach (K-means + linear Mean – 6.070 4.988 5.089 8.916 8.981 8.287
regression) s.d. – 3.112 1.819 1.507 1.802 3.652 2.004
iCLR Mean 4.957 5.54 5.551 12.843 7.126 9.441 21.02
s.d. 0.165 0.377 0.711 2.959 0.391 2.663 8.577
iCNLR Mean 4.674 4.493 9.942 20.362 5.341 102.46 18.833
s.d. 0.257 0.351 2.073 7.352 0.915 220.924 23.066
Soccer Two-step approach (K-means + linear Mean – 4.418 4.151 5.445 6.795 6.633 7.354
regression) s.d. – 1.496 0.917 1.972 1.838 2.586 2.052
iCLR Mean 6.158 6.842 144.711 7.816 8.763 189.797 9.211
s.d. 0.199 0.516 25.167 0.827 0.549 77.276 1.874
iCNLR Mean 3.581 3.841 3.759 4.3 4.454 4.279 5.248
s.d. 0.06 0.183 0.228 0.441 0.525 0.647 0.417
Scient. Prod. Two-step approach (K-means + linear Mean – 32.15 45.85 234.33 30.32 40.25 221.98
regression) s.d. – 4.85 21.80 71.24 10.40 18.49 65.66
iCLR Mean 27.08 26.83 30.91 122.95 27.29 30.78 179.47
s.d. 7.52 7.77 6.67 6.71 7.73 6.80 122.50
iCNLR Mean 26.07 25.98 30.31 127.33 26.96 30.17 119.01
s.d. 8.61 8.72 8.08 46.73 8.75 7.92 12.41

Table 22
Overall predictive performance of the clusterwise algorithms and the "Two-step Approach" in the real interval-valued data sets.

Clusters | Method | Allocation | Rank per data set: Cardio, Tree, Unemployment, Mushroom, Soccer, Scient. Prod. | Avg. rank
2 Two-step Approach (K-means + linear KNN 5 6 3 6 5 5 5.0
regression) SR 8 7 7 3 3 6 5.7
Random 7 8 8 1 6 9 6.5
iCLR KNN 2 1 2 4 7 2 3.0
SR 9 9 5 5 9 4 6.8
Random 4 2 4 8 8 7 5.5
iCNLR KNN 1 3 1 2 2 1 1.7
SR 3 4 6 7 1 3 4.0
Random 6 5 9 9 4 8 6.8
3 Two-step Approach (K-means + linear KNN 7 5 3 4 5 4 4.7
regression) SR 9 9 7 5 4 6 6.7
Random 8 7 8 3 6 9 6.8
iCLR KNN 1 1 2 2 7 2 2.5
SR 6 6 5 6 9 5 6.2
Random 4 2 6 8 8 8 6.0
iCNLR KNN 2 3 1 1 2 1 1.7
SR 3 8 4 9 1 3 4.7
Random 5 4 9 7 3 7 5.8


Table 23
Real interval-valued data sets: comparison between the iCLR and iCNLR algorithms based on the Mann-Whitney test. P-values according to the number of clusters and the assignment and ensemble methods.

Real data set | 2 Clusters: KNN, SR, Random | 3 Clusters: KNN, SR, Random
Cardio 0.165 0.000 0.123 0.315 0.018 0.247
Tree 1.000 1.000 1.000 1.000 1.000 1.000
Unemployment 0.000 0.043 0.165 0.190 0.035 0.123
Mushroom 0.000 0.000 0.014 0.000 0.853 0.023
Soccer 0.000 0.000 0.000 0.000 0.000 0.000
Scient. Prod. 0.455 0.370 0.700 0.514 0.427 0.091

5. Concluding remarks

This paper proposed a new interval clusterwise nonlinear regression method for interval-valued data (iCNLR), which is able to fit nonlinear models to a set of K homogeneous groups of observations. The iCNLR method provides the best pair of functions adjusted for the center and the range of each group of interval-valued data, according to an optimization criterion. The iCNLR method extends the interval clusterwise linear regression method for interval-valued data [14] and combines the dynamic clustering algorithm [19] with the nonlinear regression method for interval-valued data [39]. The new iCNLR algorithm considers three different optimization heuristics (BFGS, Simulated Annealing and Conjugate Gradient) and uses a parameter estimation schema that discards any heuristic whose optimization fails to converge.
Simulation studies evaluated two aspects of the proposed method: the parameter estimates provided by the iCNLR algorithm and its predictive capability for unknown observations, according to the best approach among the two assignment methods (KNN and Random) and the ensemble method SR, taking into account a wide range of scenarios. The results showed that KNN was the best method for the iCNLR algorithm in scenarios with disjoint (D) and partially overlapping (I) clusters, while in scenarios with overlapping (U) clusters the SR method provided the best predictive performance.
The proposed method was applied to six real interval-valued data sets and the predictive performance of the iCNLR and iCLR methods was compared. Moreover, a naive method consisting of independently applying the K-means algorithm followed by the fitting of the center and range linear regression method [36] was also considered (the "Two-step Approach"). The results showed that the iCNLR method presented the best prediction performance.
With regard to the predictive performance of the clusterwise regression methods on synthetic and real data sets, the assignment based on the KNN method with the Hausdorff distance presented the best predictive performance. However, the performance difference with respect to the clusterwise methods with Stacked Regression decreases in scenarios with overlapping groups, and in a few cases the performance of the clusterwise methods with Stacked Regression was worse than that of the clusterwise methods with Random assignment.
Finally, the adjustment of nonlinear models in clusterwise regression for interval-valued data represents an improvement in prediction terms when compared to the adjustment of linear clusterwise models. However, the choice of the candidate nonlinear models will depend, in part, on the expert's prior knowledge of the problem at hand.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to thank the associated editor and the anonymous referees for their careful revision, the Conselho
Nacional de Desenvolvimento Científico e Tecnológico - CNPq (303187/2013–1), and the Fundação de Amparo à Ciência e
Tecnologia do Estado de Pernambuco - FACEPE (IBPG-1500–1.03/16), for their partial financial support of this study.

References

[1] A.M. Bagirov, J. Ugon, H.G. Mirzayeva, An algorithm for clusterwise linear regression based on smoothing techniques, Optim. Lett. 9 (2015) 375–390.
[2] A. Blanco-Fernández, N. Corral, G. González-Rodríguez, Estimation of a flexible simple linear model for interval data based on set arithmetic, Comput.
Stat. Data Anal. 55 (9) (2011) 2568–2578.
[3] H.-H. Bock, E. Diday (Eds.), Analysis of Symbolic Data, Exploratory Methods for Extracting Statistical Information from Complex Data, Springer, Berlin,
Heidelberg, 2000.
[4] S. Bougeard, H. Abdi, G. Saporta, N. Niang, Clusterwise pls regression on a stochastic process, Adv. Data Anal. Classif. 12 (2018) 285–313.
[5] R. Boukezzoula, S. Galichet, D. Coquin, From fuzzy regression to gradual regression: interval-based analysis and extensions, Inf. Sci. 441 (2018) 18–40.
[6] L. Breiman, Stacked regressions, Mach. Learn. 24 (1) (1996) 49–64.
[7] P. Brito, A.P. Duarte Silva, Modelling interval data with normal and skew-normal distributions, J. Appl. Stat. 39 (1) (2012) 3–20.


[8] R.A. Carbonneau, G. Caporossi, P. Hansen, Globally optimal clusterwise regression by column generation enhanced with heuristics, sequencing and
ending subset optimization, J. Classif. 31 (2014) 219–241.
[9] M. Cerný, J. Antoch, M. Hladík, On the possibilistic approach to linear regression models involving uncertain, indeterminate or interval data, Inf. Sci. 244
(2013) 26–47.
[10] Y. Chen, D. Miao, Granular regression with a gradient descent method, Inf. Sci. 537 (2020) 246–260.
[11] M.G.C.A. Cimino, B. Lazzerini, F. Marcelloni, W. Pedrycz, Genetic interval neural networks for granular data regression, Inf. Sci. 257 (2014) 313–330.
[12] F.A.T. de Carvalho, P. Brito, H.-H. Bock, Dynamic clustering for interval data based on l2 distance, Comput. Stat. 21 (2) (2006) 231–250.
[13] F.A.T. de Carvalho, R.M.C.R. de Souza, M. Chavent, Y. Lechevallier, Adaptive hausdorff distances and dynamic clustering of symbolic interval data,
Pattern Recogn. Lett. 27 (3) (2006) 167–179.
[14] F.A.T. de Carvalho, G. Saporta, D.N. Queiroz, A clusterwise center and range regression model for interval-valued data, in: Y. Lechevallier, G. Saporta (Eds.), Proceedings of COMPSTAT'2010, Physica-Verlag HD, 2010, pp. 461–468.
[15] R.M.C.R. de Souza, F.A.T. de Carvalho, Clustering of interval data based on city–block distances, Pattern Recogn. Lett. 25 (3) (2004) 353–365.
[16] W.S. DeSarbo, W.L. Cron, A maximum likelihood methodology for clusterwise linear regression, J. Classif. 5 (2) (1988) 249–282.
[17] S. Dias, P. Brito, Off the beaten track: a new linear model for interval data, Eur. J. Oper. Res. 258 (3) (2017) 1118–1130.
[18] E. Diday, M. Noirhomme-Fraiture, Symbolic Data Analysis and the SODAS Software, Wiley and Sons, New Jersey, 2007.
[19] E. Diday, J.C. Simon, Clustering analysis, in: K.S. Fu (Ed.), Digital Pattern Recognition, Springer, 1980, pp. 47–94.
[20] W. Ding, C.-T. Lin, A.W.-C. Liew, I. Triguero, W. Luo, Current trends of granular data mining for biomedical data analysis, Inf. Sci. 510 (2020) 341–343.
[21] X. Dong, Z. Yu, W. Cao, Y. Shi, Q. Ma, A survey on ensemble learning, Front. Comput. Sci. 14 (2020) 241–258.
[22] P. D’Urso, R. Massari, A. Santoro, A class of fuzzy clusterwise regression models, Inf. Sci. 180 (24) (2010) 4737–4762.
[23] R.A.A. Fagundes, R.M.C.R. de Souza, F.J.A. Cysneiros, Interval kernel regression, Neurocomputing 128 (2014) 371–388.
[24] L.M.A. Lima Filho, J.A.A. da Silva, G.M. Cordeiro, R.L.C. Ferreira, Modeling the growth of eucalyptus clones using the Chapman-Richards model with different symmetrical error distributions, Ciência Florestal 22 (4) (2012) 777–785.
[25] L.A. García-Escudero, A. Gordaliza, A. Mayo-Iscar, R. San Martín, Robust clusterwise linear regression through trimming, Comput. Stat. Data Anal. 54
(12) (2010) 3057–3069.
[26] P. Giordani, Lasso-constrained regression analysis for interval-valued data, Adv. Data Anal. Classif. 9 (1) (2015) 5–19.
[27] P. Hao, J. Guo, Constrained center and range joint model for interval-valued symbolic data regression, Comput. Stat. Data Anal. 116 (2017) 106–138.
[28] Q. Hu, J. Mi, D. Chen, Granular computing based machine learning in the era of big data, Inf. Sci. 378 (2017) 242–243.
[29] Y. Jeon, J. Ahn, C. Park, A nonparametric kernel approach to interval-valued data analysis, Technometrics 57 (4) (2015) 566–575.
[30] S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi, Optimization by simulated annealing, Science 220 (4598) (1983) 671–680.
[31] K. Lau, P. Leung, K. Tse, A mathematical programming approach to clusterwise regression model and its extensions, Eur. J. Oper. Res. 116 (3) (1999)
640–652.
[32] R. Di Mari, R. Rocci, S.A. Gattone, Clusterwise linear regression modeling with soft scale constraints, Int. J. Approximate Reason. 91 (2017) 160–178.
[33] M.R. Mashinchi, A. Selamat, S. Ibrahim, H. Fujita, Outlier elimination using granular box regression, Inf. Fusion 27 (2016) 161–169.
[34] A. Mazza, A. Punzo, Mixtures of multivariate contaminated normal regression models, Stat. Papers 61 (2020) 787–822.
[35] E.A. Lima Neto, U.U. Anjos, Regression model for interval-valued variables based on copulas, J. Appl. Stat. 42 (2015) 2010–2029.
[36] E.A. Lima Neto, F.A.T. De Carvalho, Centre and range method for fitting a linear regression model to symbolic interval data, Comput. Stat. Data Anal. 52
(3) (2008) 1500–1515.
[37] E.A. Lima Neto, F.A.T. De Carvalho, Constrained linear regression models for symbolic interval-valued variable, Comput. Stat. Data Anal. 54 (2010) 333–
347.
[38] E.A. Lima Neto, G.M. Cordeiro, F.A.T. De Carvalho, Bivariate symbolic regression models for interval-valued variables, J. Stat. Comput. Simul. 81 (2011)
1727–1744.
[39] E.A. Lima Neto, F.A.T. de Carvalho, Nonlinear regression applied to interval-valued data, Pattern Anal. Appl. 20 (3) (2017) 809–824.
[40] W. Pedrycz, Granular computing for data analytics: a manifesto of human-centric computing, IEEE/CAA J. Autom. Sin. 5 (2018) 1025–1034.
[41] G. Peters, Z. Lacic, Tackling outliers in granular box regression, Inf. Sci. 212 (2012) 44–56.
[42] B.A. Pimentel, R.M.C.R. de Souza, A weighted multivariate fuzzy c-means method in interval-valued scientific production data, Expert Syst. Appl. 41
(2014) 3223–3236.
[43] C. Preda, G. Saporta, Clusterwise pls regression on a stochastic process, Comput. Stat. Data Anal. 49 (2005) 99–108.
[44] C. Shao-Tung, L. Kang-Ping, Y. Miin-Shen, Stepwise possibilistic c-regressions, Inf. Sci. 334–335 (2016) 307–322.
[45] H. Späth, Algorithm 39 clusterwise linear regression, Computing 22 (4) (1979) 367–373.
[46] S.-F. Su, W. Pedrycz, T.-P. Hong, F.A.T. de Carvalho, Guest editorial special issue on granular/symbolic data processing, IEEE Trans. Cybern. 46 (2016)
342–343.
[47] Z.G. Su, P.H. Wang, Y.G. Li, Z.K. Zhou, Parameter estimation from interval-valued data using the expectation-maximization algorithm, J. Appl. Stat. 85
(2015) 320–338.
[48] D. Vicari, M. Vichi, Multivariate linear regression for heterogeneous data, J. Appl. Stat. 40 (6) (2013) 1209–1230.
[49] Q. Wu, W. Yao, Mixtures of quantile regressions, Comput. Stat. Data Anal. 93 (2016) 162–176.
[50] H. Zuo, G. Zhang, W. Pedrycz, V. Behbood, J. Lu, Granular fuzzy regression domain adaptation in Takagi-Sugeno fuzzy models, IEEE Trans. Fuzzy Syst. 26
(2018) 847–858.
