One-Class SVM in Multi-Task Learning
Xiyan He, Gilles Mourot, Didier Maquin & José Ragot
Centre de Recherche en Automatique de Nancy, CNRS UMR 7039 - Université Henri Poincaré, Nancy-I
2, avenue de la forêt de Haye, 54500 Vandœuvre-lès-Nancy, France
Pierre Beauseroy & André Smolarz
Institut Charles Delaunay, STMR CNRS UMR 6279 - Université de Technologie de Troyes
12, Rue Marie Curie, BP 2060, F-10010 Troyes, France
ABSTRACT: Multi-Task Learning (MTL) has become an active research topic in recent years. While most
machine learning methods focus on learning each task independently, multi-task learning aims to improve
the generalization performance by training multiple related tasks simultaneously. This paper presents a new
approach to multi-task learning based on one-class Support Vector Machine (one-class SVM). In the proposed
approach, we first make the assumption that the model parameter values of different tasks are close to a certain
mean value. Then, a number of one-class SVMs, one for each task, are learned simultaneously. Our multi-task
approach is easy to implement since it only requires a simple modification of the optimization problem in the
single one-class SVM. Experimental results demonstrate the effectiveness of the proposed approach.
1 INTRODUCTION
Classical machine learning technologies have
achieved much success in the learning of a single task
at a time. However, in many practical applications we need to learn a number of related tasks or to rebuild a model from new data, for example in the fault detection and diagnosis of a system that contains a set of a priori identical pieces of equipment working under different conditions. Here a piece of equipment may be a simple machine (a pump, a motor, ...), a system (a car, an airplane, ...), or even an industrial plant (a nuclear power plant, ...). A car hire company, with a fleet of vehicles serving a set of customers, is another example. In industry, it is common to encounter a number of a priori identical plants, for instance in the building or maintenance of a fleet of nuclear power plants or of a fleet of their components. In such cases, the learning of the behavior of each piece of equipment can be considered as a single task, and it is desirable to transfer or leverage useful information between related tasks (Pan and Yang 2010). Therefore, Multi-Task
Learning (MTL) has become an active research topic
in recent years (Bi et al. 2008, Zheng et al. 2008, Gu
and Zhou 2009, Birlutiu et al. 2010).
While most machine learning methods focus on learning each task independently, multi-task learning aims to improve generalization performance
by training multiple related tasks simultaneously. The
main idea is to share what is learned from different
tasks (e.g., a common representation space or some
model parameters that are close to each other), while
tasks are trained in parallel (Caruana 1997). Previous
works have shown empirically as well as theoretically
that the multi-task learning framework can lead to
more intelligent learning models with a better performance (Caruana 1997, Heskes 2000, Ben-David and
Schuller 2003, Evgeniou et al. 2005, Ben-David and
Borbely 2008).
In recent years, Support Vector Machines (SVM)
(Boser et al. 1992, Vapnik 1995) have been successfully used for multi-task learning (Evgeniou and Pontil 2004, Jebara 2004, Evgeniou et al. 2005, Widmer
et al. 2010, Yang et al. 2010). The SVM method was
initially developed for the classification of data from
two different classes by a hyperplane that has the
largest distance to the nearest training data points of
any class (maximum margin). When the datasets are
not linearly separable, the “kernel trick” is used. The
basic idea is to map the original data to a higher dimensional feature space and then solve a linear problem in that space. The good properties of kernel functions make support vector machines well-suited for
multi-task learning.
In this paper, we present a new approach to multi-task learning based on one-class Support Vector Machines (one-class SVM). The one-class SVM proposed by Schölkopf et al. (2001) is a typical method
for the problem of novelty or outlier detection, also
known as the one-class classification problem due to
the fact that we do not have sufficient knowledge
about the outlier class. For example, in the application of fault detection and diagnosis, it is very difficult to collect samples corresponding to all the abnormal behaviors of the system. The main advantage
of one-class SVM over other one-class classification
methods (Tarassenko et al. 1995, Ritter and Gallegos
1997, Eskin 2000, Singh and Markou 2004) is that
it focuses only on the estimation of a bounded area
for samples from the target class rather than on the
estimation of the probability density. The bounded
area estimation is achieved by separating the target
samples (in a higher-dimensional feature space for
non-linearly separable cases) from the origin by a
maximum-margin hyperplane which is as far away
from the origin as possible.
Recently, Yang et al. (2010) proposed to take advantage of multi-task learning when conducting
one-class classification. The basic idea is to constrain
the solutions of related tasks close to each other. However, they solve the problem via conic programming
(Kemp et al. 2008), which is complicated. In this
paper, inspired by the work of Evgeniou and Pontil
(2004), we introduce a very simple multi-task learning framework based on the one-class SVM method,
a widely used tool for single task learning. In the
proposed method, we first make the same assumption as in (Evgeniou and Pontil 2004), that is, the
model parameter values of different tasks are close
to a certain mean value. This assumption is reasonable because, when the tasks are similar to each other, their model parameters are usually close to one another. Then, a number of one-class
SVMs, one for each task, are learned simultaneously.
Our multi-task approach is easy to implement since it
only requires a simple modification of the optimization problem in the single one-class SVM. Experimental results demonstrate the effectiveness of the
proposed approach.
This paper is organized as follows. Section 2 briefly describes the formulation of the one-class SVM algorithm and the properties of kernel functions. The proposed multi-task learning
method based on one-class SVM is then outlined in
Section 3. Section 4 presents the experimental results.
In Section 5, we conclude this paper with some final
remarks and future work propositions.
2 ONE-CLASS SVM AND PROPERTIES OF KERNELS

2.1 One-class SVM
The one-class SVM proposed by Schölkopf et al.
(2001) is a promising method for the problem of one-class classification, which aims at detecting samples that do not resemble the majority of the dataset. It employs two ideas of the original support vector machine algorithm to ensure a good generalisation: the maximisation of the margin and the mapping of the data to a higher dimensional feature space induced by a kernel function. The main difference between the one-class SVM and the original SVM is that in one-class
SVM the only given information is the normal samples (also called positive samples) of the same single class whereas in the original SVM information on
both normal samples and outlier samples (also called
negative samples) is given. In essence, the one-class
SVM estimates the boundary region that comprises
most of the training samples. If a new test sample falls
within this boundary, it is classified as belonging to the normal class; otherwise, it is recognised as an outlier.
Suppose that Am = {xi }, i = 1, . . . , m is a set of
m training samples of a single class. xi is a sample
in the space X ⊆ Rd of dimension d. Also suppose
that φ is a non-linear transformation. The one-class
SVM is predicated on the assumption that the origin
in the transformed feature space belongs to the negative or outlier class. The training stage consists in
first projecting the training samples to a higher dimensional feature space and then separating most of
the samples from the origin by a maximum-margin
hyperplane which is as far away from the origin as
possible. In order to determine the maximum-margin
hyperplane, we need to deduce its normal vector w
and a threshold ρ by solving the following optimization problem:
\[
\min_{w, \xi, \rho} \; \frac{1}{2}\|w\|^2 + \frac{1}{\nu m}\sum_{i=1}^{m}\xi_i - \rho
\tag{1}
\]

subject to: \( \langle w, \phi(x_i) \rangle \ge \rho - \xi_i, \quad \xi_i \ge 0 \)
The ξi are slack variables, introduced to relax the constraints for certain training samples. The optimization thus aims at finding the best trade-off between the maximization of the margin and the minimization of the average of the slack variables. The parameter ν ∈ (0, 1] is specific to the one-class SVM: it is an upper bound on the fraction of outliers among the training samples and a lower bound on the fraction of support vectors.
Due to the high dimensionality of the normal vector
w, the primal problem is solved by its Lagrange dual
problem:

\[
\min_{\alpha} \; \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i \alpha_j \langle \phi(x_i), \phi(x_j) \rangle
\tag{2}
\]

subject to: \( 0 \le \alpha_i \le \frac{1}{\nu m}, \quad \sum_{i=1}^{m}\alpha_i = 1 \)
where the αi are the Lagrange multipliers. It is worth noting that the mapping φ appears only through inner products. We therefore do not need to compute the non-linear mapping explicitly: it suffices to define a kernel function that fulfills Mercer's conditions (Vapnik 1995):

\[
\langle \phi(x_i), \phi(x_j) \rangle = k(x_i, x_j).
\tag{3}
\]

As an example, the Gaussian kernel \( k_\sigma(x_i, x_j) = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}} \) is widely used in the community. By solving the dual problem with this kernel trick, the final decision is given by:
\[
f(x) = \operatorname{sign}\left( \sum_{i=1}^{m} \alpha_i \, k(x_i, x) - \rho \right)
\tag{4}
\]
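For readers who want to experiment with this single-task formulation, the following sketch (assuming Python with NumPy and scikit-learn, which are not part of the paper) fits a one-class SVM with a Gaussian kernel and applies the decision rule of equation (4); the data and the values of (ν, σ) are arbitrary illustrations.

```python
# Minimal single-task one-class SVM sketch (assumes numpy and scikit-learn;
# data and the (nu, sigma) values are illustrative only).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(200, 4))                # positive (normal) samples only
X_test = np.vstack([rng.normal(0.0, 1.0, size=(100, 4)),     # more normal samples
                    rng.uniform(-6.0, 6.0, size=(100, 4))])  # candidate outliers

nu, sigma = 0.1, 1.0
clf = OneClassSVM(kernel="rbf", nu=nu, gamma=1.0 / (2.0 * sigma ** 2))
clf.fit(X_train)

# sign(sum_i alpha_i k(x_i, x) - rho), cf. equation (4):
# +1 -> accepted as the normal class, -1 -> flagged as an outlier.
y_pred = clf.predict(X_test)
print("fraction flagged as outliers:", np.mean(y_pred == -1))
```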
2.2 Properties of kernels
In order to exploit the kernel trick, we need to construct valid kernel functions. A necessary and sufficient condition for a function to be a valid kernel is
defined as follows (Schölkopf and Smola 2001):
Definition 2.1 Let X be a nonempty set. A function k
on X × X which for all m ∈ N and all x1 , . . . , xm ∈ X
gives rise to a positive definite Gram matrix K, with
elements:
\[
K_{ij} := k(x_i, x_j)
\tag{5}
\]
is called a positive definite kernel.
One popular way to construct new kernels is to
build them based on simpler kernels. In this section,
we briefly gather some results of the properties of the
set of admissible kernels that are useful for designing
new kernels. For a detailed description concerning the
design of kernel functions, interested readers are referred to (Schölkopf and Smola 2001).
Proposition 2.1 If k1 and k2 are kernels, and
α1 , α2 ≥ 0, then α1 k1 + α2 k2 is a kernel.
Proposition 2.2 If k1 and k2 are kernels defined respectively on X1 × X1 and X2 × X2 , then their tensor
product
k1 ⊗ k2 (x1 , x2 , x′1 , x′2 ) = k1 (x1 , x′1 )k2 (x2 , x′2 )
is a kernel on (X1 × X2) × (X1 × X2). Here x1, x′1 ∈ X1 and x2, x′2 ∈ X2.
With these properties, we can now construct more complex kernel functions that are appropriate to our specific applications in multi-task learning.
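As a small numerical illustration of Propositions 2.1 and 2.2 (a sketch assuming NumPy, not part of the paper), one can check that a positive combination of two Gram matrices, and their element-wise product (the tensor-product kernel evaluated on paired inputs), remain positive semi-definite:

```python
# Sanity check of Propositions 2.1 and 2.2 on finite Gram matrices (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))

def gaussian_gram(X, sigma):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

K1 = gaussian_gram(X, sigma=1.0)
K2 = X @ X.T                      # linear kernel Gram matrix

K_sum = 0.7 * K1 + 2.0 * K2       # Proposition 2.1: positive combination of kernels
K_prod = K1 * K2                  # Proposition 2.2: element-wise (tensor) product on paired inputs

for name, K in [("sum", K_sum), ("product", K_prod)]:
    eigs = np.linalg.eigvalsh(K)
    print(name, "min eigenvalue:", eigs.min())   # non-negative up to numerical noise
```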
3 THE PROPOSED METHOD
In this section, we introduce the one-class SVM
method for the purpose of multi-task learning. In
the context of multi-task learning, we have T learning tasks on the same space X , with X ⊆ Rd . For
each task we have m samples {x1t , x2t , . . . , xmt }. The
objective is to learn a decision function (a hyperplane) ft(x) = sign(⟨wt, φ(x)⟩ − ρt) for each task t.
Inspired by the method proposed by Evgeniou and
Pontil (2004), we make the assumption that when the
tasks are related to each other, the normal vector wt
can be represented by the sum of a mean vector w0
and a specific vector vt corresponding to each task:
\[
w_t = w_0 + v_t
\tag{6}
\]
3.1 Primal problem

Following the above assumption, we can generalize the one-class SVM method to the problem of multi-task learning. The primal optimization problem can be written as follows:

\[
\min_{w_0, v_t, \xi_{it}, \rho_t} \; \frac{1}{2}\sum_{t=1}^{T}\|v_t\|^2 + \frac{\mu}{2}\|w_0\|^2 + \sum_{t=1}^{T}\left( \frac{1}{\nu_t m}\sum_{i=1}^{m}\xi_{it} - \rho_t \right)
\tag{7}
\]
for all i ∈ {1, 2, . . . , m} and t ∈ {1, 2, . . . , T}, subject to:

\[
\langle (w_0 + v_t), \phi(x_{it}) \rangle \ge \rho_t - \xi_{it}
\tag{8}
\]
\[
\xi_{it} \ge 0
\tag{9}
\]
where the ξit are the slack variables associated with each sample and νt ∈ (0, 1] is the one-class SVM parameter of each task. In order to control the similarity between tasks, we introduce a positive regularization parameter µ into the primal optimisation problem. In particular, a large value of µ tends to force the system to learn the T tasks independently, whereas a small value of µ leads the system to learn a common model for all tasks. As in the case of a single one-class SVM, the Lagrangian is formed as:
\[
L(w_0, v_t, \xi_{it}, \rho_t, \alpha_{it}, \beta_{it}) = \frac{1}{2}\sum_{t=1}^{T}\|v_t\|^2 + \frac{\mu}{2}\|w_0\|^2 + \sum_{t=1}^{T}\left( \frac{1}{\nu_t m}\sum_{i=1}^{m}\xi_{it} - \rho_t \right) - \sum_{t=1}^{T}\sum_{i=1}^{m}\alpha_{it}\left[ \langle (w_0 + v_t), \phi(x_{it}) \rangle - \rho_t + \xi_{it} \right] - \sum_{t=1}^{T}\sum_{i=1}^{m}\beta_{it}\,\xi_{it}
\tag{10}
\]
where αit , βit ≥ 0 are the Lagrange multipliers. We
set the partial derivatives of the Lagrangian to zero
and obtain the following equations:
\[
\begin{aligned}
&(a)\quad w_0 = \frac{1}{\mu}\sum_{t=1}^{T}\sum_{i=1}^{m}\alpha_{it}\,\phi(x_{it}) \\
&(b)\quad v_t = \sum_{i=1}^{m}\alpha_{it}\,\phi(x_{it}) \\
&(c)\quad \alpha_{it} = \frac{1}{\nu_t m} - \beta_{it} \\
&(d)\quad \sum_{i=1}^{m}\alpha_{it} = 1
\end{aligned}
\tag{11}
\]
By combining the equations (6), (11)(a) and (11)(b),
we have:
\[
w_0 = \frac{1}{\mu}\sum_{t=1}^{T} v_t
\tag{12}
\]
\[
w_0 = \frac{1}{\mu + T}\sum_{t=1}^{T} w_t
\tag{13}
\]
With these relationships, we may replace the vectors
vt and w0 by wt in the primal optimisation function
(7), which leads to an equivalent function:
\[
\min_{w_t, \xi_{it}, \rho_t} \; \frac{\lambda_1}{2}\sum_{t=1}^{T}\|w_t\|^2 + \frac{\lambda_2}{2}\sum_{t=1}^{T}\left\| w_t - \frac{1}{T}\sum_{r=1}^{T} w_r \right\|^2 + \sum_{t=1}^{T}\left( \frac{1}{\nu_t m}\sum_{i=1}^{m}\xi_{it} \right) - \sum_{t=1}^{T}\rho_t
\tag{14}
\]

with

\[
\lambda_1 = \frac{\mu}{\mu + T} \quad \text{and} \quad \lambda_2 = \frac{T}{\mu + T}
\tag{15}
\]

We can see that the objective of the primal optimisation problem (7) in the multi-task learning framework is thus to find a trade-off between the maximisation of the margin for each one-class SVM model and the closeness of each one-class SVM model to the average model.
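To see this trade-off concretely, the short sketch below (assuming the expressions for λ1 and λ2 given in equation (15) as reconstructed above; Python is not used in the paper) prints how the two weights evolve with µ for T = 4 tasks:

```python
# Behaviour of the weights of equation (15) as mu varies (illustrative sketch).
# Large mu  -> lambda1 close to 1: each task is essentially learned independently.
# Small mu  -> lambda2 close to 1: every task is pulled towards the average model.
T = 4
for mu in [0.01, 0.1, 1.0, 10.0, 100.0]:
    lam1 = mu / (mu + T)
    lam2 = T / (mu + T)
    print(f"mu = {mu:>6}:  lambda1 = {lam1:.3f}, lambda2 = {lam2:.3f}")
```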
3.2 Dual problem
The primal optimisation problem (7) can be solved
through its Lagrangian dual problem expressed by:
\[
\max_{\alpha_{it}} \; -\frac{1}{2}\sum_{t=1}^{T}\sum_{r=1}^{T}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_{it}\,\alpha_{jr}\left( \frac{1}{\mu} + \delta_{rt} \right)\langle \phi(x_{it}), \phi(x_{jr}) \rangle
\tag{16}
\]

subject to:

\[
0 \le \alpha_{it} \le \frac{1}{\nu_t m}, \qquad \sum_{i=1}^{m}\alpha_{it} = 1
\tag{17}
\]

where δrt is the Kronecker delta:

\[
\delta_{rt} =
\begin{cases}
1, & \text{if } r = t \\
0, & \text{if } r \neq t
\end{cases}
\tag{18}
\]
We can see that the main difference between this dual problem (16) and the one of single one-class SVM learning (2) is the term (1/µ + δrt) introduced by the multi-task learning framework. Suppose that we define a kernel function as in equation (3):

\[
k(x_{it}, x_{jr}) = \langle \phi(x_{it}), \phi(x_{jr}) \rangle
\tag{19}
\]
where r and t are the task indices associated with each sample. Taking advantage of the kernel properties presented in Section 2.2, we know that the product of two kernels δrt k(xit, xjr) is a valid kernel (Proposition 2.2). Further, the following function:

\[
G_{rt}(x_{it}, x_{jr}) = \left( \frac{1}{\mu} + \delta_{rt} \right) k(x_{it}, x_{jr}) = \frac{1}{\mu}\,k(x_{it}, x_{jr}) + \delta_{rt}\,k(x_{it}, x_{jr})
\tag{20}
\]
is a linear combination of two valid kernels with positive coefficients (1/µ and 1), and therefore is also a valid kernel (Proposition 2.1). We can thus solve the multi-task learning optimisation problem (7) through a single one-class SVM problem by using the new kernel function Grt(xit, xjr). The decision function for each task is given by:
\[
f_t(x) = \operatorname{sign}\left( \sum_{r=1}^{T}\sum_{i=1}^{m} \alpha_{ir}\, G_{rt}(x_{ir}, x) - \rho_t \right)
\tag{21}
\]
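As an illustration of how equations (20) and (21) can be used in practice, here is a hedged sketch (assuming NumPy and scikit-learn, which the paper does not use): the multi-task Gram matrix G is precomputed and passed to a standard one-class SVM solver. Note that an off-the-shelf solver enforces a single global constraint on the α's rather than the per-task constraints of equation (17), so this is only an approximation of the formulation above; the data, task labels and parameter values are arbitrary.

```python
# Sketch of MTL-OSVM via the kernel G_rt(x_it, x_jr) = (1/mu + delta_rt) k(x_it, x_jr),
# used as a precomputed Gram matrix in a single one-class SVM (cf. equations (20)-(21)).
import numpy as np
from sklearn.svm import OneClassSVM

def gaussian_kernel(A, B, sigma):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def mtl_gram(XA, tA, XB, tB, mu, sigma):
    """Multi-task Gram matrix between samples XA (task labels tA) and XB (task labels tB)."""
    K = gaussian_kernel(XA, XB, sigma)
    delta = (tA[:, None] == tB[None, :]).astype(float)   # Kronecker delta on task labels
    return (1.0 / mu + delta) * K

# Toy data: T = 2 related tasks with m = 100 positive samples each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 4)),
               rng.normal(0.2, 1.0, size=(100, 4))])
tasks = np.repeat([0, 1], 100)

mu, nu, sigma = 1.0, 0.1, 1.0
G_train = mtl_gram(X, tasks, X, tasks, mu, sigma)
clf = OneClassSVM(kernel="precomputed", nu=nu).fit(G_train)

# Decision for new samples assigned to task 0, cf. equation (21)
# (a single shared rho here, a simplification of the per-task rho_t).
X_new = rng.normal(0.0, 1.0, size=(5, 4))
G_new = mtl_gram(X_new, np.zeros(5, dtype=int), X, tasks, mu, sigma)
print(clf.predict(G_new))   # +1: normal, -1: outlier
```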
4 EXPERIMENTAL RESULTS
This section presents the experimental results obtained in our analysis. In order to evaluate the effectiveness of the proposed multi-task learning framework, we compare our one-class SVM based multi-task learning method (denoted MTL-OSVM) with two other methods: the traditional learning method that learns the T tasks independently, each with a one-class SVM (denoted T-OSVM), and the method that uses a single one-class SVM for all tasks under the assumption that all the related tasks can be considered as one big task (denoted 1-OSVM).
In our experiments, the kernel of the one-class SVM used for T-OSVM and 1-OSVM is a Gaussian kernel \( k_\sigma(x_{it}, x_{jr}) = e^{-\frac{\|x_{it} - x_{jr}\|^2}{2\sigma^2}} \). For the proposed multi-task learning method MTL-OSVM, the
new kernel is thus constructed based on the Gaussian
kernel as presented in equation (20). The optimum
values for the two parameters ν and σ of the one-class
SVM are determined through cross validation. For the
sake of simplicity, we have used a common combination of values (ν, σ) for all related tasks. In order to ensure the reliability of the performance evaluation, all the results have been averaged over 20 trials, each with a random draw of the training set. As the compared approaches are all one-class classification methods, the statistics of both false positive and false negative error rates are reported.
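For completeness, a sketch of how the reported error rates can be computed for any of the three methods is given below (assuming NumPy and a model with a +1/−1 predict() as in the earlier sketches; this evaluation code is an assumption, not taken from the paper):

```python
# False positive / false negative / total error rates for a one-class classifier
# (illustrative sketch; positive = normal class, negative = outlier class).
import numpy as np

def error_rates(model, X_pos, X_neg):
    fn = np.mean(model.predict(X_pos) == -1)   # normal samples rejected as outliers
    fp = np.mean(model.predict(X_neg) == +1)   # outliers accepted as normal
    total = 0.5 * (fp + fn)                    # equal numbers of positive and negative test samples
    return fp, fn, total

# Averaging over repeated random trials could then look like (hypothetical helpers):
# rates = [error_rates(fit_model(draw_training_set()), X_pos_test, X_neg_test) for _ in range(20)]
# mean_fp, mean_fn, mean_total = np.mean(rates, axis=0)
```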
4.1 Experiment on nonlinear toy data
We first tested the proposed method on four (T = 4) related simple nonlinear classification tasks. The datasets are created according to the following steps. For the first task, each sample is composed of d = 4 variables, of which the first three are uniformly distributed. The fourth variable is set by the relation:

\[
x^{(4)} = x^{(1)} + 2x^{(2)} + (x^{(3)})^2
\]
The datasets for the other three tasks are then created by adding Gaussian white noise with different amplitudes to the dataset of the first task. The noise levels are referred to respectively as low noise (Task 2, with an amplitude of about 1% of that of the first dataset), medium noise (Task 3, about 8%) and high noise (Task 4, about 15%). In order to evaluate the false positive error rates, we have generated a
set of negative samples that are composed of d = 4
uniformly distributed variables. Therefore, the training set of each task contains only positive samples
(m = 200) whereas in the test procedure we use the
test set of size 400 that contains both positive and negative samples (200 samples for each class). The obtained optimum parameter values of one-class SVM
are (ν, σ) = (0.01, 0.5) for this experiment.
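The toy datasets described above can be generated along the following lines (a sketch assuming NumPy; the exact ranges of the uniform variables and the precise noise law are not specified in the text and are assumptions here):

```python
# Sketch of the nonlinear toy data: 4 related tasks, d = 4, m = 200 positive samples each.
import numpy as np

def make_task1(m, rng):
    x = rng.uniform(0.0, 1.0, size=(m, 4))               # first three variables uniform (range assumed)
    x[:, 3] = x[:, 0] + 2.0 * x[:, 1] + x[:, 2] ** 2     # x(4) = x(1) + 2 x(2) + (x(3))^2
    return x

def make_noisy_task(base, rel_amplitude, rng):
    # Gaussian white noise with an amplitude relative to the range of the first dataset (assumption).
    scale = rel_amplitude * (base.max() - base.min())
    return base + rng.normal(0.0, scale, size=base.shape)

rng = np.random.default_rng(0)
task1 = make_task1(200, rng)
tasks = [task1,
         make_noisy_task(task1, 0.01, rng),   # Task 2: low noise
         make_noisy_task(task1, 0.08, rng),   # Task 3: medium noise
         make_noisy_task(task1, 0.15, rng)]   # Task 4: high noise
negatives = rng.uniform(0.0, 1.0, size=(200, 4))   # negative test samples, uniform (range assumed)
```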
Figure 1: The variation of the average (a) false positive, (b) false negative and (c) total error rates for each task (nonlinear toy data) as a function of the regularization parameter µ.

Figure 1 illustrates the variation of the average false positive, false negative and total error rates of our multi-task learning method MTL-OSVM for each task as a function of the regularization parameter µ. The error rates of T-OSVM and 1-OSVM are also presented. We can see that for a very small value of µ, the performance of MTL-OSVM coincides with that of 1-OSVM, as if all the tasks were considered as the same task. When the value of µ is very large, the performance of MTL-OSVM is in accordance with that of the traditional independent learning method T-OSVM. As the value of µ increases, the behaviors of the first three tasks are similar: the false positive error rate of the MTL-OSVM method tends to decrease whereas its false negative error rate increases. However, for the fourth task, the false positive (false negative) error rate first increases (decreases) and then decreases (increases) after reaching its maximum (minimum) value. This behavior may be due to the very high noise added to the original dataset. Overall, with a good choice of µ, the multi-task framework achieves a better performance in terms of the total error rate than the traditional learning methods.
4.2 Experiment on textured image data
We have tested the proposed method on several textured gray-scale images containing artificial textures generated using Markov chain models (Smolarz 1997). Given the nature of a texture, we first suppose that the useful information for texture characterization is contained in an isotropic neighbourhood of each pixel. In our experiments we therefore use the gray levels of a local d = 5 × 5 square window centered on each pixel as its feature vector. As in the previous experiment in Section 4.1, four related tasks are created. The dataset for Task 1 contains samples of size d = 5 × 5 = 25 that are selected randomly from the original single-texture source image. The samples for the other three tasks are selected from textured images of the same source as Task 1, but contaminated respectively by low noise (Task 2), medium noise (Task 3) and high noise (Task 4). Negative samples used in the test set are generated from a different single-texture source image. Figure 2 illustrates the single-texture source images used for generating the datasets. In each trial, the training set of each task contains m = 200 positive samples and the test set is composed of 200 positive and 200 negative samples. The common parameter values of one-class SVM used in this experiment are (ν, σ) = (0.01, 300).
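The 5 × 5 patch features can be extracted as in the following sketch (assuming NumPy; the actual Markov-chain texture images of the paper are replaced by a random placeholder image):

```python
# Sketch of the 5x5 window feature extraction for the texture experiment.
import numpy as np

def extract_patches(image, m, window=5, rng=None):
    """Sample m pixels at random and return the gray levels of the window x window
    neighbourhood of each pixel, flattened into a feature vector of size window**2."""
    rng = rng if rng is not None else np.random.default_rng()
    half = window // 2
    rows = rng.integers(half, image.shape[0] - half, size=m)
    cols = rng.integers(half, image.shape[1] - half, size=m)
    feats = [image[r - half:r + half + 1, c - half:c + half + 1].ravel()
             for r, c in zip(rows, cols)]
    return np.asarray(feats)                 # shape (m, 25) for a 5x5 window

rng = np.random.default_rng(0)
texture = rng.integers(0, 256, size=(256, 256)).astype(float)   # placeholder for a texture image
X_task1 = extract_patches(texture, m=200, rng=rng)
```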
Figure 2: Single texture source images used for generating the datasets. (a) Original texture image for Task 1. (b) Texture image of (a) with low noise, for Task 2. (c) Texture image of (a) with medium noise, for Task 3. (d) Texture image of (a) with high noise, for Task 4. (e) Original texture image for generating negative samples.

Table 1: Error rates (%) of the different methods for all tasks on texture data. FP: false positive error rate, FN: false negative error rate, Total: total error rate.

            FP             FN             Total          µ
Task 1
  T-OSVM    3.62 ± 1.18    27.0 ± 3.0     15.3 ± 1.4     −
  1-OSVM    27.1 ± 2.6     2.70 ± 1.59    14.9 ± 1.5     −
  MTL-OSVM  14.4 ± 2.2     9.52 ± 3.19    12.0 ± 1.8     0.1
Task 2
  T-OSVM    4.27 ± 1.11    28.2 ± 3.6     16.3 ± 1.6     −
  1-OSVM    27.1 ± 2.6     3.27 ± 2.21    15.2 ± 1.8     −
  MTL-OSVM  14.6 ± 2.2     10.3 ± 3.7     12.4 ± 2.0     0.1
Task 3
  T-OSVM    7.05 ± 1.41    28.6 ± 3.9     17.8 ± 2.0     −
  1-OSVM    27.1 ± 2.6     6.62 ± 3.73    16.8 ± 2.2     −
  MTL-OSVM  15.7 ± 2.3     14.4 ± 4.1     15.0 ± 2.2     0.1
Task 4
  T-OSVM    27.4 ± 3.0     34.4 ± 8.7     30.9 ± 4.5     −
  1-OSVM    27.1 ± 2.6     34.0 ± 8.7     30.5 ± 4.5     −
  MTL-OSVM  29.4 ± 2.6     29.0 ± 9.2     29.2 ± 4.5     0.5
Table 1 shows the statistics of the obtained error rates. The corresponding optimum value of µ for MTL-OSVM, which minimises the total error rate, is also presented. According to this table, we can see that the individual learning method T-OSVM has the lowest false positive error rate but a higher false negative error rate. On the contrary, learning a single one-class SVM for all tasks (1-OSVM) achieves the lowest false negative error rate at the expense of a higher false positive error rate. The proposed multi-task learning method MTL-OSVM reaches an overall better performance by finding a trade-off between the false positive error rate and the false negative error rate.

The average results of this experiment are depicted in Figure 3. As in the previous experiment on the nonlinear toy data, we observe the same behavior of the error rates as the regularization parameter µ varies. The proposed method MTL-OSVM outperforms the other two methods (T-OSVM and 1-OSVM) for all the tasks. It is worth noting that the optimum value of µ differs from task to task. The setting of this parameter is thus very important in order to ensure a good performance. In our experiment we use a validation set to find the optimum value of µ.

Figure 3: The variation of the average (a) false positive, (b) false negative and (c) total error rates for each task (texture data) as a function of the regularization parameter µ.
4.3 Discussion
It is worth noting that only academic experiments are presented in this section: nonlinear toy data with a low-dimensional feature space and textured image data with a high-dimensional feature space. We address the problem of modeling the normal data with the help of the one-class SVM method, which is usually considered an essential step for fault detection and diagnosis. In each experiment, related tasks are created in order to mimic the application of modeling the normal process behavior of a fleet of plants that are a priori identical but working under different conditions (thus having different noise on the measurements). Here we consider the modeling of the behavior of each plant as a single task, and the classification of normal data and anomalies is then performed by the constructed model. The proposed methodology shows that learning multiple related tasks simultaneously can be beneficial to the performance of each constructed model.
One straightforward extension of this work is to apply the proposed multi-task learning methodology to fault detection and diagnosis with real industrial data, such as the modeling of the behavior of a fleet of reactor coolant pumps in a nuclear cooling system.
5 CONCLUSION
In this paper we introduced the one-class SVM in the framework of multi-task learning, under the assumption that the model parameter values of related tasks are close to a certain mean value. A regularization parameter was used in the optimisation process to control the trade-off between the maximisation of the margin for each one-class SVM model and the closeness of each one-class SVM model to the average model. The design of new kernels in the multi-task framework, based on kernel properties, significantly facilitates the implementation of our method. Experimental validation was performed on artificially generated related tasks of one-class classification. The results show that learning multiple related tasks simultaneously can achieve a better performance than learning each task independently.
In our method we have used a common setting of the one-class SVM parameter and kernel parameter values for all tasks. One direction for future work is thus to use different parameter values for different tasks. The properties of kernels also open a wide range of further developments in constructing new kernels for multi-task learning.
6 ACKNOWLEDGEMENT
The authors would like to thank GIS 3SGS for its financial support.
REFERENCES
Ben-David, S. & R. S. Borbely (2008). A notion of task relatedness yielding provable multiple-task learning guarantees.
Machine Learning 73, 273–287.
Ben-David, S. & R. Schuller (2003). Exploiting task relatedness
for multiple task learning. In Proceedings of the Sixteenth
Annual Conference on Computational Learning Theory and
the Seventh Kernel Workshop, Washington, DC, USA, pp.
567–580.
Bi, J., T. Xiong, S. Yu, M. Dundar, & R. B. Rao (2008). An
improved multi-task learning approach with applications in
medical diagnosis. In Proceedings of the 2008 European
Conference on Machine Learning and Knowledge Discovery
in Databases - Part I, pp. 117–132.
Birlutiu, A., P. Groot, & T. Heskes (2010). Multi-task preference
learning with an application to hearing aid personalization.
Neurocomputing 73, 1177–1185.
Boser, B. E., I. M. Guyon, & V. N. Vapnik (1992). A training algorithm for optimal margin classifiers. In Proceedings of the
Fifth Annual Workshop on Computational Learning Theory,
Pittsburgh, Pennsylvania, United States, pp. 144–152.
Caruana, R. (1997). Multitask learning. Machine Learning 28,
41–75.
Eskin, E. (2000). Anomaly detection over noisy data using
learned probability distributions. In Proceedings of the Seventeenth International Conference on Machine Learning,
San Francisco, CA, USA, pp. 255–262.
Evgeniou, T., C. A. Micchelli, & M. Pontil (2005). Learning multiple tasks with kernel methods. Journal of Machine
Learning Research 6, 615–637.
Evgeniou, T. & M. Pontil (2004). Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, pp. 109–117.
Gu, Q. & J. Zhou (2009). Learning the shared subspace for
multi-task clustering and transductive transfer classification.
In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, Miami, Florida, USA, pp. 159–168.
Heskes, T. (2000). Empirical bayes for learning to learn. In Proceedings of the Seventeenth International Conference on Machine Learning, San Francisco, CA, USA, pp. 367–374.
Jebara, T. (2004). Multi-task feature and kernel selection for
svms. In Proceedings of the Twenty-first International Conference on Machine Learning, Banff, Alberta, Canada.
Kemp, C., N. Goodman, & J. Tenenbaum (2008). Learning and
using relational theories. In Advances in Neural Information
Processing Systems 20, pp. 753–760. Cambridge, MA.
Pan, S. J. & Q. Yang (2010). A survey on transfer learning. IEEE
Transactions on Knowledge and Data Engineering 22(10),
1345–1359.
Ritter, G. & M. T. Gallegos (1997). Outliers in statistical pattern recognition and an application to automatic chromosome
classification. Pattern Recognition Letters 18, 525–539.
Schölkopf, B., J. C. Platt, J. Shawe-Taylor, A. J. Smola, &
R. C. Williamson (2001). Estimating the support of a highdimensional distribution. Neural Computation 13(7), 1443–
1471.
Schölkopf, B. & A. J. Smola (2001). Learning with Kernels:
Support Vector Machines, Regularization, Optimization, and
Beyond. Cambridge, MA, USA: MIT Press.
Singh, S. & M. Markou (2004). An approach to novelty detection
applied to the classification of image regions. IEEE Transactions on Knowledge and Data Engineering 16(4), 396–407.
Smolarz, A. (1997). Etude qualitative du modèle auto-binomial
appliqué à la synthèse de texture. In XXIXèmes Journées de
Statistique, Carcassonne, France, pp. 712–715.
Tarassenko, L., P. Hayton, N. Cerneaz, & M. Brady (1995). Novelty detection for the identification of masses in mammograms. In Proceedings of the Fourth IEE International Conference on Artificial Neural Networks, Cambridge, UK, pp.
442–447.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory.
Springer-Verlag New York.
Widmer, C., N. Toussaint, Y. Altun, & G. Ratsch (2010). Inferring latent task structure for multitask learning by multiple
kernel learning. BMC Bioinformatics 11(Suppl 8), S5.
Yang, H., I. King, & M. R. Lyu (2010). Multi-task learning
for one-class classification. In Proceedings of the 2010 International Joint Conference on Neural Networks (IJCNN),
Barcelona, Spain, pp. 1–8.
Zheng, V. W., S. J. Pan, Q. Yang, & J. J. Pan (2008). Transferring multi-device localization models using latent multi-task
learning. In Proceedings of the 23rd National conference on
Artificial intelligence, pp. 1427–1432.