Anomaly detection based on interval-valued fuzzy
sets: Application to rare sound event detection
Stefano Rovetta1, Zied Mnasri1,2, Francesco Masulli1 and Alberto Cabri1
1 DIBRIS, Università degli studi di Genova, Italy
2 ENIT, University Tunis El Manar, Tunisia
Abstract
Audio signal processing is moving towards detecting and/or defining rare or anomalous sounds. Such anomaly detection can readily be applied to audio surveillance systems.
This paper therefore proposes a rare sound event detection method for road traffic monitoring, which includes the detection of hazardous events such as road accidents. The method combines two anomaly
detection techniques: variational autoencoders (VAE) and interval-valued fuzzy sets. The VAE
is used to calculate the reconstruction error of the input audio segment. Based on this reconstruction error, a fuzzy membership function, composed of an optimistic/upper component and a pessimistic/lower
component, is calculated. A probabilistic method for interval comparison is then used to calculate the
membership score, and hence to evaluate the interval-valued fuzzy sets. Finally, classification into anomalous/normal events is obtained by defuzzification. Results show that, with careful parameter setting,
the proposed method outperforms the state-of-the-art one-class SVM for anomaly detection.
Keywords
Anomalous sound event detection, anomaly detection, variational autoencoder, fuzzy membership, interval-valued fuzzy sets.
1. Introduction
Anomaly/outlierness/novelty can be defined in different ways [1]: (a) by scarcity, as events
occurring with low frequency; (b) by characteristics, as events differing from normal events;
(c) by meaning, as events carrying a different meaning than normal events. In the specific
application of road audio surveillance, Anomalous events are mainly car accidents and other
events indicating potential hazards like tire skidding, harsh braking, etc., whereas the Normal
class covers all other events that may happen on the road, e.g. sound of cars, pedestrians, horn
blowing and any other non-hazardous event. This is a particular instance, focused only on
anomalous sound categories, of the sound event detection (SED) problem.
This problem can be formalized either as a classification task for all perceived events, or as
detection of only anomalous/outlier/novel events. In either case, two major issues make this
task difficult: first, background noise fully or partly masks all events, making the resulting
signals highly variable; second, the rareness of the "interesting" events, such as car accidents,
makes them more difficult to model accurately because data are scarce.
WILF’21: The 13th international workshop on fuzzy logic and applications, Dec. 20–22, 2021, Vietri sul Mare, Italy
stefano.rovetta@unige.it (S. Rovetta); zied.mnasri@enit.utm.tn (Z. Mnasri); francesco.masulli@unige.it
(F. Masulli); alberto.cabri@dibris.unige.it (A. Cabri)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
This implies that not only classes are fuzzy, but the membership itself to any class is affected
by a degree of uncertainty. In this case, interval-valued fuzzy sets [2] provide an alternative to
crisp clustering or type-1 fuzzy sets, for which uncertainty would have to be precisely modelled,
either by identification or, more typically, by arbitrary design.
We state the problem as a classification task based on generative models where the final
decision is taken by comparing the inferred interval-valued memberships to the different classes,
using a classical metric of interval comparison, named degree of preference [3]. This process
allows making the final Normal/Anomalous class decision without discarding the information
about uncertainty expressed by the 2-component fuzzy membership.
2. Related work
Sound event detection (SED) is a relatively young discipline that has emerged over roughly the
last decade. Sound recognition methods in general proceed by segmenting signals into fixed-length,
possibly overlapping frames of relatively short duration (fractions of a second). For anomalous
SED, anomaly detection and supervised/unsupervised recognition methods are then applied on
the obtained, fixed-size feature vectors.
Several methods have been built around generative models, such as hidden Markov models
using Gaussian mixture models. Examples of this approach are Ntalampiras et al. [4] and
Heittola et al. [5]. Discriminative methods have also been employed, mainly based on support
vector machines (SVM) and neural networks (NN). Examples are Foggia et al. [6] using one-class
SVM models for each class. The present authors proposed an ensemble one-class SVM-NN
model [7], where one-class SVM detects anomalous data and an NN classifies events.
Unsupervised learning has often been preferred to cope with the issues described. Self-supervised neural networks, such as autoencoders, are well suited to this task. We can mention
Wei et al. [8] using a reconstruction autoencoder to compute the anomaly score through metric
learning, and Purohit et al. [9] employing a deep autoencoder. Variational autoencoders (VAE)
[10], learning a hidden generative representation of the data, are especially interesting.
3. Proposed method
As mentioned, the method uses multiple generative models that learn individual classes, and
compares interval-valued memberships by using the degree of preference. It proceeds as follows:
• In the training phase, a dedicated VAE model is learnt on each subset containing only
one type of events, i.e. Normal or Anomalous.
• In the test phase, the root-mean-square error (RMSE) is calculated between the input, i.e. the feature vector
representing the signal, and the reconstructed output of each VAE model.
• For each input signal 𝑖, the output error 𝜖𝑖,𝑗 of each VAE (1 ≤ 𝑖 ≤ 𝑁 and 1 ≤ 𝑗 ≤ 𝐶, for
𝑁 samples and 𝐶 classes) is used to compute a fuzzy membership function, that provides
a measure of closeness of the signal to the event class on which the VAE model had been
trained. In our case, for each input sample 𝐶 = 2 interval membership functions are
computed, corresponding to the Normal category and the Anomalous one.
Figure 1: Variational autoencoder
• The membership function associated with each event category, i.e. Normal/Anomalous,
is composed of a lower/pessimistic component and an upper/optimistic component. The values of both components form the interval-valued fuzzy membership
(cf. Figure 2).
• Finally, interval comparison is applied using a probabilistic method [11], first to measure
the degree of preference of each interval-valued membership function, and subsequently
to detect the corresponding event category.
3.1. Variational autoencoder
The variational autoencoder (VAE) is a reconstruction network learning a compressed representation of the input to reconstruct the output. The encoding layer stores the parameters of a
probability distribution, e.g., mean and variance, representing the input in a latent space. Then,
the decoder uses the probability distribution to generate an approximated reconstruction of
the input data. Hence the encoder approximates the probability distribution of the identity
function. Given a feature vector 𝑋, the VAE aims to find the probability of 𝑋 with respect to
its representation 𝑍,
$$P(X) = \int P(X \mid Z)\, P(Z)\, dZ. \tag{1}$$
The network has the parameters of 𝑃 (𝑍) (mean and variance) as its hidden parameters. Using
variational inference on a maximum likelihood objective, the encoder output is trained so that its
probability approximates 𝑃 (𝑍|𝑋). The reconstruction RMSE can then be obtained as follows:
$$\epsilon = \sqrt{\frac{\sum_{k=1}^{m} (x_k - x'_k)^2}{m}}, \tag{2}$$
where 𝑥𝑘 and 𝑥′𝑘 (𝑘 = 1, . . . , 𝑚) are the components of the input and the output feature vectors for each autoencoder.
To compensate for class imbalance, a priori class probabilities are used to compute thresholds.
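As an illustration of the reconstruction error in (2), the following minimal sketch computes the RMSE between a feature vector and its reconstruction. The function name `reconstruction_rmse` and the toy vectors are assumptions made for this example; they are not taken from the authors' implementation, and no VAE is involved here.

```python
import numpy as np


def reconstruction_rmse(x, x_rec):
    """Root-mean-square error between a feature vector and its reconstruction (Eq. 2)."""
    x = np.asarray(x, dtype=float)
    x_rec = np.asarray(x_rec, dtype=float)
    if x.shape != x_rec.shape:
        raise ValueError("input and reconstruction must have the same shape")
    # Mean of squared component-wise differences, then square root.
    return float(np.sqrt(np.mean((x - x_rec) ** 2)))


# Toy vectors (made-up numbers, not data from the paper).
x = np.array([0.2, 0.4, 0.6, 0.8])
x_rec = np.array([0.1, 0.4, 0.7, 0.8])
eps = reconstruction_rmse(x, x_rec)
```

In the proposed method, one such error would be computed per class-specific VAE for each input sample.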
[Plot: "Membership functions" (vertical axis, 0 to 1) against "Autoencoder error" (horizontal axis, 0 to 1)]
Figure 2: Example of the proposed reconstruction-error-based membership function. Continuous line
(𝜇U ): optimistic membership. Dashed line (𝜇L ): pessimistic membership. Vertical line at 𝜖: interval
values of membership corresponding to the reconstruction error 𝜖.
In the present work, the VAE employs convolutional layers. The input features are extracted
from the spectrogram, i.e. Mel-frequency cepstral coefficients (MFCC) and log-Energy, with
their first and second derivatives (∆ and ∆-∆). The choice of these features is motivated by their
proven performance in state-of-the-art sound event detection methods [4], in particular for
road traffic surveillance [6].
3.2. Fuzzy membership function
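The membership of each input signal 𝑥𝑖 to each event class 𝑗 is computed from the corresponding
VAE’s output error 𝜖𝑖,𝑗 , and its value is the interval between two membership components: a) the
pessimistic/lower membership 𝜇𝐿,𝑗 , which is minimum when the sample is an outlier w.r.t. class 𝑗, i.e.
𝜖𝑖,𝑗 > 𝜏𝑗 , and b) the optimistic/upper membership 𝜇𝑈,𝑗 , which is maximum when the sample is classified
in class 𝑗, i.e. 𝜖𝑖,𝑗 ≤ 𝜏𝑗 (cf. (3)).
$$\mu_{L,j}(\epsilon_{i,j}) = \begin{cases} 1 - \dfrac{\epsilon_{i,j}}{\tau_j} & \text{if } \epsilon_{i,j} \le \tau_j \\ 0 & \text{if } \epsilon_{i,j} > \tau_j \end{cases} \qquad \mu_{U,j}(\epsilon_{i,j}) = \begin{cases} 1 & \text{if } \epsilon_{i,j} \le \tau_j \\ 2 - \dfrac{\epsilon_{i,j}}{\tau_j} & \text{if } \tau_j < \epsilon_{i,j} \le 2\tau_j \\ 0 & \text{if } \epsilon_{i,j} > 2\tau_j \end{cases} \tag{3}$$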
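The two membership components in (3) can be sketched as a single function returning the interval for a given error and threshold. The name `interval_membership` and the threshold value 0.5 used in the examples are assumptions for illustration only.

```python
def interval_membership(eps, tau):
    """Return (mu_L, mu_U) for reconstruction error eps and threshold tau > 0 (Eq. 3)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    ratio = eps / tau
    # Pessimistic/lower membership: decreases linearly to 0 at eps = tau.
    mu_lower = 1.0 - ratio if ratio <= 1.0 else 0.0
    # Optimistic/upper membership: 1 up to tau, then decreases linearly to 0 at 2 * tau.
    if ratio <= 1.0:
        mu_upper = 1.0
    elif ratio <= 2.0:
        mu_upper = 2.0 - ratio
    else:
        mu_upper = 0.0
    return mu_lower, mu_upper


# Illustrative threshold (assumed value, not from the paper).
tau = 0.5
inside = interval_membership(0.25, tau)    # error below the threshold
between = interval_membership(0.75, tau)   # error between tau and 2 * tau
outside = interval_membership(1.5, tau)    # error above 2 * tau
```

With these inputs the intervals are [0.5, 1], [0, 0.5] and [0, 0], respectively: the interval narrows to a point both for a perfect reconstruction and for errors beyond twice the threshold.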
3.3. Interval comparison
For each class model 𝑗, the reconstruction error 𝜖𝑖,𝑗 is used to generate the interval membership
𝑀𝑖,𝑗 = [𝜇𝐿,𝑗 (𝜖𝑖,𝑗 ), 𝜇𝑈,𝑗 (𝜖𝑖,𝑗 )]. To make the final decision, intervals must be compared for each
𝑗 ∈ {1, 2}. Interval comparison is a particular case of fuzzy number comparison, which has been
broadly investigated for many years [12] using several methods, including probabilistic [13] and
possibilistic [14] ones, among others.
Interval comparison aims to rank real intervals. The heuristic approach developed in [11]
has the advantage of not relying on midpoints for interval comparison. This makes sense
particularly in the case of fuzzy numbers or confidence intervals.
The degree of preference Π(𝐴 > 𝐵) of 𝐴 = [𝑎1 , 𝑎2 ] over 𝐵 = [𝑏1 , 𝑏2 ] is defined in [11] as:
$$\Pi(A > B) = \frac{\max(0, a_2 - b_1) - \max(0, a_1 - b_2)}{(a_2 - a_1) + (b_2 - b_1)}. \tag{4}$$
We observe that Π(𝐴 > 𝐵) + Π(𝐵 > 𝐴) = 1. Moreover,
$$\begin{cases} \text{if } A \equiv B \text{ then } \Pi(A > B) = \Pi(B > A) = 0.5, \\ \text{if } a_2 < b_1 \text{ then } \Pi(B > A) = 1. \end{cases} \tag{5}$$
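A minimal sketch of the degree of preference in (4) is given below, together with a numerical check of the properties stated above. The function name `degree_of_preference` is an assumption. So is the handling of the case where both intervals reduce to single points: (4) is then undefined (0/0), and this sketch falls back to a direct comparison of the two points, which is a choice made for this example and is not described in the paper.

```python
def degree_of_preference(a, b):
    """Pi(A > B) for intervals a = (a1, a2) and b = (b1, b2), following Eq. (4)."""
    a1, a2 = a
    b1, b2 = b
    width = (a2 - a1) + (b2 - b1)
    if width == 0:
        # Both intervals are single points, so Eq. (4) gives 0/0.
        # Assumption of this sketch: compare the two points directly.
        if a1 == b1:
            return 0.5
        return 1.0 if a1 > b1 else 0.0
    return (max(0.0, a2 - b1) - max(0.0, a1 - b2)) / width


# Overlapping intervals (made-up values).
A = (0.25, 0.75)
B = (0.5, 1.0)
pi_ab = degree_of_preference(A, B)
pi_ba = degree_of_preference(B, A)

# Identical intervals, and intervals where A lies entirely below B.
pi_same = degree_of_preference((0.2, 0.6), (0.2, 0.6))
pi_below = degree_of_preference((0.5, 1.0), (0.0, 0.25))
```

Here `pi_ab` and `pi_ba` sum to 1, `pi_same` equals 0.5, and `pi_below` equals 1 because the second interval lies entirely below the first.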
We employ this comparison to rank class memberships 𝑀𝑖,𝑗 , 𝑗 ∈ {1, . . . , 𝐶}. The defuzzification
for the final decision simply consists in choosing the "least preferred" (minimum-error) one:
$$\mathrm{Event}(i) = \arg\min_{j=1,\dots,C} \left\{ \Pi\left(M_{i,j} > M_{i,k \neq j}\right) \right\}. \tag{6}$$
4. Experiments and results
4.1. Audio database
Different audio traffic datasets are suggested in the literature, such as the AXA database [15], WASN
[16] and the MIVIA dataset [6]. The latter has the advantage of being the only open-access database
for audio traffic surveillance. It contains nearly one hour of traffic sounds that were recorded in
a real road environment at 23 locations in the province of Salerno, Italy, in city centers, on
highways and on country roads. The database is segmented into 57 clips of nearly one minute each,
which were annotated manually. The annotation file includes the event labels, e.g. accident, tire
skidding, horn blowing, etc., and the onset and offset times. Some audio events are considered
Anomalous, i.e. car crash, tire skidding and harsh braking, whereas all other events are
considered Normal, such as the sound of cars and pedestrians, and the background noise.
4.2. Parameter setting
The main parameter adjustment concerns the setting of the thresholds 𝜏𝑗 . Different values
were experimentally optimized. Thresholds were weighted using the complement of the
proportion of each class as a weighting coefficient. Thus, the threshold 𝜏𝑗 of each VAE’s error, for each class
𝑗 = 1, . . . , 𝐶, was set as the baseline VAE threshold 𝜏0 multiplied by the
weight 𝑤𝑗 = 1 − 𝑝𝑗 , where 𝑝𝑗 is the proportion of samples of Class 𝑗. Table 1 summarizes the
values.
Table 1
Parameter setting for the VAE’s error and the fuzzy membership function (𝑝j is the proportion of Class
𝑗 samples in the training set)
Part               Parameter             Value
All                Event weight 𝑤j       1 − 𝑝j
Baseline VAE       Error threshold 𝜏0    𝜏0 ∈ ]0, 1[
Event-based VAEs   Error threshold 𝜏j    𝜏0 × 𝑤j
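The weighting rule of Table 1 can be sketched as follows. The class proportions 0.79 and 0.21 are those reported for the training set in the caption of Table 2; the baseline threshold 𝜏0 = 0.5 and the function name `class_thresholds` are assumptions chosen only to make the example concrete.

```python
def class_thresholds(tau0, proportions):
    """Per-class thresholds tau_j = tau0 * (1 - p_j), as in Table 1."""
    if not 0.0 < tau0 < 1.0:
        raise ValueError("tau0 is expected in the open interval ]0, 1[")
    # The weight of each class is the complement of its proportion.
    return {name: tau0 * (1.0 - p) for name, p in proportions.items()}


# Class proportions reported for the training set.
proportions = {"normal": 0.79, "anomalous": 0.21}
# Baseline threshold: assumed value for illustration only.
tau0 = 0.5
thresholds = class_thresholds(tau0, proportions)
```

Under these assumptions the Normal threshold is about 0.105 and the Anomalous threshold about 0.395, so the rarer class tolerates a larger reconstruction error before its membership drops.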
Table 2
Results of anomalous SED using VAEs and the fuzzy membership function for Normal vs. Anomalous event
classification (𝑝norm = 0.79 and 𝑝anom = 0.21 are the proportions of Normal and Anomalous samples
in the training set). For OC-SVM, the parameters 𝜈 and 𝛾 are set to 0.14 and 2.5e-5, respectively, for
their high performance.
Method                      𝑤norm  𝑤anom  Accuracy  𝑃1    𝑃2    𝑅1    𝑅2    𝐹11   𝐹12
One-Class SVM               -      -      0.84      0.94  0.59  0.86  0.77  0.90  0.67
VAE with fuzzy membership   0.5    0.5    0.83      0.95  0.38  0.86  0.65  0.90  0.48
VAE with fuzzy membership   0.6    0.4    0.95      0.94  1.00  1.00  0.57  0.97  0.72
VAE with fuzzy membership   0.7    0.3    0.93      0.92  1.00  1.00  0.50  0.96  0.67
VAE with fuzzy membership   0.8    0.2    0.93      0.92  1.00  1.00  0.40  0.96  0.57
VAE with fuzzy membership   0.9    0.1    0.93      0.93  1.00  1.00  0.40  0.96  0.57
4.3. Experimental protocol
The experimental work aims to detect audio events on roads. To do so, features were extracted
from the selected audio database, MIVIA DB [6], then experiments were carried out following the
steps described in Section 3.
Regarding the first step, i.e. feature extraction, data augmentation was applied to cope
with the rareness of Anomalous samples: more data were obtained by
segmenting the audio signals into short frames of 250 ms with a high
overlap rate, i.e. 75%. Nevertheless, it is worth noting that all training segments, whether
belonging to the Normal or the Anomalous class, contain background street noise.
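The segmentation into 250 ms frames with 75% overlap can be sketched as below. The sampling rate of 16 kHz, the synthetic one-second signal and the function name `frame_signal` are assumptions for illustration; the paper does not state the sampling rate or the framing code that was used.

```python
import numpy as np


def frame_signal(signal, sample_rate, frame_ms=250.0, overlap=0.75):
    """Split a 1-D signal into fixed-length frames with fractional overlap."""
    signal = np.asarray(signal, dtype=float)
    if signal.ndim != 1:
        raise ValueError("expected a one-dimensional signal")
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    # Hop size: the part of each frame that does not overlap the previous one.
    hop = int(round(frame_len * (1.0 - overlap)))
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame length and hop must be positive")
    if signal.shape[0] < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (signal.shape[0] - frame_len) // hop
    starts = hop * np.arange(n_frames)
    return np.stack([signal[s:s + frame_len] for s in starts])


# One second of synthetic samples at an assumed 16 kHz sampling rate.
sample_rate = 16000
signal = np.arange(sample_rate)
frames = frame_signal(signal, sample_rate)
```

With these assumed values each frame holds 4000 samples and consecutive frames start 1000 samples apart, so one second of audio yields 13 frames instead of the 4 obtained without overlap.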
Regarding neural network training, the VAE network was constructed using convolutional
layers, with an input feature vector made of log-energy and MFCC features, along with their
first and second derivatives (∆ and ∆-∆). 80% of the extracted data were used for training
and validation, whereas testing was carried out on the remaining 20%.
4.4. Analysis of results
The evaluation results are listed in Table 2. These results correspond to a state-of-the-art method, i.e. OC-SVM (used for benchmarking), and to the proposed method (event-based
VAE with fuzzy membership). For the latter, the values of the event weights {𝑤𝑗 }𝑗=1,...,𝐶
were varied to find the tradeoff between data distribution and the global performance. For
evaluation purposes, standard metrics were calculated, i.e. overall accuracy (𝐴𝑐𝑐), precision
(𝑃 ), recall (𝑅) and 𝐹 1 scores, defined as in (7):
$$P_j = \frac{c_j}{e_j}, \quad R_j = \frac{c_j}{r_j}, \quad F1_j = \frac{2 P_j R_j}{P_j + R_j}, \tag{7}$$
where 𝑟𝑗 , 𝑒𝑗 and 𝑐𝑗 (𝑗 ∈ {1, 2}) are the numbers of ground-truth, estimated and correctly
detected events for the Normal and Anomalous classes, respectively.
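The metrics in (7) can be sketched as follows. The event counts in the first example are made up, and the function names are assumptions; the last two lines recompute the F1 scores of the OC-SVM row of Table 2 from its reported precision and recall values as a consistency check.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (Eq. 7)."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0


def precision_recall_f1(correct, estimated, ground_truth):
    """P = c / e, R = c / r and F1 for one class, following Eq. (7)."""
    precision = correct / estimated if estimated else 0.0
    recall = correct / ground_truth if ground_truth else 0.0
    return precision, recall, f1_score(precision, recall)


# Made-up counts: 50 events detected, 40 of them correct, 80 in the ground truth.
p, r, f1 = precision_recall_f1(correct=40, estimated=50, ground_truth=80)

# Consistency check against the OC-SVM row of Table 2 (P1, R1 and P2, R2).
f1_normal = round(f1_score(0.94, 0.86), 2)
f1_anomalous = round(f1_score(0.59, 0.77), 2)
```

The recomputed values, 0.90 and 0.67, match the F1 scores reported for OC-SVM.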
The results mentioned in Table 2 show the efficiency of using an interval-valued fuzzy
membership function to improve anomaly detection. The main advantages of using such a
method can be summarized as follows:
• The proposed method outperforms the state-of-the-art OC-SVM in terms of overall
accuracy and balance between class-based metrics.
• Overall accuracy is improved, reaching 95% for the proposed method vs. 84% for
OC-SVM. Also, the precision, recall and F1 score obtained are more balanced between
Normal and Anomalous classes, notwithstanding their disproportionate distribution.
• The effect of using unbalanced weights is evident, with higher accuracy when
𝑤𝑗 is higher for the Anomalous class.
5. Discussion and conclusion
This paper presented a novel method of anomaly detection, based on interval-valued fuzzy
sets. A direct application in road traffic surveillance allows detecting hazardous events such as
car accidents using audio signals. The proposed method combines two anomaly
detection tools, i.e. auto-regressive VAEs and interval-valued fuzzy sets. Finally, a probabilistic
interval comparison method, denoted as degree of preference, is utilized for defuzzification, i.e.
detecting the corresponding class.
The main results can be summarized as follows: a) spectrogram-extracted features are the
most suitable for such a problem; b) unbalanced weights, where the least abundant class
receives the highest weight, help improve the results; and c) interval-valued fuzzy sets
seem more effective than crisp one-class SVM at detecting anomalies. As an outlook, the proposed
method could be further improved in two directions: either by making it semi-supervised, so that
only normal data are collected and used for training, or fully unsupervised, by no longer using
labels.
Acknowledgments
This work was carried out in the framework of the project Xpert funded by the University of
Genova.
References
[1] A. A. Sodemann, M. P. Ross, B. J. Borghetti, A review of anomaly detection in automated
surveillance, IEEE Transactions on Systems, Man, and Cybernetics, Part C 42 (2012)
1257–1272.
[2] J. M. Mendel, R. I. B. John, Type-2 fuzzy sets made simple, IEEE Transactions on Fuzzy
Systems 10 (2002) 117–127. doi:10.1109/91.995115.
[3] P. Sevastianov, Numerical methods for interval and fuzzy number comparison based on
the probabilistic approach and Dempster–Shafer theory, Information Sciences 177 (2007)
4645–4661.
[4] S. Ntalampiras, I. Potamitis, N. Fakotakis, Probabilistic novelty detection for acoustic
surveillance under real-world conditions, IEEE Transactions on Multimedia 13 (2011)
713–719.
[5] T. Heittola, A. Mesaros, A. Eronen, T. Virtanen, Context-dependent sound event detection,
EURASIP Journal on Audio, Speech, and Music Processing 2013 (2013) 1–13.
[6] P. Foggia, N. Petkov, A. Saggese, N. Strisciuglio, M. Vento, Audio surveillance of roads: A
system for detecting anomalous sounds, IEEE Transactions on Intelligent Transportation
Systems 17 (2015) 279–288.
[7] S. Rovetta, Z. Mnasri, F. Masulli, Detection of hazardous road events from audio streams:
An ensemble outlier detection approach, in: 2020 IEEE Conference on Evolving and
Adaptive Intelligent Systems (EAIS), IEEE, 2020, pp. 1–6.
[8] Q. Wei, Y. Liu, Auto-encoder and metric-learning for anomalous sound detection
task (2020). URL: http://dcase.community/challenge2020/index, preprint:
http://dcase.community/documents/challenge2020/technical_reports/DCASE2020_Wei_49_t2.pdf.
[9] H. Purohit, R. Tanabe, T. Endo, K. Suefusa, Y. Nikaido, Y. Kawaguchi, Deep autoencoding
GMM-based unsupervised anomaly detection in acoustic signals and its hyper-parameter
optimization, in: Proceedings of the Detection and Classification of Acoustic Scenes and
Events 2020 Workshop (DCASE2020), Tokyo, Japan, 2020.
[10] D. P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv preprint 1312.6114
(2013).
[11] Y.-M. Wang, J.-B. Yang, D.-L. Xu, A preference aggregation method through the estimation
of utility intervals, Computers & Operations Research 32 (2005) 2027–2049.
[12] E. Lee, R.-J. Li, Comparison of fuzzy numbers based on the probability measure of fuzzy
events, Computers & Mathematics with Applications 15 (1988) 887–896.
[13] V.-N. Huynh, Y. Nakamori, J. Lawry, A probability-based approach to comparison of fuzzy
numbers and applications to target-oriented decision making, IEEE Transactions on Fuzzy
Systems 16 (2008) 371–387.
[14] A. Kasperski, A possibilistic approach to sequencing problems with fuzzy parameters,
Fuzzy Sets and Systems 150 (2005) 77–86.
[15] M. Sammarco, M. Detyniecki, Crashzam: Sound-based car crash detection, in: Proceedings
of Vehicle Technology and Intelligent Transport Systems (VEHITS), 2018, pp. 27–35.
[16] R. M. Alsina-Pagès, F. Orga, F. Alías, J. C. Socoró, A WASN-based suburban dataset for
anomalous noise event detection on dynamic road-traffic noise mapping, Sensors 19 (2019)
2480.