Anomaly detection based on interval-valued fuzzy sets: Application to rare sound event detection

Stefano Rovetta1, Zied Mnasri1,2, Francesco Masulli1 and Alberto Cabri1
1 DIBRIS, Università degli studi di Genova, Italy
2 ENIT, University Tunis El Manar, Tunisia

Abstract
Audio signal processing is moving towards detecting and/or defining rare/anomalous sounds. Such an anomaly detection problem extends naturally to audio surveillance systems. Thus, a rare sound event detection method for road traffic monitoring is proposed in this paper, including detection of hazardous events, i.e., road accidents. The method is based on combining two anomaly detection tools, variational autoencoders (VAE) and interval-valued fuzzy sets. The VAE is used to calculate the reconstruction error of the input audio segment. Based on this reconstruction error, a fuzzy membership function, composed of an optimistic/upper component and a pessimistic/lower component, is calculated. Then, a probabilistic method for interval comparison is used to calculate the membership score, and hence to evaluate the interval-valued fuzzy sets. Finally, classification into anomalous/normal events is obtained by defuzzification. Results show that, with a careful parameter setting, the proposed method outperforms the state-of-the-art one-class SVM for anomaly detection.

Keywords
Anomalous sound event detection, anomaly detection, variational autoencoder, fuzzy membership, interval-valued fuzzy sets.

1. Introduction

Anomaly/outlierness/novelty can be defined in different ways [1]: (a) by scarcity, as events occurring with low frequency; (b) by characteristics, as events differing from normal events; (c) by meaning, as events carrying a different meaning than normal events.
In the specific application of road audio surveillance, Anomalous events are mainly car accidents and other events indicating potential hazards, like tire skidding, harsh braking, etc., whereas the Normal class covers all other events that may happen on the road, e.g. the sound of cars, pedestrians, horn blowing and any other non-hazardous event. This is a particular instance, focused only on anomalous sound categories, of the sound event detection (SED) problem. This problem can be formalized either as a classification task for all perceived events, or as detection of only anomalous/outlier/novel events. In either case, two major issues make this task difficult: first, background noise that fully or partly masks all events, making the resulting signals highly variable; second, the rareness of the "interesting" events, such as car accidents, which makes them more difficult to model accurately because data are scarce. This implies that not only are classes fuzzy, but the membership itself to any class is affected by a degree of uncertainty. In this case, interval-valued fuzzy sets [2] provide an alternative to crisp clustering or type-1 fuzzy sets, for which uncertainty would have to be precisely modelled, either by identification or, more typically, by arbitrary design.

WILF'21: The 13th International Workshop on Fuzzy Logic and Applications, Dec. 20–22, 2021, Vietri sul Mare, Italy
stefano.rovetta@unige.it (S. Rovetta); zied.mnasri@enit.utm.tn (Z. Mnasri); francesco.masulli@unige.it (F. Masulli); alberto.cabri@dibris.unige.it (A. Cabri)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
We state the problem as a classification task based on generative models, where the final decision is taken by comparing the inferred interval-valued memberships to the different classes, using a classical metric of interval comparison named degree of preference [3]. This process allows making the final Normal/Anomalous class decision without discarding the information about uncertainty expressed by the two-component fuzzy membership.

2. Related work

Sound event detection (SED) is a relatively young discipline that emerged roughly a decade ago. Sound recognition methods in general proceed by segmenting signals into fixed-length, possibly overlapping frames of relatively short duration (fractions of a second). For anomalous SED, anomaly detection and supervised/unsupervised recognition methods are then applied to the resulting fixed-size feature vectors.

Several methods have been built around generative models, such as hidden Markov models using Gaussian mixture models. Examples of this approach are Ntalampiras et al. [4] and Heittola et al. [5]. Discriminative methods have also been employed, mainly based on support vector machines (SVM) and neural networks (NN). An example is Foggia et al. [6], who use one-class SVM models for each class. The present authors proposed an ensemble one-class SVM-NN model [7], where a one-class SVM detects anomalous data and a NN classifies events.

Unsupervised learning has often been preferred to cope with the issues described. Self-supervised neural networks, such as autoencoders, are well suited to this task. We can mention Wei et al. [8], using a reconstruction autoencoder to compute the anomaly score through metric learning, and Purohit et al. [9], employing a deep autoencoder. Variational autoencoders (VAE) [10], which learn a hidden generative representation of the data, are especially interesting.

3. Proposed method

As mentioned, the method uses multiple generative models that learn individual classes, and compares interval-valued memberships by using the degree of preference. It proceeds as follows:
• In the training phase, a dedicated VAE model is learnt on each subset containing only one type of event, i.e. Normal or Anomalous.
• In the test phase, the RMSE is calculated between the input, i.e. the feature vector representing the signal, and the reconstructed output of each VAE model.
• For each input signal $i$, the output error $\epsilon_{i,j}$ of each VAE ($1 \le i \le N$ and $1 \le j \le C$, for $N$ samples and $C$ classes) is used to compute a fuzzy membership function that provides a measure of closeness of the signal to the event class on which the VAE model was trained. In our case, $C = 2$ interval membership functions are computed for each input sample, corresponding to the Normal category and the Anomalous one.
• The membership function associated with each event category, i.e. Normal/Anomalous, is composed of a lower/pessimistic component and an upper/optimistic component. The values of both components form the interval-valued fuzzy membership (cf. Figure 2).
• Finally, interval comparison is applied using a probabilistic method [11], first to measure the degree of preference of each interval-valued membership function, and subsequently to detect the corresponding event category.

Figure 1: Variational autoencoder

3.1. Variational autoencoder

The variational autoencoder (VAE) is a reconstruction network that learns a compressed representation of the input in order to reconstruct it at the output. The encoding layer stores the parameters of a probability distribution, e.g., mean and variance, representing the input in a latent space. Then, the decoder uses this probability distribution to generate an approximate reconstruction of the input data. Hence the autoencoder approximates the probability distribution of the identity function.
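To make the encoding and sampling step concrete, the following sketch draws a latent vector from the Gaussian described by the encoder output. It assumes a diagonal Gaussian parameterized by a mean and a log-variance and uses the reparameterization form z = mean + std * noise introduced in [10]. The function name, the four latent dimensions and the numeric values are hypothetical, and the text does not specify how the VAE used here samples its latent space.

```python
import math
import random
from typing import List


def sample_latent(mean: List[float], log_var: List[float],
                  rng: random.Random) -> List[float]:
    """Draw z ~ N(mean, diag(exp(log_var))) as mean + std * noise.

    The log-variance parameterization is an assumption made for this
    sketch; the paper only mentions mean and variance.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


# Hypothetical encoder output for a 4-dimensional latent space.
rng = random.Random(0)
mean = [0.0, 0.0, 0.0, 0.0]
log_var = [-2.0, -2.0, -2.0, -2.0]
z = sample_latent(mean, log_var, rng)
print(z)
```

A decoder would then map `z` back to the feature space; the reconstruction is compared with the input to obtain the error used in the rest of the method.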
Given a feature vector $X$, the VAE aims to find the probability of $X$ with respect to its representation $Z$,

$$P(X) = \int P(X|Z)\,P(Z)\,dZ . \tag{1}$$

The network has the parameters of $P(Z)$ (mean and variance) as its hidden parameters. Using variational inference on a maximum likelihood objective, the encoder output is trained so that its probability approximates $P(Z|X)$. The reconstruction RMSE can then be obtained as follows:

$$\epsilon = \sqrt{\frac{\sum_{k=1}^{m} (x_k - x'_k)^2}{m}} , \tag{2}$$

where $x_k$ and $x'_k$ ($k = 1, \ldots, m$) are the components of the input and output feature vectors of each autoencoder. To compensate for class imbalance, a priori class probabilities are used to compute thresholds.

Figure 2: Example of the proposed reconstruction-error-based membership function (horizontal axis: autoencoder error; vertical axis: membership functions). Continuous line ($\mu_U$): optimistic membership. Dashed line ($\mu_L$): pessimistic membership. Vertical line at $\epsilon$: interval values of membership corresponding to the reconstruction error $\epsilon$.

In the present work, the VAE employs convolutional layers. The input features are extracted from the spectrogram, i.e. Mel-frequency cepstral coefficients (MFCC) and log-energy, with their first and second derivatives (∆ and ∆-∆). The choice of these features is motivated by their proven performance in state-of-the-art methods of sound event detection [4], in particular road traffic surveillance [6].

3.2. Fuzzy membership function

The membership of each input signal $x_i$ to each event $j$ is computed from the corresponding VAE's output error $\epsilon_{i,j}$, and its value is the interval between two membership components: a) the pessimistic/lower membership $\mu_{L,j}$, minimum when the sample is an outlier w.r.t. class $j$, i.e. $\epsilon_{i,j} > \tau_j$, and b) the optimistic/upper membership $\mu_{U,j}$, maximum when the sample is classified in class $j$, i.e. $\epsilon_{i,j} < \tau_j$ (cf. (3)).
$$\mu_{L,j}(\epsilon_{i,j}) = \begin{cases} 1 - \dfrac{\epsilon_{i,j}}{\tau_j} & \text{if } \epsilon_{i,j} \le \tau_j \\ 0 & \text{if } \epsilon_{i,j} > \tau_j \end{cases} \qquad \mu_{U,j}(\epsilon_{i,j}) = \begin{cases} 1 & \text{if } \epsilon_{i,j} \le \tau_j \\ 2 - \dfrac{\epsilon_{i,j}}{\tau_j} & \text{if } \tau_j < \epsilon_{i,j} \le 2\tau_j \\ 0 & \text{if } \epsilon_{i,j} > 2\tau_j \end{cases} \tag{3}$$

3.3. Interval comparison

For each class model $j$, the reconstruction error $\epsilon_{i,j}$ is used to generate the interval membership $M_{i,j} = [\mu_{L,j}(\epsilon_{i,j}), \mu_{U,j}(\epsilon_{i,j})]$. To make the final decision, intervals must be compared for each $j \in \{1, 2\}$. Interval comparison is a particular case of fuzzy number comparison, which has been broadly investigated for many years [12] using several methods, including probabilistic [13] and possibilistic [14] ones, among others.

Interval comparison aims to rank real intervals. The heuristic approach developed in [11] has the advantage of not relying on midpoints for interval comparison. This makes sense particularly in the case of fuzzy numbers or confidence intervals. The degree of preference $\Pi(A > B)$ of $A = [a_1, a_2]$ over $B = [b_1, b_2]$ is defined in [11] as:

$$\Pi(A > B) = \frac{\max(0, a_2 - b_1) - \max(0, a_1 - b_2)}{(a_2 - a_1) + (b_2 - b_1)} . \tag{4}$$

We observe that $\Pi(A > B) + \Pi(B > A) = 1$. Moreover,

$$\begin{cases} \text{if } A \equiv B \text{ then } \Pi(A > B) = \Pi(B > A) = 0.5, \\ \text{if } a_2 < b_1 \text{ then } \Pi(B > A) = 1. \end{cases} \tag{5}$$

We employ this comparison to rank class memberships $M_{i,j}$, $j \in \{1, \ldots, C\}$. The defuzzification for the final decision simply consists in choosing the "least preferred" (minimum-error) one:

$$\mathrm{Event}(i) = \arg\min_{j=1,\ldots,C} \left\{ \Pi(M_{i,j} > M_{i,k \neq j}) \right\} \tag{6}$$

4. Experiments and results

4.1. Audio database

Different audio traffic datasets are suggested in the literature, such as the AXA database [15], WASN [16] and the MIVIA dataset [6]. The latter has the advantage of being the only open-access database for audio traffic surveillance. It contains nearly one hour of traffic sounds that were recorded in a real road environment at 23 locations in the province of Salerno, Italy, in city centers, on highways and on country roads. The database is segmented into 57 clips of nearly one minute each, which were annotated manually. The annotation file includes the event labels, e.g.
accident, tire skidding, horn blowing, etc., and the onset and offset times. Some audio events are considered as Anomalous, i.e. car crash, tire skidding and harsh braking, whereas all other events are considered as Normal, such as the sound of cars and pedestrians, and the background noise.

4.2. Parameter setting

The main parameter adjustment concerns the setting of the thresholds $\tau_j$. Different values were experimentally optimized. Thresholds were weighted using the complement of the proportion of each class as a weighting coefficient. Thus, the threshold $\tau_j$ on the error of each class VAE, $j = 1, \ldots, C$, was set as the baseline VAE's threshold $\tau_0$ multiplied by the weight $w_j = 1 - p_j$, where $p_j$ is the proportion of samples of Class $j$. Table 1 summarizes the values.

Table 1: Parameter setting for the VAE's error and the fuzzy membership function ($p_j$ is the proportion of Class $j$ samples in the training set)

Part               Parameter               Value
All                Event weight w_j        1 − p_j
Baseline VAE       Error threshold τ_0     τ_0 ∈ ]0, 1[
Event-based VAEs   Error threshold τ_j     τ_0 × w_j

Table 2: Results of anomalous SED using VAEs and fuzzy membership function for Normal vs. Anomalous event classification ($p_{norm} = 0.79$ and $p_{anom} = 0.21$ are the proportions of Normal and Anomalous samples in the training set). For OC-SVM, the parameters $\nu$ and $\gamma$ are set to 0.14 and 2.5e-5, respectively, for their high performance.

Method                      w_norm  w_anom  Accuracy  P_1   P_2   R_1   R_2   F1_1  F1_2
One-Class SVM               -       -       0.84      0.94  0.59  0.86  0.77  0.90  0.67
VAE with fuzzy membership   0.5     0.5     0.83      0.95  0.38  0.86  0.65  0.90  0.48
                            0.6     0.4     0.95      0.94  1.00  1.00  0.57  0.97  0.72
                            0.7     0.3     0.93      0.92  1.00  1.00  0.50  0.96  0.67
                            0.8     0.2     0.93      0.92  1.00  1.00  0.40  0.96  0.57
                            0.9     0.1     0.93      0.93  1.00  1.00  0.40  0.96  0.57

4.3. Experimental protocol

The experimental work aims to detect audio events on roads. To do so, features were extracted from the selected audio database, MIVIA DB [6], then experiments were carried out following the steps described in Section 3.
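As a minimal illustration of those steps, the sketch below computes the class thresholds of Table 1, the membership interval of (3) and the degree of preference of (4) in plain Python. The function names, the baseline threshold tau0 = 0.5 and the two reconstruction errors are hypothetical, chosen only for illustration. The handling of two zero-width intervals, where the denominator of (4) vanishes, is likewise an assumption that is not stated in the text.

```python
from typing import Tuple

Interval = Tuple[float, float]


def class_threshold(tau0: float, p_j: float) -> float:
    """Event-based threshold tau_j = tau0 * w_j, with w_j = 1 - p_j (Table 1)."""
    return tau0 * (1.0 - p_j)


def membership_interval(eps: float, tau: float) -> Interval:
    """Interval [mu_L, mu_U] of (3) for reconstruction error eps and threshold tau."""
    # Pessimistic/lower component: decreases linearly to 0 at eps = tau.
    mu_l = 1.0 - eps / tau if eps <= tau else 0.0
    # Optimistic/upper component: 1 up to tau, then decreases to 0 at 2 * tau.
    if eps <= tau:
        mu_u = 1.0
    elif eps <= 2.0 * tau:
        mu_u = 2.0 - eps / tau
    else:
        mu_u = 0.0
    return (mu_l, mu_u)


def degree_of_preference(a: Interval, b: Interval) -> float:
    """Degree of preference Pi(A > B) of (4) for A = [a1, a2], B = [b1, b2]."""
    a1, a2 = a
    b1, b2 = b
    width = (a2 - a1) + (b2 - b1)
    if width == 0.0:
        # Assumption (not in the paper): both intervals are single points,
        # so fall back to comparing the two values directly.
        if a1 == b1:
            return 0.5
        return 1.0 if a1 > b1 else 0.0
    return (max(0.0, a2 - b1) - max(0.0, a1 - b2)) / width


# Class proportions from the caption of Table 2; tau0 and errors are hypothetical.
tau0 = 0.5
tau_norm = class_threshold(tau0, 0.79)
tau_anom = class_threshold(tau0, 0.21)
m_norm = membership_interval(0.05, tau_norm)
m_anom = membership_interval(0.50, tau_anom)
pref = degree_of_preference(m_norm, m_anom)
print(m_norm, m_anom, pref)
```

The two preference values Π(M_norm > M_anom) and Π(M_anom > M_norm) sum to one, so computing a single value is enough in the two-class case.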
Regarding the first step, i.e. feature extraction, data augmentation was performed to cope with the rareness of Anomalous samples: more data is obtained by segmenting the audio signals into short frames of 250 ms with a high overlap rate, i.e. 75%. Nevertheless, it is worth noting that all training segments, whether belonging to Normal or Anomalous, contain background street noise. Regarding neural network training, the VAE network was built from convolutional layers, using an input feature vector made of log-energy and MFCC features, along with their first and second derivatives (∆ and ∆-∆). 80% of the extracted data were used for training and validation, whereas testing was performed on the remaining 20%.

4.4. Analysis of results

The evaluation results are listed in Table 2. These results correspond to a state-of-the-art method, i.e. OC-SVM (used for benchmarking), and to the proposed method (event-based VAE with fuzzy membership). For the latter, the values of the event weights $\{w_j\}_{j=1,\ldots,C}$ were varied to find the tradeoff between data distribution and global performance. For evaluation purposes, standard metrics were calculated, i.e. overall accuracy ($Acc$), precision ($P$), recall ($R$) and $F1$ scores, defined as in (7):

$$P_j = \frac{c_j}{e_j}, \qquad R_j = \frac{c_j}{r_j}, \qquad F1_j = \frac{2 P_j R_j}{P_j + R_j}, \tag{7}$$

where $r_j$, $e_j$ and $c_j$ ($j \in \{1, 2\}$) are the numbers of ground-truth, estimated and correctly detected events, for the Normal and Anomalous class respectively.

The results reported in Table 2 show the effectiveness of using an interval-valued fuzzy membership function to improve anomaly detection. The main advantages of using such a method can be summarized as follows:
• The proposed method outperforms the state-of-the-art OC-SVM in terms of overall accuracy and balance between class-based metrics.
• Overall accuracy rates are enhanced, reaching 95% for the proposed method, vs. 84% for OC-SVM.
Also, the precision, recall and F1 score obtained are more balanced between Normal and Anomalous classes, notwithstanding their disproportionate distribution.
• The effect of using unbalanced weights is more evident, with higher accuracy when $w_j$ is higher for the Anomalous class.

5. Discussion and conclusion

This paper presented a novel method of anomaly detection, based on interval-valued fuzzy sets. A direct application in road traffic surveillance allows detecting hazardous events such as car accidents using audio signals. The proposed method is based on combining two anomaly detection tools, i.e. auto-regressive VAEs and interval-valued fuzzy sets. Finally, a probabilistic interval comparison method, denoted as degree of preference, is used for defuzzification, i.e. detecting the corresponding class.

The main results can be summarized as follows: a) spectrogram-extracted features are the most suitable to approach such a problem; b) unbalanced weights, where the least abundant class receives the highest weight, contribute to enhancing the results; and c) interval-valued fuzzy sets seem more efficient than a crisp one-class SVM at detecting anomalies.

As an outlook, the proposed method could be further improved in two directions: either by making it semi-supervised, as only normal data can be collected and trained on, or fully unsupervised, by not using labels any more.

Acknowledgments

This work was carried out in the framework of the project Xpert funded by the University of Genova.

References

[1] A. A. Sodemann, M. P. Ross, B. J. Borghetti, A review of anomaly detection in automated surveillance, IEEE Transactions on Systems, Man, and Cybernetics, Part C 42 (2012) 1257–1272.
[2] J. M. Mendel, R. I. B. John, Type-2 fuzzy sets made simple, IEEE Transactions on Fuzzy Systems 10 (2002) 117–127. doi:10.1109/91.995115.
[3] P. Sevastianov, Numerical methods for interval and fuzzy number comparison based on the probabilistic approach and Dempster–Shafer theory, Information Sciences 177 (2007) 4645–4661.
[4] S. Ntalampiras, I. Potamitis, N. Fakotakis, Probabilistic novelty detection for acoustic surveillance under real-world conditions, IEEE Transactions on Multimedia 13 (2011) 713–719.
[5] T. Heittola, A. Mesaros, A. Eronen, T. Virtanen, Context-dependent sound event detection, EURASIP Journal on Audio, Speech, and Music Processing 2013 (2013) 1–13.
[6] P. Foggia, N. Petkov, A. Saggese, N. Strisciuglio, M. Vento, Audio surveillance of roads: A system for detecting anomalous sounds, IEEE Transactions on Intelligent Transportation Systems 17 (2015) 279–288.
[7] S. Rovetta, Z. Mnasri, F. Masulli, Detection of hazardous road events from audio streams: An ensemble outlier detection approach, in: 2020 IEEE Conference on Evolving and Adaptive Intelligent Systems (EAIS), IEEE, 2020, pp. 1–6.
[8] Q. Wei, Y. Liu, Auto-encoder and metric-learning for anomalous sound detection task (2020). URL: http://dcase.community/challenge2020/index, preprint: http://dcase.community/documents/challenge2020/technical_reports/DCASE2020_Wei_49_t2.pdf.
[9] H. Purohit, R. Tanabe, T. Endo, K. Suefusa, Y. Nikaido, Y. Kawaguchi, Deep autoencoding GMM-based unsupervised anomaly detection in acoustic signals and its hyper-parameter optimization, in: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020), Tokyo, Japan, 2020.
[10] D. P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv preprint 1312.6114 (2013).
[11] Y.-M. Wang, J.-B. Yang, D.-L. Xu, A preference aggregation method through the estimation of utility intervals, Computers & Operations Research 32 (2005) 2027–2049.
[12] E. Lee, R.-J. Li, Comparison of fuzzy numbers based on the probability measure of fuzzy events, Computers & Mathematics with Applications 15 (1988) 887–896.
[13] V.-N. Huynh, Y. Nakamori, J. Lawry, A probability-based approach to comparison of fuzzy numbers and applications to target-oriented decision making, IEEE Transactions on Fuzzy Systems 16 (2008) 371–387.
[14] A. Kasperski, A possibilistic approach to sequencing problems with fuzzy parameters, Fuzzy Sets and Systems 150 (2005) 77–86.
[15] M. Sammarco, M. Detyniecki, Crashzam: Sound-based car crash detection, in: Proceedings of Vehicle Technology and Intelligent Transport Systems (VEHITS), 2018, pp. 27–35.
[16] R. M. Alsina-Pagès, F. Orga, F. Alías, J. C. Socoró, A WASN-based suburban dataset for anomalous noise event detection on dynamic road-traffic noise mapping, Sensors 19 (2019) 2480.