With the development of multi-modal man-machine interaction, audio signal analysis is gaining importance in a field traditionally dominated by video. In particular, anomalous sound event detection offers novel options to improve audio-based man-machine interaction in many useful applications such as surveillance systems, industrial fault detection and especially safety monitoring, both indoor and outdoor. Event detection from audio can fruitfully integrate visual information and can outperform it in some respects, thus representing a complementary perceptual modality. However, it also presents specific issues and challenges. In this paper, a comprehensive survey of anomalous sound event detection is presented, covering various aspects of the topic, i.e., feature extraction methods, datasets, evaluation metrics, methods, applications, and some open challenges and improvement ideas that have recently been raised in the literature.
Audio signal processing is moving towards detecting and/or defining rare/anomalous sounds. Such an anomaly detection problem extends naturally to audio surveillance systems. Thus, a rare sound event detection method for road traffic monitoring is proposed in this paper, including detection of hazardous events, i.e., road accidents. The method is based on combining anomaly detection techniques, such as variational autoencoders (VAE) and interval-valued fuzzy sets. The VAE is used to calculate the reconstruction error of the input audio segment. Based on this reconstruction error, a fuzzy membership function, composed of an optimistic/upper component and a pessimistic/lower component, is calculated. Then, a probabilistic method for interval comparison is used to calculate the membership score, and hence to evaluate the interval-valued fuzzy sets. Finally, classification into anomalous/normal events is obtained by defuzzification. Results show that, with a careful parameter setting, the proposed method outperforms the state-of-the-art one-class SVM for anomaly detection.
In this paper, a novel relationship between instantaneous frequency (IF) and fundamental frequency (F0) in voiced parts of speech signals is presented. IF is calculated as the time-derivative of the phase of the analytic signal obtained from the Hilbert transform, whereas F0 can be extracted using any classical pitch tracking technique (e.g., autocorrelation, cepstrum, subharmonic-to-harmonic ratio (SHR)). The relationship holds independently of the tool used to extract F0. It states that the envelope of the residual of the instantaneous frequency, defined as the difference between IF and the maximum of harmonics, tends to F0. Such a direct relationship may be useful for further developments of F0 extraction directly from the speech signal, avoiding the approximation that exists in most pitch extraction techniques.
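
The IF definition used in this abstract (time-derivative of the phase of the analytic signal) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the naive O(N^2) DFT (chosen only to keep the example dependency-free) and the discrete first-difference approximation of the derivative are all assumptions made for illustration.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal of a real sequence via a naive O(N^2) DFT.

    Illustrative only: suitable for short frames, not for long recordings.
    """
    n = len(x)
    # Forward DFT of the real input.
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Keep DC (and Nyquist when n is even), double the positive
    # frequencies, and zero the negative ones.
    gain = [0.0] * n
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        last_positive = n // 2 - 1
    else:
        last_positive = (n - 1) // 2
    for k in range(1, last_positive + 1):
        gain[k] = 2.0
    # Inverse DFT of the one-sided spectrum.
    return [
        sum(gain[k] * spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
            for k in range(n)) / n
        for t in range(n)
    ]


def instantaneous_frequency(x, fs):
    """Instantaneous frequency in Hz: first difference of the unwrapped phase."""
    phase = [cmath.phase(z) for z in analytic_signal(x)]
    # Unwrap: remove jumps of 2*pi between consecutive samples.
    unwrapped = [phase[0]]
    for p in phase[1:]:
        step = p - unwrapped[-1]
        step -= 2 * math.pi * round(step / (2 * math.pi))
        unwrapped.append(unwrapped[-1] + step)
    # Discrete derivative, converted from rad/sample to Hz.
    return [
        (unwrapped[i + 1] - unwrapped[i]) * fs / (2 * math.pi)
        for i in range(len(unwrapped) - 1)
    ]
```

For a pure tone spanning a whole number of periods, the returned values stay at the tone frequency. On voiced speech the IF fluctuates around the dominant harmonic, which is the behaviour the residual described in the abstract builds on.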
Speech synthesis quality depends on its naturalness and intelligibility. These abstract concepts are the concern of phonology. In terms of phonetics, they are transmitted by prosodic components, mainly the fundamental frequency (F0) contour. F0 contour modeling is performed either by setting rules or by investigating databases, with or without parameters, and following either a sequential path in time or a parallel, superpositional scheme. In this study, we opted to model the F0 contour for Arabic using the Fujisaki parameters, to be trained by neural networks. Statistical evaluation was carried out to measure the accuracy of the predicted parameters and the closeness of the synthesized F0 contour to the natural one. Findings concerning the adoption of Fujisaki parameters for Arabic F0 contour modeling in text-to-speech synthesis are discussed.
Sound duration is responsible for rhythm and speech rate. Furthermore, in some languages phoneme length is an important phonetic and prosodic factor. For example, in Arabic, gemination and vowel quantity are two important characteristics of the language. Therefore, accurate duration modelling is crucial for Arabic TTS systems. This paper aims to improve the modelling of phone duration for Arabic statistical parametric speech synthesis using DNN-based models. In recent years, DNNs have frequently been used for parametric speech synthesis instead of HMMs. Therefore, several variants of DNN-based duration models for Arabic are investigated. The novelty consists in training a specific DNN model for each class of sounds, i.e., short vowels, long vowels, simple consonants and geminated consonants. The main idea behind this choice is the improvement that we already achieved in the quality of Arabic parametric speech synthesis by the introduction of two specific features...
This paper investigates statistical parametric speech synthesis of Modern Standard Arabic (MSA). A Hidden Markov Model (HMM)-based speech synthesis system relies on a description of speech segments corresponding to phonemes, with a large set of features that represent phonetic, phonological, linguistic and contextual aspects. When applied to MSA, two specific phenomena have to be taken into account: vowel lengthening and consonant gemination. This paper thoroughly studies the modeling of these phenomena through various approaches, for example the use of different units for modeling short vs. long vowels and the use of different units for modeling simple vs. geminated consonants. These approaches are compared to another one which merges short and long variants of a vowel into a single unit, and simple and geminated variants of a consonant into a single unit (these characteristics being handled through the features associated with the sound). Results of subjective evaluation show ...
Spectrogram inversion, or phase retrieval, is an old topic in digital signal processing which has been revisited in recent years for its proven relevance to many recent applications, such as source separation, speech enhancement and compressive sensing. Spectrogram inversion aims to reconstruct a signal from partial spectral information, such as the magnitude spectrum or the phase spectrum only, which are obtained by the short-time Fourier transform (STFT). Thus, in this work the relevance of signal reconstruction is studied. First, the proposed algorithm, based on recent theoretical relationships between STFT magnitude and phase, is presented. Secondly, the proposed method is tested on clean and simulated-noisy speech. Finally, the relevance of spectrogram inversion, as implemented either in our proposal or in state-of-the-art algorithms, is evaluated for the particular application of speech enhancement. The results show the advantages and the limits of using spectrogram inversion i...
Road safety has always been a major concern, involving a variety of competences, ranging from government and local authorities to medical caregivers and other service providers. Prompt intervention in emergency cases is one of the key factors in minimizing damage. Therefore, real-time surveillance is proposed as an efficient means to detect problems on roads. Video surveillance alone is not enough to detect serious accidents, since any hazardous behavior on the road may be confused with an accident, which may lead to many false alarms. Instead, audio processing has the potential to recognize sounds coming from different sources, such as crashes, tire skidding, harsh braking, etc. In recent years, deep learning has become the state of the art for audio event detection. However, the usual dominance of event-free data in road surveillance would bias the training process. Therefore, a novel method to initialize the neural network's weights, using an autoencoder trained only on event-related data, is used to balance the data distribution.
This paper describes a gemination prediction model for Arabic consonants, based on deep neural networks (DNN). Despite the importance of gemination for understanding the correct meaning of a word, the gemination sign (shadda) is very often omitted in Modern Standard Arabic printed/typed texts, which generates errors in automatic text applications, such as text-to-speech synthesis and automatic translation. Therefore, gemination prediction for Arabic consonants is implemented as part of an automatic diacritization module for DNN-based Arabic text-to-speech synthesis. Different DNN models were trained using feedforward and recurrent architectures. The reported results show the ability of recurrent DNNs to detect, with very high accuracy, the consonants which have to be geminated in a non-diacritized Arabic text.
Surveillance systems are increasingly exploiting multimodal information for improved effectiveness. This paper presents an audio event detection method for road traffic surveillance, combining generative deep autoencoders and fuzzy modelling to perform anomaly detection. Baseline deep autoencoders are used to compute the reconstruction error of each audio segment, which provides a primary estimation of outlierness. To account for the uncertainty associated with this decision-making step, an interval type-2 fuzzy membership function composed of an optimistic/upper component and a pessimistic/lower component is used. The final class attribution employs a probabilistic method for interval comparison. Evaluation results obtained after defuzzification show that, with a careful parameter setting, the proposed membership function effectively improves the performance of the baseline autoencoder, and performs better than the state-of-the-art one-class SVM in anomaly detection.
Arabic text-to-speech synthesis needs to be developed in order to be integrated into many IT applications, like email and SMS reading, automatic information delivery and helping disabled people to use such sophisticated services. However, a standalone text-to-speech system needs automatic generation of prosody, including F0 contour prediction. Thus, the F0 contour is linked to the text data via the Fujisaki model, which divides the F0 contour into phrase and accent components. Furthermore, the parametric structure of the Fujisaki model reduces the problem to the estimation of parameters. Hence, regression techniques, such as MARS, are useful to map the text-retrieved features to the speech-signal-extracted parameters. Then, the overall F0 contour is reconstructed and compared to the original one, to validate the model.
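
The decomposition into phrase and accent components can be made concrete with a small sketch of the standard Fujisaki formulation, in which log F0 is a baseline plus phrase-command impulse responses plus accent-command step responses. The function names and the default values of `alpha`, `beta` and `gamma` are illustrative assumptions, not parameters reported for the Arabic model described here.

```python
import math


def phrase_response(t, alpha):
    """Impulse response of the phrase control mechanism (zero before onset)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t, beta, gamma):
    """Step response of the accent control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, fb, phrases, accents, alpha=3.0, beta=20.0, gamma=0.9):
    """F0 in Hz at time t (seconds).

    fb      -- baseline frequency in Hz
    phrases -- list of (amplitude, onset_time) phrase commands
    accents -- list of (amplitude, onset_time, offset_time) accent commands
    alpha, beta, gamma -- illustrative defaults (assumed, not from the paper)
    """
    log_f0 = math.log(fb)
    for amplitude, onset in phrases:
        log_f0 += amplitude * phrase_response(t - onset, alpha)
    for amplitude, onset, offset in accents:
        log_f0 += amplitude * (
            accent_response(t - onset, beta, gamma)
            - accent_response(t - offset, beta, gamma)
        )
    return math.exp(log_f0)
```

Under this formulation, predicting a contour from text amounts to predicting the command amplitudes and timings, which is the parameter estimation problem the abstract assigns to regression.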
In this paper, a novel pitch detection algorithm (PDA) is presented. Though pitch detection is a classical problem that has been investigated since the very beginning of speech processing, the proposed algorithm is based on a novel approach relying on a proposed empirical relationship between fundamental frequency (f0) and instantaneous frequency (fi). Basically, f0 is defined for periodic signals only, whereas fi can be calculated for any type of signal using the Hilbert transform. Notwithstanding this substantial difference, the relationship described in this paper shows some interaction between them, at least empirically. Once this relationship was validated on a large set of speech signals, it was exploited to implement an algorithm in order to (a) detect voiced parts of speech and (b) extract the f0 contour from the fi pattern in the voiced regions. The results of the proposed method were compared to those of some well-rated state-of-the-art PDAs of different backgrounds, to show that the quality of pitch detection yielded by the proposed approach is quite satisfactory, both in clean and simulated noisy speech.
Surveillance systems are becoming increasingly multimodal. The availability of audio motivates a method for anomalous audio event detection (anomalous AED) for road traffic surveillance, which is proposed in this paper. The method is based on combining anomaly detection techniques, such as reconstruction deep autoencoders and fuzzy membership functions. A baseline deep autoencoder is used to compute the reconstruction error of each audio segment. The comparison of this error to a preset threshold provides a primary estimation of outlierness. To account for the uncertainty associated with this decision-making step, a fuzzy membership function composed of an optimistic/upper component and a pessimistic/lower component is used. Evaluation results obtained after defuzzification show that, with a careful parameter setting, the proposed membership function improves the performance of the baseline autoencoder for anomaly detection, and yields results better than or at least similar to those of other state-of-the-art anomaly detection methods such as one-class SVM.
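
The pipeline outlined in this abstract (reconstruction error, comparison to a threshold, upper/lower fuzzy membership, defuzzification) can be sketched as follows. The abstract does not specify the membership function, so everything here is a hypothetical stand-in: the sigmoid shape, the `delta` band around the threshold, the `scale` parameter and the midpoint defuzzification rule are assumptions chosen only to show the overall flow, and the autoencoder itself is replaced by a given reconstruction.

```python
import math


def reconstruction_error(segment, reconstruction):
    """Mean squared error between a feature vector and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(segment, reconstruction)) / len(segment)


def _sigmoid(x):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + math.tanh(0.5 * x))


def interval_membership(error, threshold, delta, scale):
    """Hypothetical (lower, upper) membership in the 'anomalous' class.

    The pessimistic/lower bound uses a stricter threshold (threshold + delta),
    the optimistic/upper bound a looser one (threshold - delta).
    """
    lower = _sigmoid((error - (threshold + delta)) / scale)
    upper = _sigmoid((error - (threshold - delta)) / scale)
    return lower, upper


def is_anomalous(error, threshold, delta=0.2, scale=0.1):
    """Defuzzify by taking the interval midpoint (an assumed rule)."""
    lower, upper = interval_membership(error, threshold, delta, scale)
    return 0.5 * (lower + upper) > 0.5
```

With a symmetric band, as assumed here, the midpoint rule agrees with the crisp threshold; what the interval adds is its width, which is largest for errors close to the threshold and can be used to flag borderline segments.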