Research article (Open Access)

AraSpot: Arabic Spoken Command Spotting

Published: 12 July 2024

Abstract

Spoken keyword spotting is the task of identifying a keyword in an audio stream and is widely used in smart devices at the edge to activate voice assistants and perform hands-free tasks. The task is challenging because such systems must achieve high accuracy while continuing to run efficiently on low-power devices with possibly limited computational capabilities. This work presents AraSpot for Arabic keyword spotting trained on 40 Arabic keywords, using different online data augmentation techniques and introducing the ConformerGRU model architecture. Finally, we further improve the performance of the model by training a text-to-speech model for synthetic data generation. AraSpot achieved a state-of-the-art accuracy of 99.59%, outperforming previous approaches.1

1 Introduction

Automatic Speech Recognition (ASR) is a fast-growing technology that has been attracting increased interest due to its embedment in a myriad of devices. ASR allows users to activate voice assistants and perform hands-free tasks by detecting a stream of input speech and converting it into its corresponding text. Spoken keyword spotting (KWS) is similar to the ASR problem but is mostly concerned with the identification of predefined keywords in continuous speech [34]. In fact, keyword spotting systems are common components in speech-enabled devices [16] and have a wide range of applications such as speech data mining, audio indexing, phone call routing, and many others [8].
Recently, many model families have become popular for KWS, including Convolutional Neural Networks (CNNs), Residual Networks (ResNets), and Recurrent Neural Networks (RNNs). The disadvantage of CNNs is that they do not handle sequences well: they are usually unable to capture long-term dependencies in the human speech signal, and the same holds for ResNets, which are short-sighted due to their limited receptive field. Conversely, recurrent neural networks directly model the input sequence without learning the local structure between successive time and frequency steps [41].
The Google Speech Commands (GSC) dataset [37] is the de facto KWS benchmark for English. Unfortunately, KWS has considerably less publicly available data than ASR, so training a neural network becomes harder given the scarcity of available data [25]. To overcome data scarcity for KWS, many researchers use pre-trained models and synthesized data, such as in Reference [15].
Most KWS research has focused on English and Asian languages, with little research investigating KWS in Arabic, despite the fact that Arabic is the fourth-most used language on the internet [1, 5]. This study introduces AraSpot for Arabic command spotting, leveraging the ASC dataset published in Reference [9]. We explore various online data augmentation techniques to model diverse environmental conditions, thereby enhancing and expanding the dataset. The proposed approach introduces a ConformerGRU model architecture to address the short- and long-term dependency issues of RNNs and CNNs. We demonstrate, based on empirical evidence, that our proposed model architecture surpasses all previous approaches on this dataset. Furthermore, we enhance model performance by augmenting the training data with additional speakers through synthetic data generation. To our knowledge, this study is the first to implement the Conformer architecture with a Gated Recurrent Unit (GRU) layer for KWS on the ASC dataset, while also incorporating synthetic data generation techniques.
This article is organized as follows. Section 2 presents a literature review, Section 3 describes the dataset, and Section 4 details our methodology. Section 5 presents the experiments and results, and, last, Section 6 concludes with a summary of potential future work.

2 Related Work

Keyword spotting has received a considerable amount of interest from the research community. One of the earliest approaches is based on large-vocabulary continuous speech recognition (LVCSR). In such systems, the speech signal is first decoded and the generated lattices are then searched for the keyword/filler [6, 17, 38]. An alternative to LVCSR is the keyword Hidden Markov Model (HMM), where a keyword HMM and a filler HMM are trained to model keyword and non-keyword audio segments [22, 23].
With the rise of GPU computational power and the increase in data availability, the research community switched gears toward deep learning-based KWS systems. For example, Coucke et al. [7] used the dilated convolutions of the WaveNet architecture and showed that the results were more robust in the presence of noise than LSTM- or CNN-based models. Arik et al. [2] proposed a single-layer CNN and two-layer RNNs; similarly, two gated CNNs with a one-layer bi-directional LSTM were proposed in Reference [35]. An attention-based end-to-end model for small-footprint KWS was proposed in Reference [26]. To overcome KWS data scarcity, Sun et al. [31] used transfer learning by training an ASR system and fine-tuning its acoustic model on the KWS task.
Lin et al. [15] showed that building a state-of-the-art (SOTA) KWS model requires more than 4,000 utterances per command. The authors also noted that, given the various limitations and difficulties in acquiring more data, methods to enlarge and expand the training data are required. This problem was alleviated in References [14, 24] by using synthesized speech as a data augmentation approach, where a text-to-speech system generates the synthetic speech. Furthermore, to enhance model robustness against different noisy environments, artificial data corruption, by adding reverberated music or TV/movie audio to each utterance at a certain speech-to-interference ratio, was used in Reference [21]. The study in Reference [13] explores the influence of data augmentation on speech recognition performance through the generation of far-field data using simulated and real room impulse responses (RIRs), specifically utilizing reverberation techniques. Moreover, the room simulator developed in Reference [12] is used to generate large-scale simulated data for training deep neural networks for far-field speech recognition; this simulation-based approach was employed in the Google Home product and brought significant performance improvements.
For Arabic, Ghandoura et al. [9] recorded and published a benchmark that includes 40 commands recorded by 30 different speakers. The authors achieved 97.97% accuracy using a deep CNN model and applied different data augmentation techniques to increase data diversity. Benamer and Alkishriwo [4] published another benchmark that includes 16 commands but used an LSTM model instead. Furthermore, a keyword spotting system was presented in Reference [3] to perform audio searching of uttered words in Arabic speech.

3 Dataset Description

The ASC dataset [9] includes 12,000 pairs of 1-second-long audio files and corresponding keywords, covering 40 keywords in total. Each keyword has 300 audio files recorded by 30 participants, each providing 10 utterances per keyword, for a total dataset size of 384 MB. Some of the keywords were inspired by the GSC dataset [37], while the remaining commands were selected so they could be grouped into broad and potentially overlapping categories. Audio files use a 16 kHz sampling rate, 16 bits per sample, a mono signal, and the .wav format. The dataset is in standard Arabic, and all recordings were made using a laptop with an external microphone in a quiet environment. The keywords were chosen to activate voice assistants and perform hands-free tasks for applications and devices such as a simple photo browser or a keypad [9]. Table 1 lists the English translations of the 40 keywords in the dataset. It should be noted that, compared with GSC, the ASC dataset has fewer utterances per class but cleaner data quality due to manual segmentation.
Table 1. The 40 commands in the ASC dataset, listed by their English translations (the Arabic keyword renderings appear as images in the original article and are not reproduced here): Zero, One, Two, Three, Four, Five, Six, Seven, Eight, Nine, Right, Left, Up, Down, Front, Back, Yes, No, Start, Stop, Enable, Disable, Ok, Cancel, Open, Close, Zoom in, Zoom out, Previous, Next, Send, Receive, Move, Rotate, Record, Enter, Digit, Direction, Options, Undo.

4 Solution Approach

4.1 Data Augmentation

The core idea of data augmentation is to generate additional synthetic data to improve data diversity and cover a comprehensive range of conditions that could be present in unseen instances. The augmented data are typically viewed as belonging to a distribution close to the original one [39], while the resulting augmented examples can still be semantically described by the labels of the original input examples, which is known as a label-preserving transformation. Augmented data are normally generated on the fly during the training process, which is known as online augmentation. An alternative is offline augmentation [29], which transforms the data beforehand and stores them in memory.
For this work, we apply on-the-fly data augmentation in both the time domain and the frequency domain. Let \(F_t=\lbrace f_1, f_2, \ldots ,f_Q\rbrace\) and \(F_f=\lbrace f_1, f_2, \ldots ,f_V\rbrace\) be sets of pre-defined time-domain and frequency-domain transformation/augmentation functions. For a given input speech signal \(x_i\), we first apply the chosen time-domain augmentations \(\tilde{F}_{t}^{i}\) for the \(i{\rm th}\) signal, and then, after transforming the augmented signal into the frequency domain, we apply the chosen frequency-domain augmentations \(\tilde{F}_{f}^{i}\),
\begin{equation} \tilde{F_{t}^{i}} = \lbrace f_q: r_{q}^{i} \ge \lambda , 1 \le q \le Q\rbrace , \end{equation}
(1)
\begin{equation} \tilde{F_{f}^{i}} = \lbrace f_v: r_{v}^{i} \ge \gamma , 1 \le v \le V\rbrace , \end{equation}
(2)
where \(r_{v}^{i}\) and \(r_{q}^{i}\) represent values uniformly sampled from \([0, 1]\) at each training step for each augmentation operation, and \(\tilde{F}_{t}^{i}\) and \(\tilde{F}_{f}^{i}\) denote the time-domain and frequency-domain functions, with the operation order shuffled per domain, applied on the \(i{\rm th}\) input signal at a given training step. Finally, \(\lambda\) and \(\gamma\) denote the time-domain and frequency-domain augmentation rates, ensuring that any signal can receive any possible augmentation combination in different orders from one epoch to the next.
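As a concrete illustration, the per-sample selection of Equations (1) and (2) could be implemented roughly as follows. This is a minimal Python sketch under the assumptions above; the names select_and_apply, F_t, F_f, lam, gamma, and to_spectrogram are hypothetical and not part of the authors' code.

import random

def select_and_apply(x, funcs, rate, rng=random):
    """Apply each augmentation f whose sampled r >= rate (Equations (1)-(2)),
    with the operation order re-shuffled for every sample."""
    chosen = [f for f in funcs if rng.random() >= rate]
    rng.shuffle(chosen)              # per-sample operation-order shuffling
    for f in chosen:
        x = f(x)
    return x

# Hypothetical use inside a dataset's __getitem__:
#   waveform = select_and_apply(waveform, F_t, lam)     # time-domain augmentations
#   spec = to_spectrogram(waveform)
#   spec = select_and_apply(spec, F_f, gamma)           # frequency-domain augmentations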
For a given speech signal X in the time domain, the following time-domain augmentation methods are used as the items of \(F_t\) (a combined code sketch of all four methods is given after the list):
(1)
Urban Background Noise Injection: We used noise injection similar to Reference [9], but with the test set of the Freesound data published in Reference [8]. We first concatenated all existing K noise audios into a single noise signal \(\mathcal {N}\) and then applied the augmentation process as follows:
\begin{equation} m \sim unif(0, T_{n}), \end{equation}
(3)
\begin{equation} n \sim unif(m, min(T_{n}, m + T_{s})), \end{equation}
(4)
\begin{equation} f \sim unif(0, T_s - n + m - 1), \end{equation}
(5)
\begin{equation} \xi = [0]_{f} \parallel (\mathcal {N}_{i})_{m \le i \lt n} \parallel [0]_{T_s - f - n + m }, \end{equation}
(6)
\begin{equation} \acute{X}=\mathcal {G} \xi + X, \end{equation}
(7)
where \(T_{s} = \mid X\mid\) and \(T_{n} = \mid \mathcal {N}\mid\), m and n represent the start and end of the noise segment in \(\mathcal {N}\), and f denotes the degree of freedom ensuring variability in the starting point of the addition for the same audio across different steps. Additionally, \(\xi\) denotes the noise segment and \(\parallel\) signifies the concatenation operation, where the selected noise chunk is padded with f leading zeros and \(T_{s} - f - n + m\) trailing zeros. It should be noted that \(\acute{X}\) represents the augmented version of X and \(\mathcal {G}\) denotes a random gain between 0 and 1.
(2)
Speech Reverberation: Speech reverberation is caused by the environment surrounding the source, where the signal received by the input device (i.e., a microphone) is the sum of multiple shifted and attenuated copies of the same original signal [40]. Speech reverberation can be simulated by convolving the original input speech signal with a room impulse response (RIR). For this purpose, we used the RIR datasets created and published in References [11] and [32].
Let \(H=\lbrace h_1, h_2,\ldots ,h_R\rbrace\) be the set of all available impulse responses, each of 1-second length. For a given speech signal X, the augmentation process is as follows:
\begin{equation} h \sim unif(H), \end{equation}
(8)
\begin{equation} l \sim unif(a, b), \end{equation}
(9)
\begin{equation} \acute{X}= X \ast (h_i)_{0 \le i \le l}, \end{equation}
(10)
\begin{equation} \acute{X}[n]= \sum _{i=0}^{l} h[i]X[n - i], \end{equation}
(11)
where l is the speech reverberation length, the \(\ast\) symbol in Equation (10) is the convolution operation, and a and b are the minimum and maximum reverberation lengths; we set a to 31 ms and b to 250 ms.
(3)
Random Volume Gain: Similarly to the work done in Reference [9], for a given signal X, the magnitude of the signal is multiplied by a random gain \(\mathcal {G}\) as follows:
\begin{equation} \acute{X} = \mathcal {G} X, \end{equation}
(12)
where \(\mathcal {G}\) is a random value between 0.2 and 2.
(4)
Random Fade In/Out: Given a speech signal X, we multiply the magnitudes of the signal by a fade signal whose shape is sampled uniformly from linear, exponential, logarithmic, quarter-sine, and half-sine fades. The length of the fade signal is chosen randomly between 0 and \(\mid X \mid\) and padded with ones to match the length of the original waveform X. Formally,
\begin{equation} \acute{X} = F_{in} F_{out} X, \end{equation}
(13)
where \(F_{in}\) is the fade-in signal and \(F_{out}\) is the fade-out signal.
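The four time-domain augmentations above could be sketched as follows. This is a minimal NumPy illustration under the stated assumptions, not the authors' implementation; in particular, the exact fade-shape formulas and the use of a random generator with these bounds are assumptions.

import numpy as np

rng = np.random.default_rng()

def inject_urban_noise(x, noise):
    """Equations (3)-(7): add a randomly placed chunk of the concatenated
    noise signal N to the waveform x, scaled by a random gain in [0, 1]."""
    T_s, T_n = len(x), len(noise)
    m = int(rng.integers(0, T_n))                     # start of noise chunk
    n = int(rng.integers(m, min(T_n, m + T_s) + 1))   # end of noise chunk
    f = int(rng.integers(0, T_s - (n - m) + 1))       # random placement offset
    xi = np.zeros(T_s, dtype=float)
    xi[f:f + (n - m)] = noise[m:n]                    # zero-padded noise segment
    return rng.uniform(0.0, 1.0) * xi + x

def reverberate(x, rirs, sr=16000, a_ms=31, b_ms=250):
    """Equations (8)-(11): convolve x with a randomly chosen, randomly
    truncated room impulse response."""
    h = rirs[int(rng.integers(len(rirs)))]            # h ~ unif(H)
    l = int(rng.integers(int(a_ms * sr / 1000), int(b_ms * sr / 1000) + 1))
    return np.convolve(x, h[:l], mode="full")[: len(x)]

def random_gain(x):
    """Equation (12): scale the waveform by a random gain in [0.2, 2]."""
    return rng.uniform(0.2, 2.0) * x

FADE_SHAPES = {                                       # assumed shape formulas
    "linear": lambda t: t,
    "exponential": lambda t: np.expm1(t) / np.expm1(1.0),
    "logarithmic": lambda t: np.log1p(t) / np.log(2.0),
    "quarter_sine": lambda t: np.sin(0.5 * np.pi * t),
    "half_sine": lambda t: 0.5 * (1.0 - np.cos(np.pi * t)),
}

def _ramp(length):
    shape = FADE_SHAPES[str(rng.choice(list(FADE_SHAPES)))]
    return shape(np.linspace(0.0, 1.0, length))

def random_fade(x):
    """Equation (13): apply a fade-in and a fade-out of random length and
    shape, each padded with ones to the full signal length."""
    n_in = int(rng.integers(0, len(x) + 1))
    n_out = int(rng.integers(0, len(x) + 1))
    f_in = np.concatenate([_ramp(n_in), np.ones(len(x) - n_in)])
    f_out = np.concatenate([np.ones(len(x) - n_out), _ramp(n_out)[::-1]])
    return f_in * f_out * x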
For a given signal X in the frequency domain, spectrogram-based augmentation is applied as proposed in Reference [18]; for this work, we mainly used time and frequency masking as the items of \(F_f\).
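Time and frequency masking of this kind are available off the shelf; a minimal sketch using torchaudio's masking transforms is shown below. The mask widths are assumptions, as the paper does not report them.

import torch
import torchaudio

freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=8)   # assumed width
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=12)       # assumed width

def spec_augment(spec: torch.Tensor) -> torch.Tensor:
    """Apply SpecAugment-style frequency and time masking [18] to a
    (channel, freq, time) feature tensor."""
    return time_mask(freq_mask(spec))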

4.2 Synthetic Data Generation Using Text-to-Speech

End-to-end Text-to-Speech (TTS) systems generate speech directly from a given text, unlike traditional TTS systems that rely on complex pipelines. Seq2Seq-based TTS systems such as those in References [2, 27, 30, 36] are commonly composed of an encoder, a decoder, and an attention mechanism, such that the character embeddings are projected into a Mel-scale spectrogram, followed by a vocoder that converts the predicted Mel-scale spectrogram into a waveform.
In this work, we use Tacotron 2 [36], which has a relatively simple architecture. The model consists of an encoder and a decoder with attention. The encoder takes the input character/phoneme sequence C and projects it into a high-level representation h, and the decoder with attention then generates Mel-scale spectrogram frames by attending over h and conditioning on the previously predicted frames.
We used the same setup as the original Tacotron 2 paper [36], with WaveGlow [20] as the vocoder, and trained the TTS on the Arabic Common Voice dataset.2 The data were filtered to keep the 10 speakers with the highest number of utterances and relatively high recording quality. This was done because most speakers in the dataset have only a few utterances, and training the model on a small number of recordings per speaker leads to inconsistency and unintelligible generated speech.
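The speaker filtering described above could be reproduced along the following lines. This is a minimal sketch assuming the standard Common Voice TSV layout (client_id, path, sentence columns); the file paths are hypothetical, and the additional quality filtering is not shown.

import pandas as pd

# Keep only the 10 speakers with the most validated utterances in Arabic Common Voice.
cv = pd.read_csv("cv-corpus/ar/validated.tsv", sep="\t")
top_speakers = cv["client_id"].value_counts().head(10).index
tts_train = cv[cv["client_id"].isin(top_speakers)][["client_id", "path", "sentence"]]
tts_train.to_csv("tts_train_top10.tsv", sep="\t", index=False)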

4.3 ConformerGRU Model

CNNs and RNNs each have their own advantages and limitations: while CNNs exploit local information and local dependencies, RNNs exploit long-term information and dependencies.
The Conformer architecture, introduced in Reference [10], has gained considerable attention in various speech recognition applications, including those mentioned in References [19, 28, 42]. This popularity is attributed to its capability, outlined in Reference [10], to effectively capture both long- and short-term dependencies. This is achieved by fusing the multi-head self-attention of the Transformer architecture [33] with convolutional neural networks. Consequently, the resulting model is adept at modeling both local and global dependencies.
To generate a latent vector representing the entire input speech sequence, we employ a bidirectional GRU layer and concatenate the last hidden vectors of the forward and backward directions, treating the resulting vector as the latent representation of the input sequence. We therefore combine the Conformer block with a GRU layer, as described next.
Given a dataset \(\mathcal {D}=\lbrace (x_1, y_1), (x_2, y_2),\ldots ,(x_N, y_N)\rbrace\), where \(x_i\) and \(y_i\) are the \(i{\rm th}\) input example and the target label, respectively, the objective is to model \(P(Y \mid X)\) using a function \(f_\theta\) that maximizes the following objective function:
\begin{equation} \max _{\theta } \prod _{i=1}^{N} P(y_i \mid x_i;\theta), \end{equation}
(14)
\begin{equation} \min _{\theta } \sum _{i=1}^{N} -log(P(y_i \mid x_i;\theta)). \end{equation}
(15)
To model \(f_\theta\), we propose the ConformerGRU model, which consists of the following layers, with the full architecture shown in Figure 1 and a minimal implementation sketch given after the list:
Fig. 1. ConformerGRU model architecture.
(1)
a Pre-net Layer that projects the speech feature space into a higher-level representation;
(2)
a Conformer Block consisting of multiple Conformer layers, which enables the model to handle long- and short-term information dependencies;
(3)
a single GRU layer that acts as an aggregation function, instead of using the sum or average of the hidden states or only the first hidden state; and
(4)
a Post-net Layer of two modules, where the first is a simple projection layer and the second is a prediction layer with a softmax activation function.
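The sketch below shows how such a model could be assembled in PyTorch on top of torchaudio's Conformer implementation. It is a minimal illustration of the four layers above, not the authors' code: hyperparameters the paper does not report (feed-forward dimension, convolution kernel size) and the 41-class output (40 keywords plus the noise/NULL label introduced in Section 5.1) are assumptions, and padding is not masked before the GRU for simplicity.

import torch
import torch.nn as nn
import torchaudio

class ConformerGRU(nn.Module):
    """Minimal sketch of the ConformerGRU architecture described above."""

    def __init__(self, n_mfcc=40, d_model=128, n_heads=2, n_layers=2,
                 n_classes=41, dropout=0.15):
        super().__init__()
        self.pre_net = nn.Linear(n_mfcc, d_model)                  # (1) Pre-net projection
        self.conformer = torchaudio.models.Conformer(              # (2) Conformer block
            input_dim=d_model, num_heads=n_heads, ffn_dim=4 * d_model,
            num_layers=n_layers, depthwise_conv_kernel_size=31, dropout=dropout)
        self.gru = nn.GRU(d_model, d_model, batch_first=True,      # (3) GRU aggregator
                          bidirectional=True)
        self.post_net = nn.Sequential(                             # (4) Post-net
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(d_model, n_classes), nn.LogSoftmax(dim=-1))

    def forward(self, mfcc, lengths):
        # mfcc: (batch, time, n_mfcc); lengths: (batch,) number of valid frames
        x = self.pre_net(mfcc)
        x, lengths = self.conformer(x, lengths)
        _, h_n = self.gru(x)                                       # h_n: (2, batch, d_model)
        latent = torch.cat([h_n[0], h_n[1]], dim=-1)               # forward ++ backward states
        return self.post_net(latent)                               # log-probabilities

Since the Post-net ends in a log-softmax, such a model can be trained directly with the negative log-likelihood objective of Equation (15).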

5 Experiments and Results

5.1 Experiments Setup

Let the data be \(\mathcal {D}=\lbrace (x_{1}, y_{1}), (x_{2}, y_{2}),\ldots ,(x_{N}, y_{N})\rbrace\), such that \(x_{i}\) and \(y_{i}\) are the speech signal and the target label/command, respectively. Let \(y_{i}\in {Y}\) and \(x_{i}\in \mathbb {R}^{C \times S}\), where Y is the set of all unique labels, C is the number of channels, and S is the number of speech samples in the utterance. We added an extra label to represent the noise/NULL class. Thus, 300 noise audios were generated and split into 60% for training, 20% for validation, and 20% for testing, using the same noise audio and similar criteria as in Reference [9].
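As a small illustration, the 60/20/20 split of the generated noise clips could look like the following sketch; the file names and random seed are hypothetical.

import random

noise_files = [f"noise_{i:03d}.wav" for i in range(300)]   # 300 generated noise clips
random.Random(0).shuffle(noise_files)
train, val, test = noise_files[:180], noise_files[180:240], noise_files[240:]  # 60/20/20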
All the synthetic data generated by the text-to-speech system described in Section 4.2, for all speakers, were added to the training data. Furthermore, the online augmentation described in Section 4.1 was applied during training, and no offline augmentation was used.
For all experiments, we extracted 40 Mel-frequency cepstral coefficient (MFCC) features computed using a 25-ms window, a 10-ms stride, and 80-channel Mel filter banks.
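This feature extraction maps directly onto torchaudio's MFCC transform; the sketch below assumes 16 kHz audio, and the FFT size and file name are assumptions.

import torchaudio

waveform, sr = torchaudio.load("example.wav")        # (channel, samples) at 16 kHz
mfcc = torchaudio.transforms.MFCC(
    sample_rate=16000, n_mfcc=40,
    melkwargs={"n_fft": 400, "win_length": 400,      # 25 ms window at 16 kHz
               "hop_length": 160, "n_mels": 80})     # 10 ms stride, 80 filter banks
features = mfcc(waveform)                            # (channel, 40, frames)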
We used the negative log-likelihood loss and the Adam optimizer with linear learning rate decay, as shown in Equation (17), where \(lr_0\) is the initial learning rate (set to \(10^{-3}\) for all experiments), \(e \in [0, E)\) is the current epoch, and E is the total number of epochs. Last, a dropout ratio of 15% is used for regularization.
We trained all models on a single machine with a single NVIDIA 3080 Ti GPU and a batch size of 256. Since the data are balanced across all labels, we used the accuracy shown in Equation (16) as the metric to measure performance across all experiments, where \(\hat{y}_i\) is the predicted class for the \(i{\rm th}\) example,
\begin{equation} Accuracy = \frac{1}{N} \sum _{i=1}^N \mathbb {1}(\hat{y}_i = y_i) \times 100\%, \end{equation}
(16)
\begin{equation} lr(e, E) = lr_0 \left(1 - \frac{e}{E}\right). \end{equation}
(17)
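The training configuration above could be wired up as in the following sketch. The total number of epochs E and the use of a LambdaLR scheduler for the linear decay of Equation (17) are illustrative choices, and ConformerGRU refers to the sketch in Section 4.3.

import torch

model = ConformerGRU()                                 # sketch from Section 4.3
criterion = torch.nn.NLLLoss()                         # negative log-likelihood, Eq. (15)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
E = 100                                                # total epochs (assumed value)
scheduler = torch.optim.lr_scheduler.LambdaLR(         # lr(e, E) = lr_0 * (1 - e / E), Eq. (17)
    optimizer, lr_lambda=lambda e: 1.0 - e / E)
# scheduler.step() is called once per epoch after the training loop body.

def accuracy(log_probs: torch.Tensor, targets: torch.Tensor) -> float:
    """Equation (16): mean exact-match accuracy in percent."""
    return (log_probs.argmax(dim=-1) == targets).float().mean().item() * 100.0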

5.2 Results

Multiple experiments were conducted to assess the impact of different parameters on accuracy; specifically, we examined how changes in the number of Conformer layers, the number of self-attention heads, and the model dimensionality affect the system's performance.
We first examine the performance when using only the data augmentation detailed in Section 4.1, without additional synthetic data generation. As shown in Table 2, increasing the model's dimensionality enhances performance, while a higher number of attention heads does not yield improved results. The additional attention heads did not lead to further improvements because they failed to provide new or useful information beyond what was already captured by the existing attention mechanism, and this redundancy results in diminishing returns.
Table 2.
\(d_{model}\)  h  N  ACC (%)  #Params
64   4  2  98.21   234K
64   4  1  97.19   165K
64   2  2  98.5*   234K
64   2  1  97.64   165K
96   4  2  98.78   511K
96   4  1  98.17   358K
96   2  2  99.1*   511K
96   2  1  98.17   358K
128  4  2  99.17   895K
128  4  1  98.7    625K
128  2  2  99.35*  895K
128  2  1  98.61   625K
Table 2. Results obtained by training the model on the original training data only, where \(d_{model}\) is the model dimensionality, h is the number of attention heads, N is the number of Conformer layers, ACC is the accuracy, and #Params is the number of model parameters. An asterisk marks the best accuracy for each \(d_{model}\).
In addition, for any given model dimensionality \(d_{model}\) and number of self-attention heads h, using more Conformer layers (i.e., \(N=2\)) always gives higher accuracy.
The introduction of synthetic data through TTS significantly enhanced the model performance in all scenarios, as evident in Table 3 and Figure 2.
Table 3.
\(d_{model}\)  h  N  ACC (%)  #Params
64   4  2  98.41   234K
64   4  1  98.01   165K
64   2  2  98.66*  234K
64   2  1  97.93   165K
96   4  2  99.19*  511K
96   4  1  98.41   358K
96   2  2  99.15   511K
96   2  1  98.54   358K
128  4  2  99.23   895K
128  4  1  98.94   625K
128  2  2  99.59*  895K
128  2  1  99.27   625K
Table 3. Results obtained by training the model on the original training data combined with the synthetic data, where \(d_{model}\) is the model dimensionality, h is the number of attention heads, N is the number of Conformer layers, ACC is the accuracy, and #Params is the number of model parameters. An asterisk marks the best accuracy for each \(d_{model}\).
Fig. 2. Analysis of AraSpot performance under various scenarios, illustrating the model parameters (dimensionality, number of heads, and layers) on the X axis and the corresponding accuracy on the Y axis. The horizontal black line represents the accuracy of the optimal model from the literature [9]. Results are presented for models trained on the original data with synthetic data generated through TTS and online data augmentation (blue bars), as well as models trained solely on the original data with online data augmentation (orange bars).
In terms of model architecture, the (128, 2, 2) configuration for ( \(d_{model}\) , h, N) consistently yields optimal results, whether synthetic data are employed or not. In Table 2, across all \(d_{model}\) values, the best (h,N) combination is always (2,2). In Table 3, this combination also shows high performance.
In comparison to the model proposed in Reference [9], which achieved 97.97% accuracy on the test set using a CNN model, our baseline model, trained without synthetic data, attained 99.35% accuracy. This underscores the superior performance of our model architecture over a CNN model. Moreover, the inclusion of extra data generated through a text-to-speech system resulted in our best-performing model, achieving 99.59% accuracy, which corresponds to an absolute error-rate reduction of 1.62 percentage points and a relative reduction of 79.8%.
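For clarity, the relative reduction follows directly from the two error rates:
\begin{equation*} \frac{(100 - 97.97) - (100 - 99.59)}{100 - 97.97} = \frac{2.03 - 0.41}{2.03} \approx 79.8\%. \end{equation*}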

6 Conclusion and Future Work

This work presented AraSpot for Arabic spoken keyword spotting, which achieved a state-of-the-art accuracy of 99.59%, outperforming previous approaches, by employing synthetic data generation using text-to-speech, online data augmentation, and the ConformerGRU model architecture. For future work, we recommend expanding the number of commands and increasing the number of speakers used to generate synthetic data.

References

[1]
Ahmed Alwajeeh, Mahmoud Al-Ayyoub, and Ismail Hmeidi. 2014. On authorship authentication of Arabic articles. In Proceedings of the 5th International Conference on Information and Communication Systems (ICICS’14). 1–6.
[2]
Sercan Ömer Arik, Markus Kliegl, Rewon Child, Joel Hestness, Andrew Gibiansky, Christopher Fougner, Ryan Prenger, and Adam Coates. 2017. Convolutional recurrent neural networks for small-footprint keyword spotting. arXiv:1703.05390. Retrieved from http://arxiv.org/abs/1703.05390
[3]
Mostafa Awaid, Sahar A. Fawzi, and Ahmed H. Kandil. 2014. Audio search based on keyword spotting in Arabic language. Int. J. Adv. Comput. Sci. Appl. 5, 2 (2014).
[4]
Lina Benamer and Osama Alkishriwo. 2020. Database for Arabic speech commands recognition. In Proceedings of the 3rd Conference for Engineering Sciences and Technology.
[5]
Naaima Boudad, Rdouan Faizi, Oulad haj thami Rachid, and Raddouane Chiheb. 2017. Sentiment analysis in Arabic: A review of the literature. Ain Shams Eng. J. 9 (2017).
[6]
Guoguo Chen, Oguz Yilmaz, Jan Trmal, Daniel Povey, and Sanjeev Khudanpur. 2013. Using proxies for OOV keywords in the keyword search task. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding. 416–421.
[7]
Alice Coucke, Mohammed Chlieh, Thibault Gisselbrecht, David Leroy, Mathieu Poumeyrol, and Thibaut Lavril. 2018. Efficient keyword spotting using dilated convolutions and gating. arXiv:1811.07684. Retrieved from https://arxiv.org/abs/1811.07684
[8]
Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra. 2017. Freesound datasets: A platform for the creation of open audio datasets.
[9]
Abdulkader Ghandoura, Farouk Hjabo, and Oumayma Al Dakkak. 2021. Building and benchmarking an Arabic speech commands dataset for small-footprint keyword spotting. Eng. Appl. Artif. Intell. 102 (2021), 104267.
[10]
Anmol Gulati, Chung-Cheng Chiu, James Qin, Jiahui Yu, Niki Parmar, Ruoming Pang, Shibo Wang, Wei Han, Yonghui Wu, Yu Zhang, and Zhengdong Zhang. 2020. Conformer: Convolution-augmented transformer for speech recognition. In Proceedings of the Interspeech Conference.
[11]
Marco Jeub, Magnus Schafer, and Peter Vary. 2009. A binaural room impulse response database for the evaluation of dereverberation algorithms. In Proceedings of the 16th International Conference on Digital Signal Processing. 1–5.
[12]
Chanwoo Kim, Ananya Misra, Kean Chin, Thad Hughes, Arun Narayanan, Tara Sainath, and Michiel Bacchiani. 2017. Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home. In Proceedings of the Interspeech Conference. 379–383.
[13]
Tom Ko, Vijayaditya Peddinti, Daniel Povey, Michael L. Seltzer, and Sanjeev Khudanpur. 2017. A study on data augmentation of reverberant speech for robust speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’17). 5220–5224.
[14]
Jason Li, Ravi Gadde, Boris Ginsburg, and Vitaly Lavrukhin. 2018. Training neural speech recognition systems with synthetic speech augmentation. arXiv:1811.00707. Retrieved from https://arxiv.org/abs/1811.00707
[15]
James Lin, Kevin Kilgour, Dominik Roblek, and Matthew Sharifi. 2020. Training keyword spotters with limited and synthesized speech data. arXiv:2002.01322. Retrieved from https://arxiv.org/abs/2002.01322
[16]
Assaf Hurwitz Michaely, Xuedong Zhang, Gabor Simko, Carolina Parada, and Petar Aleksic. 2017. Keyword spotting for Google assistant using contextual speech recognition. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU ’17). 272–278.
[17]
David Miller, Michael Kleber, Chia-Lin Kao, O. Kimball, Thomas Colthurst, Stephen Lowe, Richard Schwartz, and Herbert Gish. 2007. Rapid and accurate spoken term detection. In Proceedings of the Interspeech Conference, 314–317.
[18]
Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. In Proceedings of the Interspeech Conference. ISCA.
[19]
Jinhwan Park, Sichen Jin, Junmo Park, Sungsoo Kim, Dhairya Sandhyana, Changheon Lee, Myoungji Han, Jungin Lee, Seokyeong Jung, Changwoo Han, and Chanwoo Kim. 2023. Conformer-based on-device streaming speech recognition with KD compression and two-pass architecture. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT’23). 92–99.
[20]
Ryan Prenger, Rafael Valle, and Bryan Catanzaro. 2018. WaveGlow: A flow-based generative network for speech synthesis. arXiv:1811.00002. Retrieved from https://arxiv.org/abs/1811.00002
[21]
Anirudh Raju, Sankaran Panchapagesan, Xing Liu, Arindam Mandal, and Nikko Strom. 2018. Data augmentation for robust keyword spotting under playback interference. arXiv:1808.00563. Retrieved from https://arxiv.org/abs/1808.00563
[22]
J. R. Rohlicek, W. Russell, S. Roukos, and H. Gish. 1989. Continuous hidden Markov modeling for speaker-independent word spotting. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Vol. 1. 627–630.
[23]
R. C. Rose and D. B. Paul. 1990. A hidden Markov model based keyword recognition system. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Vol. 1. 129–132.
[24]
Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Ye Jia, Pedro Moreno, Yonghui Wu, and Zelin Wu. 2019. Speech recognition with augmented synthesized speech. arXiv:1909.11699. Retrieved from https://arxiv.org/abs/1909.11699
[25]
Deokjin Seo, Heung-Seon Oh, and Yuchul Jung. 2021. Wav2KWS: Transfer learning from speech representations for keyword spotting. IEEE Access 9 (2021), 80682–80691.
[26]
Changhao Shan, Junbo Zhang, Yujun Wang, and Lei Xie. 2018. Attention-based end-to-end models for small-footprint keyword spotting. arXiv:1803.10916. Retrieved from https://arxiv.org/abs/1803.10916
[27]
Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, R. J. Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu. 2017. Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions. arXiv:1712.05884. Retrieved from https://arxiv.org/abs/1712.05884
[28]
Yihua Shi, Guanglin Ma, Jin Ren, Haigang Zhang, and Jinfeng Yang. 2022. An end-to-end conformer-based speech recognition model for Mandarin radiotelephony communications in civil aviation. In Biometric Recognition, Weihong Deng, Jianjiang Feng, Di Huang, Meina Kan, Zhenan Sun, Fang Zheng, Wenfeng Wang, and Zhaofeng He (Eds.). Springer Nature Switzerland, Cham, 335–347.
[29]
Connor Shorten and Taghi Khoshgoftaar. 2019. A survey on image data augmentation for deep learning. J. Big Data 6 (2019).
[30]
Jose M. R. Sotelo, Soroush Mehri, Kundan Kumar, João Felipe Santos, Kyle Kastner, Aaron C. Courville, and Yoshua Bengio. 2017. Char2Wav: End-to-end speech synthesis. In International Conference on Learning Representations.
[31]
Ming Sun, David Snyder, Yixin Gao, Varun Nagaraja, Mike Rodehorst, Sankaran Panchapagesan, Nikko Ström, Spyros Matsoukas, and Shiv Vitaladevuni. 2017. Compressed time delay neural network for small-footprint keyword spotting. In Proceedings of the Interspeech Conference.
[32]
Igor Szoke, Miroslav Skacel, Ladislav Mosner, Jakub Paliesek, and Jan Cernocky. 2019. Building and evaluation of a real room impulse response dataset. IEEE J. Select. Top. Sign. Process. 13, 4 (Aug. 2019), 863–876.
[33]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc.
[34]
Oriol Vinyals and Steven Wegmann. 2014. Chasing the metric: Smoothing learning algorithms for keyword detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’14). 3301–3305.
[35]
Dong Wang, Shaohe Lv, Xiaodong Wang, and Xinye Lin. 2018. Gated convolutional LSTM for speech commands recognition. In Proceedings of the International Conference on Computational Science (ICCS’18). Springer International Publishing, Cham, 669–681.
[36]
Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous. 2017. Tacotron: Towards end-to-end speech synthesis. In Proceedings of the Interspeech Conference. 4006–4010.
[37]
Pete Warden. 2018. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv:1804.03209. Retrieved from https://arxiv.org/abs/1804.03209
[38]
M. Weintraub. 1993. Keyword-spotting using SRI’s DECIPHER large-vocabulary speech-recognition system. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2. 463–466.
[39]
Suorong Yang, Weikang Xiao, Mengchen Zhang, Suhan Guo, Jian Zhao, and Furao Shen. 2023. Image data augmentation for deep learning: A survey. arXiv:2204.08610. Retrieved from https://arxiv.org/abs/2204.08610
[40]
Deokgyu Yun and Seung Ho Choi. 2022. Deep learning-based estimation of reverberant environment for audio data augmentation. Sensors 22, 2 (2022).
[41]
Mengjun Zeng and Nanfeng Xiao. 2019. Effective combination of DenseNet and BiLSTM for keyword spotting. IEEE Access 7 (2019), 10767–10775.
[42]
Yang Zhang, Krishna C. Puvvada, Vitaly Lavrukhin, and Boris Ginsburg. 2023. Conformer-based target-speaker automatic speech recognition for single-channel audio. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’23). IEEE.
