Research article (Open Access)

AraSpot: Arabic Spoken Command Spotting

Published: 12 July 2024

Abstract

Spoken keyword spotting is the task of identifying a keyword in an audio stream and is widely used in smart devices at the edge to activate voice assistants and perform hands-free tasks. The task is challenging because such systems must achieve high accuracy while continuing to run efficiently on low-power devices with possibly limited computational capabilities. This work presents AraSpot for Arabic keyword spotting trained on 40 Arabic keywords, using different online data augmentation techniques and introducing the ConformerGRU model architecture. Finally, we further improve the performance of the model by training a text-to-speech model for synthetic data generation. AraSpot achieved a state-of-the-art accuracy of 99.59%, outperforming previous approaches.1

1 Introduction

Automatic Speech Recognition (ASR) is a fast-growing technology that has been attracting increased interest due to its embedment in a myriad of devices. ASR allows users to activate voice assistants and perform hands-free tasks by detecting a stream of input speech and converting it into its corresponding text. Spoken keyword spotting (KWS) is similar to the ASR problem but is mostly concerned with the identification of predefined keywords in continuous speech [34]. In fact, keyword spotting systems are common components in speech-enabled devices [16] and have a wide range of applications such as speech data mining, audio indexing, phone call routing, and many others [8].
Recently, many model families have become popular for KWS, including Convolutional Neural Networks (CNNs), Residual Networks (ResNets), and Recurrent Neural Networks (RNNs). The disadvantage of CNNs is that they do not handle sequences well: they are usually unable to capture long-term dependencies in the human speech signal, and the same holds for ResNets, which are short-sighted due to their limited receptive field. Conversely, recurrent neural networks directly model the input sequence without learning the local structure between successive time and frequency steps [41].
The Google Speech Commands (GSC) dataset [37] is the de facto KWS benchmark for English. Unfortunately, KWS has considerably less publicly available data than ASR, so training a neural network becomes harder given the scarcity of available data [25]. To overcome data scarcity for KWS, many researchers use pre-trained models and synthesized data, such as in Reference [15].
Most KWS research has focused on English and Asian languages, with little research investigating KWS in Arabic, despite the fact that Arabic is the fourth-most used language on the internet [1, 5]. This study introduces AraSpot for Arabic command spotting, leveraging the ASC dataset published in Reference [9]. We explore various online data augmentation techniques to model diverse environmental conditions, thereby enhancing and expanding the dataset. The proposed approach introduces a ConformerGRU model architecture to address the short- and long-term dependency issues of RNNs and CNNs. We demonstrate, based on empirical evidence, that our proposed model architecture surpasses all previous approaches on this dataset. Furthermore, we enhance model performance by augmenting the training data with additional speakers through synthetic data generation. To our knowledge, this study is the first to implement the Conformer architecture with a Gated Recurrent Unit (GRU) layer for KWS on the ASC dataset, while also incorporating synthetic data generation techniques.
This article is organized as follows. Section 2 presents a literature review, Section 3 describes the dataset, and Section 4 details our methodology. Section 5 presents the experiments and results, and, last, Section 6 concludes with a summary of potential future work.

2 Related Work

Keyword spotting has received a considerable amount of interest from the research community. One of the earliest approaches is based on large-vocabulary continuous speech recognition (LVCSR). In such systems, the speech signal is first decoded and the generated lattices are then searched for the keyword/filler [6, 17, 38]. An alternative to LVCSR is the keyword Hidden Markov Model (HMM), where a keyword HMM and a filler HMM are trained to model keyword and non-keyword audio segments [22, 23].
With the rise of GPU computational power and the increase in data availability, the research community switched gears toward deep learning-based KWS systems. For example, Coucke et al. [7] used the dilated convolutions of the WaveNet architecture and showed that the results were more robust in the presence of noise than LSTM- or CNN-based models. Arik et al. [2] proposed a single-layer CNN and two-layer RNNs; similarly, two gated CNNs with a one-layer bi-directional LSTM were proposed in Reference [35]. An attention-based end-to-end model for small-footprint KWS was proposed in Reference [26]. To overcome KWS data scarcity, Sun et al. [31] used transfer learning by training an ASR system and fine-tuning its acoustic model on the KWS task.
Lin et al. [15] showed that building a state-of-the-art (SOTA) KWS model requires more than 4,000 utterances per command. The authors also noted that, given the various limitations and difficulties in acquiring more data, methods to enlarge and expand the training data are required. This problem was alleviated in References [14, 24] by using synthesized speech as a data augmentation approach, where a text-to-speech system generates the synthetic speech. Furthermore, to enhance model robustness against different noisy environments, artificial data corruption, by adding reverberated music or TV/movie audio to each utterance at a certain speech-to-interference ratio, was used in Reference [21]. The study in Reference [13] explores the influence of data augmentation on speech recognition performance through the generation of far-field data using simulated and real room impulse responses (RIRs), specifically utilizing reverberation techniques. Moreover, the room simulator developed in Reference [12] is used to generate large-scale simulated data for training deep neural networks for far-field speech recognition; this simulation-based approach was employed in the Google Home product and brought significant performance improvements.
For Arabic, Ghandoura et al. [9] recorded and published a benchmark that includes 40 commands recorded by 30 different speakers. The authors achieved 97.97% accuracy using a deep CNN model and applied different data augmentation techniques to increase data diversity. Benamer and Alkishriwo [4] published another benchmark that includes 16 commands but used an LSTM model instead. Furthermore, a keyword spotting system was presented in Reference [3] to perform audio searching of uttered words in Arabic speech.

3 Dataset Description

The ASC dataset [9] includes 12,000 pairs of 1-second-long audio files and corresponding keywords, covering 40 keywords in total. Each keyword has 300 audio files recorded by 30 participants, each providing 10 utterances per keyword, for a total dataset size of 384 MB. Some of the keywords were inspired by the GSC dataset [37], while the remaining commands were selected so they could be grouped into broad and potentially overlapping categories. Audio files use a 16 kHz sampling rate, 16 bits per sample, a mono signal, and the .wav format. The dataset is in standard Arabic, and all recordings were made using a laptop with an external microphone in a quiet environment. The keywords were chosen to activate voice assistants and perform hands-free tasks for applications and devices such as a simple photo browser or a keypad [9]. Table 1 lists the English translations of the 40 keywords in the dataset. It should be noted that, compared with GSC, the ASC dataset has fewer utterances per class but cleaner data quality due to manual segmentation.
Table 1. The 40 commands in the ASC dataset, listed by their English translations (the Arabic keyword renderings appear as images in the original article and are not reproduced here): Zero, One, Two, Three, Four, Five, Six, Seven, Eight, Nine, Right, Left, Up, Down, Front, Back, Yes, No, Start, Stop, Enable, Disable, Ok, Cancel, Open, Close, Zoom in, Zoom out, Previous, Next, Send, Receive, Move, Rotate, Record, Enter, Digit, Direction, Options, Undo.

4 Solution Approach

4.1 Data Augmentation

The core idea of data augmentation is to generate additional synthetic data to improve data diversity and cover a comprehensive range of conditions that could be present in unseen instances. The augmented data are typically viewed as belonging to a distribution close to the original one [39], while the resulting augmented examples can still be semantically described by the labels of the original input examples, which is known as a label-preserving transformation. Augmented data are normally generated on the fly during the training process, which is known as online augmentation. An alternative is offline augmentation [29], which transforms the data beforehand and stores them in memory.
For this work, we apply on-the-fly data augmentation in both the time domain and the frequency domain. Let \(F_t=\lbrace f_1, f_2, \ldots ,f_Q\rbrace\) and \(F_f=\lbrace f_1, f_2, \ldots ,f_V\rbrace\) be sets of pre-defined time-domain and frequency-domain transformation/augmentation functions. For a given input speech signal \(x_i\), we first apply the chosen time-domain augmentations \(\tilde{F}_{t}^{i}\) for the \(i{\rm th}\) signal, and then, after transforming the augmented signal into the frequency domain, we apply the chosen frequency-domain augmentations \(\tilde{F}_{f}^{i}\),
\begin{equation} \tilde{F_{t}^{i}} = \lbrace f_q: r_{q}^{i} \ge \lambda , 1 \le q \le Q\rbrace , \end{equation}
(1)
\begin{equation} \tilde{F_{f}^{i}} = \lbrace f_v: r_{v}^{i} \ge \gamma , 1 \le v \le V\rbrace , \end{equation}
(2)
where \(r_{v}^{i}\) and \(r_{q}^{i}\) represent values uniformly sampled from \([0, 1]\) at each training step for each augmentation operation, and \(\tilde{F}_{t}^{i}\) and \(\tilde{F}_{f}^{i}\) denote the time-domain and frequency-domain functions, with the operation order shuffled per domain, applied on the \(i{\rm th}\) input signal at a given training step. Finally, \(\lambda\) and \(\gamma\) denote the time-domain and frequency-domain augmentation rates, ensuring that any signal can receive any possible augmentation combination in different orders from one epoch to the next.
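As a concrete illustration, the per-sample selection of Equations (1) and (2) could be implemented roughly as follows. This is a minimal Python sketch under the assumptions above; the names select_and_apply, F_t, F_f, lam, gamma, and to_spectrogram are hypothetical and not part of the authors' code.

import random

def select_and_apply(x, funcs, rate, rng=random):
    """Apply each augmentation f whose sampled r >= rate (Equations (1)-(2)),
    with the operation order re-shuffled for every sample."""
    chosen = [f for f in funcs if rng.random() >= rate]
    rng.shuffle(chosen)              # per-sample operation-order shuffling
    for f in chosen:
        x = f(x)
    return x

# Hypothetical use inside a dataset's __getitem__:
#   waveform = select_and_apply(waveform, F_t, lam)     # time-domain augmentations
#   spec = to_spectrogram(waveform)
#   spec = select_and_apply(spec, F_f, gamma)           # frequency-domain augmentations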
For a given speech signal X in the time domain, the following time-domain augmentation methods are used as the items of \(F_t\) (a combined code sketch of all four methods is given after the list):
(1)
Urban Background Noise Injection: We used noise injection similar to Reference [9], but with the test set of the Freesound data published in Reference [8]. We first concatenated all existing K noise audios into a single noise signal \(\mathcal {N}\) and then applied the augmentation process as follows:
\begin{equation} m \sim unif(0, T_{n}), \end{equation}
(3)
\begin{equation} n \sim unif(m, min(T_{n}, m + T_{s})), \end{equation}
(4)
\begin{equation} f \sim unif(0, T_s - n + m - 1), \end{equation}
(5)
\begin{equation} \xi = [0]_{f} \parallel (\mathcal {N}_{i})_{m \le i \lt n} \parallel [0]_{T_s - f - n + m }, \end{equation}
(6)
\begin{equation} \acute{X}=\mathcal {G} \xi + X, \end{equation}
(7)
where \(T_{s} = \mid X\mid\) and \(T_{n} = \mid \mathcal {N}\mid\), m and n represent the start and end of the noise segment in \(\mathcal {N}\), and f denotes the degree of freedom ensuring variability in the starting point of the addition for the same audio across different steps. Additionally, \(\xi\) denotes the noise segment and \(\parallel\) signifies the concatenation operation, where the selected noise chunk is padded with f leading zeros and \(T_{s} - f - n + m\) trailing zeros. It should be noted that \(\acute{X}\) represents the augmented version of X and \(\mathcal {G}\) denotes a random gain between 0 and 1.
(2)
Speech Reverberation: Speech reverberation is caused by the environment surrounding the source, where the signal received by the input device (i.e., a microphone) is the sum of multiple shifted and attenuated copies of the same original signal [40]. Speech reverberation can be simulated by convolving the original input speech signal with a room impulse response (RIR). For this purpose, we used the RIR datasets created and published in References [11] and [32].
Let \(H=\lbrace h_1, h_2,\ldots ,h_R\rbrace\) be the set of all available impulse responses, each of 1-second length. For a given speech signal X, the augmentation process is as follows:
\begin{equation} h \sim unif(H), \end{equation}
(8)
\begin{equation} l \sim unif(a, b), \end{equation}
(9)
\begin{equation} \acute{X}= X \ast (h_i)_{0 \le i \le l}, \end{equation}
(10)
\begin{equation} \acute{X}[n]= \sum _{i=0}^{l} h[i]X[n - i], \end{equation}
(11)
where l is the speech reverberation length, the \(\ast\) symbol in Equation (10) is the convolution operation, and a and b are the minimum and maximum reverberation lengths; we set a to 31 ms and b to 250 ms.
(3)
Random Volume Gain: Similarly to the work done in Reference [9], for a given signal X, the magnitude of the signal is multiplied by a random gain \(\mathcal {G}\) as follows:
\begin{equation} \acute{X} = \mathcal {G} X, \end{equation}
(12)
where \(\mathcal {G}\) is a random value between 0.2 and 2.
(4)
Random Fade In/Out: Given a speech signal X, we multiply the magnitudes of the signal by a fade signal whose shape is sampled uniformly from linear, exponential, logarithmic, quarter-sine, and half-sine fades. The length of the fade signal is chosen randomly between 0 and \(\mid X \mid\) and padded with ones to match the length of the original waveform X. Formally,
\begin{equation} \acute{X} = F_{in} F_{out} X, \end{equation}
(13)
where \(F_{in}\) is the fade-in signal and \(F_{out}\) is the fade-out signal.
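The four time-domain augmentations above could be sketched as follows. This is a minimal NumPy illustration under the stated assumptions, not the authors' implementation; in particular, the exact fade-shape formulas and the use of a random generator with these bounds are assumptions.

import numpy as np

rng = np.random.default_rng()

def inject_urban_noise(x, noise):
    """Equations (3)-(7): add a randomly placed chunk of the concatenated
    noise signal N to the waveform x, scaled by a random gain in [0, 1]."""
    T_s, T_n = len(x), len(noise)
    m = int(rng.integers(0, T_n))                     # start of noise chunk
    n = int(rng.integers(m, min(T_n, m + T_s) + 1))   # end of noise chunk
    f = int(rng.integers(0, T_s - (n - m) + 1))       # random placement offset
    xi = np.zeros(T_s, dtype=float)
    xi[f:f + (n - m)] = noise[m:n]                    # zero-padded noise segment
    return rng.uniform(0.0, 1.0) * xi + x

def reverberate(x, rirs, sr=16000, a_ms=31, b_ms=250):
    """Equations (8)-(11): convolve x with a randomly chosen, randomly
    truncated room impulse response."""
    h = rirs[int(rng.integers(len(rirs)))]            # h ~ unif(H)
    l = int(rng.integers(int(a_ms * sr / 1000), int(b_ms * sr / 1000) + 1))
    return np.convolve(x, h[:l], mode="full")[: len(x)]

def random_gain(x):
    """Equation (12): scale the waveform by a random gain in [0.2, 2]."""
    return rng.uniform(0.2, 2.0) * x

FADE_SHAPES = {                                       # assumed shape formulas
    "linear": lambda t: t,
    "exponential": lambda t: np.expm1(t) / np.expm1(1.0),
    "logarithmic": lambda t: np.log1p(t) / np.log(2.0),
    "quarter_sine": lambda t: np.sin(0.5 * np.pi * t),
    "half_sine": lambda t: 0.5 * (1.0 - np.cos(np.pi * t)),
}

def _ramp(length):
    shape = FADE_SHAPES[str(rng.choice(list(FADE_SHAPES)))]
    return shape(np.linspace(0.0, 1.0, length))

def random_fade(x):
    """Equation (13): apply a fade-in and a fade-out of random length and
    shape, each padded with ones to the full signal length."""
    n_in = int(rng.integers(0, len(x) + 1))
    n_out = int(rng.integers(0, len(x) + 1))
    f_in = np.concatenate([_ramp(n_in), np.ones(len(x) - n_in)])
    f_out = np.concatenate([np.ones(len(x) - n_out), _ramp(n_out)[::-1]])
    return f_in * f_out * x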
For a given signal X in the frequency domain, spectrogram-based augmentation is applied as proposed in Reference [18]; for this work, we mainly used time and frequency masking as the items of \(F_f\).
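Time and frequency masking of this kind are available off the shelf; a minimal sketch using torchaudio's masking transforms is shown below. The mask widths are assumptions, as the paper does not report them.

import torch
import torchaudio

freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=8)   # assumed width
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=12)       # assumed width

def spec_augment(spec: torch.Tensor) -> torch.Tensor:
    """Apply SpecAugment-style frequency and time masking [18] to a
    (channel, freq, time) feature tensor."""
    return time_mask(freq_mask(spec))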

4.2 Synthetic Data Generation Using Text-to-Speech

End-to-end Text-to-Speech (TTS) systems generate speech directly from a given text, unlike traditional TTS systems that rely on complex pipelines. Seq2Seq-based TTS systems such as those in References [2, 27, 30, 36] are commonly composed of an encoder, a decoder, and an attention mechanism, such that the character embeddings are projected into a Mel-scale spectrogram, followed by a vocoder that converts the predicted Mel-scale spectrogram into a waveform.
In this work, we use Tacotron 2 [36], which has a relatively simple architecture. The model consists of an encoder and a decoder with attention. The encoder takes the input character/phoneme sequence C and projects it into a high-level representation h, and the decoder with attention then generates Mel-scale spectrogram frames by attending over h and conditioning on the previously predicted frames.
We used the same setup as the original Tacotron 2 paper [36], with WaveGlow [20] as the vocoder, and trained the TTS on the Arabic Common Voice dataset.2 The data were filtered to keep the 10 speakers with the highest number of utterances and relatively high recording quality. This was done because most speakers in the dataset have only a few utterances, and training the model on a small number of recordings per speaker leads to inconsistency and unintelligible generated speech.
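The speaker filtering described above could be reproduced along the following lines. This is a minimal sketch assuming the standard Common Voice TSV layout (client_id, path, sentence columns); the file paths are hypothetical, and the additional quality filtering is not shown.

import pandas as pd

# Keep only the 10 speakers with the most validated utterances in Arabic Common Voice.
cv = pd.read_csv("cv-corpus/ar/validated.tsv", sep="\t")
top_speakers = cv["client_id"].value_counts().head(10).index
tts_train = cv[cv["client_id"].isin(top_speakers)][["client_id", "path", "sentence"]]
tts_train.to_csv("tts_train_top10.tsv", sep="\t", index=False)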

4.3 ConformerGRU Model

CNNs and RNNs each have their own advantages and limitations: while CNNs exploit local information and local dependencies, RNNs exploit long-term information and dependencies.
The Conformer architecture, introduced in Reference [10], has gained considerable attention in various speech recognition applications, including those mentioned in References [19, 28, 42]. This popularity is attributed to its capability, outlined in Reference [10], to effectively capture both long- and short-term dependencies. This is achieved by fusing the multi-head self-attention of the Transformer architecture [33] with convolutional neural networks. Consequently, the resulting model is adept at modeling both local and global dependencies.
To generate a latent vector representing the entire input speech sequence, we employ a bidirectional GRU layer and concatenate the last hidden vectors of the forward and backward directions, treating the resulting vector as the latent representation of the input sequence. We therefore combine the Conformer block with a GRU layer, as described next.
Given a dataset \(\mathcal {D}=\lbrace (x_1, y_1), (x_2, y_2),\ldots ,(x_N, y_N)\rbrace\), where \(x_i\) and \(y_i\) are the \(i{\rm th}\) input example and the target label, respectively, the objective is to model \(P(Y \mid X)\) using a function \(f_\theta\) that maximizes the following objective function:
\begin{equation} \max _{\theta } \prod _{i=1}^{N} P(y_i \mid x_i;\theta), \end{equation}
(14)
\begin{equation} \min _{\theta } \sum _{i=1}^{N} -log(P(y_i \mid x_i;\theta)). \end{equation}
(15)
To model \(f_\theta\), we propose the ConformerGRU model, which consists of the following layers, with the full architecture shown in Figure 1 and a minimal implementation sketch given after the list:
Fig. 1. ConformerGRU model architecture.
(1)
a Pre-net Layer that projects the speech feature space into a higher-level representation;
(2)
a Conformer Block consisting of multiple Conformer layers, which enables the model to handle long- and short-term information dependencies;
(3)
a single GRU layer that acts as an aggregation function, instead of using the sum or average of the hidden states or only the first hidden state; and
(4)
a Post-net Layer of two modules, where the first is a simple projection layer and the second is a prediction layer with a softmax activation function.
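The sketch below shows how such a model could be assembled in PyTorch on top of torchaudio's Conformer implementation. It is a minimal illustration of the four layers above, not the authors' code: hyperparameters the paper does not report (feed-forward dimension, convolution kernel size) and the 41-class output (40 keywords plus the noise/NULL label introduced in Section 5.1) are assumptions, and padding is not masked before the GRU for simplicity.

import torch
import torch.nn as nn
import torchaudio

class ConformerGRU(nn.Module):
    """Minimal sketch of the ConformerGRU architecture described above."""

    def __init__(self, n_mfcc=40, d_model=128, n_heads=2, n_layers=2,
                 n_classes=41, dropout=0.15):
        super().__init__()
        self.pre_net = nn.Linear(n_mfcc, d_model)                  # (1) Pre-net projection
        self.conformer = torchaudio.models.Conformer(              # (2) Conformer block
            input_dim=d_model, num_heads=n_heads, ffn_dim=4 * d_model,
            num_layers=n_layers, depthwise_conv_kernel_size=31, dropout=dropout)
        self.gru = nn.GRU(d_model, d_model, batch_first=True,      # (3) GRU aggregator
                          bidirectional=True)
        self.post_net = nn.Sequential(                             # (4) Post-net
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(d_model, n_classes), nn.LogSoftmax(dim=-1))

    def forward(self, mfcc, lengths):
        # mfcc: (batch, time, n_mfcc); lengths: (batch,) number of valid frames
        x = self.pre_net(mfcc)
        x, lengths = self.conformer(x, lengths)
        _, h_n = self.gru(x)                                       # h_n: (2, batch, d_model)
        latent = torch.cat([h_n[0], h_n[1]], dim=-1)               # forward ++ backward states
        return self.post_net(latent)                               # log-probabilities

Since the Post-net ends in a log-softmax, such a model can be trained directly with the negative log-likelihood objective of Equation (15).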

5 Experiments and Results

5.1 Experiments Setup

Let the data be \(\mathcal {D}=\lbrace (x_{1}, y_{1}), (x_{2}, y_{2}),\ldots ,(x_{N}, y_{N})\rbrace\), such that \(x_{i}\) and \(y_{i}\) are the speech signal and the target label/command, respectively. Let \(y_{i}\in {Y}\) and \(x_{i}\in \mathbb {R}^{C \times S}\), where Y is the set of all unique labels, C is the number of channels, and S is the number of speech samples in the utterance. We added an extra label to represent the noise/NULL class. Thus, 300 noise audios were generated and split into 60% for training, 20% for validation, and 20% for testing, using the same noise audio and similar criteria as in Reference [9].
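As a small illustration, the 60/20/20 split of the generated noise clips could look like the following sketch; the file names and random seed are hypothetical.

import random

noise_files = [f"noise_{i:03d}.wav" for i in range(300)]   # 300 generated noise clips
random.Random(0).shuffle(noise_files)
train, val, test = noise_files[:180], noise_files[180:240], noise_files[240:]  # 60/20/20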
All the synthetic data generated by the text-to-speech system described in Section 4.2, for all speakers, were added to the training data. Furthermore, the online augmentation described in Section 4.1 was applied during training, and no offline augmentation was used.
For all experiments, we extracted 40 Mel-frequency cepstral coefficient (MFCC) features computed using a 25-ms window, a 10-ms stride, and 80-channel Mel filter banks.
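This feature extraction maps directly onto torchaudio's MFCC transform; the sketch below assumes 16 kHz audio, and the FFT size and file name are assumptions.

import torchaudio

waveform, sr = torchaudio.load("example.wav")        # (channel, samples) at 16 kHz
mfcc = torchaudio.transforms.MFCC(
    sample_rate=16000, n_mfcc=40,
    melkwargs={"n_fft": 400, "win_length": 400,      # 25 ms window at 16 kHz
               "hop_length": 160, "n_mels": 80})     # 10 ms stride, 80 filter banks
features = mfcc(waveform)                            # (channel, 40, frames)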
We used the negative log-likelihood loss and the Adam optimizer with linear learning rate decay, as shown in Equation (17), where \(lr_0\) is the initial learning rate (set to \(10^{-3}\) for all experiments), \(e \in [0, E)\) is the current epoch, and E is the total number of epochs. Last, a dropout ratio of 15% is used for regularization.
We trained all models on a single machine with a single NVIDIA 3080 Ti GPU and a batch size of 256. Since the data are balanced across all labels, we used the accuracy shown in Equation (16) as the metric to measure performance across all experiments, where \(\hat{y}_i\) is the predicted class for the \(i{\rm th}\) example,
\begin{equation} Accuracy = \frac{1}{N} \sum _{i=1}^N \mathbb {1}(\hat{y}_i = y_i) \times 100\%, \end{equation}
(16)
\begin{equation} lr(e, E) = lr_0 \left(1 - \frac{e}{E}\right). \end{equation}
(17)
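The training configuration above could be wired up as in the following sketch. The total number of epochs E and the use of a LambdaLR scheduler for the linear decay of Equation (17) are illustrative choices, and ConformerGRU refers to the sketch in Section 4.3.

import torch

model = ConformerGRU()                                 # sketch from Section 4.3
criterion = torch.nn.NLLLoss()                         # negative log-likelihood, Eq. (15)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
E = 100                                                # total epochs (assumed value)
scheduler = torch.optim.lr_scheduler.LambdaLR(         # lr(e, E) = lr_0 * (1 - e / E), Eq. (17)
    optimizer, lr_lambda=lambda e: 1.0 - e / E)
# scheduler.step() is called once per epoch after the training loop body.

def accuracy(log_probs: torch.Tensor, targets: torch.Tensor) -> float:
    """Equation (16): mean exact-match accuracy in percent."""
    return (log_probs.argmax(dim=-1) == targets).float().mean().item() * 100.0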

5.2 Results

Multiple experiments were conducted to assess the impact of different parameters on accuracy; specifically, we examined how changes in the number of Conformer layers, the number of self-attention heads, and the model dimensionality affect the system's performance.
We first examine the performance when using only the data augmentation detailed in Section 4.1, without additional synthetic data generation. As shown in Table 2, increasing the model's dimensionality enhances performance, while a higher number of attention heads does not yield improved results. The additional attention heads did not lead to further improvements because they failed to provide new or useful information beyond what was already captured by the existing attention mechanism, and this redundancy results in diminishing returns.
Table 2.
\(d_{model}\)  h  N  ACC (%)  #Params
64   4  2  98.21   234K
64   4  1  97.19   165K
64   2  2  98.5*   234K
64   2  1  97.64   165K
96   4  2  98.78   511K
96   4  1  98.17   358K
96   2  2  99.1*   511K
96   2  1  98.17   358K
128  4  2  99.17   895K
128  4  1  98.7    625K
128  2  2  99.35*  895K
128  2  1  98.61   625K
Table 2. Results obtained by training the model on the original training data only, where \(d_{model}\) is the model dimensionality, h is the number of attention heads, N is the number of Conformer layers, ACC is the accuracy, and #Params is the number of model parameters. An asterisk marks the best accuracy for each \(d_{model}\).
In addition, for any given model dimensionality \(d_{model}\) and number of self-attention heads h, using more Conformer layers (i.e., \(N=2\)) always gives higher accuracy.
The introduction of synthetic data through TTS significantly enhanced the model performance in all scenarios, as evident in Table 3 and Figure 2.
Table 3.
\(d_{model}\)  h  N  ACC (%)  #Params
64   4  2  98.41   234K
64   4  1  98.01   165K
64   2  2  98.66*  234K
64   2  1  97.93   165K
96   4  2  99.19*  511K
96   4  1  98.41   358K
96   2  2  99.15   511K
96   2  1  98.54   358K
128  4  2  99.23   895K
128  4  1  98.94   625K
128  2  2  99.59*  895K
128  2  1  99.27   625K
Table 3. Results obtained by training the model on the original training data combined with the synthetic data, where \(d_{model}\) is the model dimensionality, h is the number of attention heads, N is the number of Conformer layers, ACC is the accuracy, and #Params is the number of model parameters. An asterisk marks the best accuracy for each \(d_{model}\).
Fig. 2. Analysis of AraSpot performance under various scenarios, illustrating the model parameters (dimensionality, number of heads, and layers) on the X axis and the corresponding accuracy on the Y axis. The horizontal black line represents the accuracy of the optimal model from the literature [9]. Results are presented for models trained on the original data with synthetic data generated through TTS and online data augmentation (blue bars), as well as models trained solely on the original data with online data augmentation (orange bars).
In terms of model architecture, the (128, 2, 2) configuration for ( \(d_{model}\) , h, N) consistently yields optimal results, whether synthetic data are employed or not. In Table 2, across all \(d_{model}\) values, the best (h,N) combination is always (2,2). In Table 3, this combination also shows high performance.
In comparison to the model proposed in Reference [9], which achieved 97.97% accuracy on the test set using a CNN model, our baseline model, trained without synthetic data, attained 99.35% accuracy. This underscores the superior performance of our model architecture over a CNN model. Moreover, the inclusion of extra data generated through a text-to-speech system resulted in our best-performing model, achieving 99.59% accuracy, which corresponds to an absolute error-rate reduction of 1.62 percentage points and a relative reduction of 79.8%.
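For clarity, the relative reduction follows directly from the two error rates:
\begin{equation*} \frac{(100 - 97.97) - (100 - 99.59)}{100 - 97.97} = \frac{2.03 - 0.41}{2.03} \approx 79.8\%. \end{equation*}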

6 Conclusion and Future Work

This work presented AraSpot for Arabic spoken keyword spotting, which achieved a state-of-the-art accuracy of 99.59%, outperforming previous approaches, by employing synthetic data generation using text-to-speech, online data augmentation, and the ConformerGRU model architecture. For future work, we recommend expanding the number of commands and increasing the number of speakers used to generate synthetic data.

References

[1]
Ahmed Alwajeeh, Mahmoud Al-Ayyoub, and Ismail Hmeidi. 2014. On authorship authentication of Arabic articles. In Proceedings of the 5th International Conference on Information and Communication Systems (ICICS’14). 1–6.
[2]
Sercan Ömer Arik, Markus Kliegl, Rewon Child, Joel Hestness, Andrew Gibiansky, Christopher Fougner, Ryan Prenger, and Adam Coates. 2017. Convolutional recurrent neural networks for small-footprint keyword spotting. arXiv:1703.05390. Retrieved from http://arxiv.org/abs/1703.05390
[3]
Mostafa Awaid, Sahar A. Fawzi, and Ahmed H. Kandil. 2014. Audio search based on keyword spotting in Arabic language. Int. J. Adv. Comput. Sci. Appl. 5, 2 (2014).
[4]
Lina Benamer and Osama Alkishriwo. 2020. Database for Arabic speech commands recognition. In Proceedings of the 3rd Conference for Engineering Sciences and Technology.
[5]
Naaima Boudad, Rdouan Faizi, Oulad haj thami Rachid, and Raddouane Chiheb. 2017. Sentiment analysis in Arabic: A review of the literature. Ain Shams Eng. J. 9 (2017).
[6]
Guoguo Chen, Oguz Yilmaz, Jan Trmal, Daniel Povey, and Sanjeev Khudanpur. 2013. Using proxies for OOV keywords in the keyword search task. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding. 416–421.
[7]
Alice Coucke, Mohammed Chlieh, Thibault Gisselbrecht, David Leroy, Mathieu Poumeyrol, and Thibaut Lavril. 2018. Efficient keyword spotting using dilated convolutions and gating. arXiv:1811.07684. Retrieved from https://arxiv.org/abs/1811.07684
[8]
Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra. 2017. Freesound datasets: A platform for the creation of open audio datasets.
[9]
Abdulkader Ghandoura, Farouk Hjabo, and Oumayma Al Dakkak. 2021. Building and benchmarking an Arabic speech commands dataset for small-footprint keyword spotting. Eng. Appl. Artif. Intell. 102 (2021), 104267.
[10]
Anmol Gulati, Chung-Cheng Chiu, James Qin, Jiahui Yu, Niki Parmar, Ruoming Pang, Shibo Wang, Wei Han, Yonghui Wu, Yu Zhang, and Zhengdong Zhang. 2020. Conformer: Convolution-augmented transformer for speech recognition. In Proceedings of the Interspeech Conference.
[11]
Marco Jeub, Magnus Schafer, and Peter Vary. 2009. A binaural room impulse response database for the evaluation of dereverberation algorithms. In Proceedings of the 16th International Conference on Digital Signal Processing. 1–5.
[12]
Chanwoo Kim, Ananya Misra, Kean Chin, Thad Hughes, Arun Narayanan, Tara Sainath, and Michiel Bacchiani. 2017. Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home. In Proceedings of the Interspeech Conference. 379–383.
[13]
Tom Ko, Vijayaditya Peddinti, Daniel Povey, Michael L. Seltzer, and Sanjeev Khudanpur. 2017. A study on data augmentation of reverberant speech for robust speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’17). 5220–5224.
[14]
Jason Li, Ravi Gadde, Boris Ginsburg, and Vitaly Lavrukhin. 2018. Training neural speech recognition systems with synthetic speech augmentation. arXiv:1811.00707. Retrieved from https://arxiv.org/abs/1811.00707
[15]
James Lin, Kevin Kilgour, Dominik Roblek, and Matthew Sharifi. 2020. Training keyword spotters with limited and synthesized speech data. arXiv:2002.01322. Retrieved from https://arxiv.org/abs/2002.01322
[16]
Assaf Hurwitz Michaely, Xuedong Zhang, Gabor Simko, Carolina Parada, and Petar Aleksic. 2017. Keyword spotting for Google assistant using contextual speech recognition. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU ’17). 272–278.
[17]
David Miller, Michael Kleber, Chia-Lin Kao, O. Kimball, Thomas Colthurst, Stephen Lowe, Richard Schwartz, and Herbert Gish. 2007. Rapid and accurate spoken term detection. In Proceedings of the Interspeech Conference, 314–317.
[18]
Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. In Proceedings of the Interspeech Conference. ISCA.
[19]
Jinhwan Park, Sichen Jin, Junmo Park, Sungsoo Kim, Dhairya Sandhyana, Changheon Lee, Myoungji Han, Jungin Lee, Seokyeong Jung, Changwoo Han, and Chanwoo Kim. 2023. Conformer-based on-device streaming speech recognition with KD compression and two-pass architecture. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT’23). 92–99.
[20]
Ryan Prenger, Rafael Valle, and Bryan Catanzaro. 2018. WaveGlow: A flow-based generative network for speech synthesis. arXiv:1811.00002. Retrieved from https://arxiv.org/abs/1811.00002
[21]
Anirudh Raju, Sankaran Panchapagesan, Xing Liu, Arindam Mandal, and Nikko Strom. 2018. Data augmentation for robust keyword spotting under playback interference. arXiv:1808.00563. Retrieved from https://arxiv.org/abs/1808.00563
[22]
J. R. Rohlicek, W. Russell, S. Roukos, and H. Gish. 1989. Continuous hidden Markov modeling for speaker-independent word spotting. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Vol. 1. 627–630.
[23]
R. C. Rose and D. B. Paul. 1990. A hidden Markov model based keyword recognition system. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Vol. 1. 129–132.
[24]
Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Ye Jia, Pedro Moreno, Yonghui Wu, and Zelin Wu. 2019. Speech recognition with augmented synthesized speech. arXiv:1909.11699. Retrieved from https://arxiv.org/abs/1909.11699
[25]
Deokjin Seo, Heung-Seon Oh, and Yuchul Jung. 2021. Wav2KWS: Transfer learning from speech representations for keyword spotting. IEEE Access 9 (2021), 80682–80691.
[26]
Changhao Shan, Junbo Zhang, Yujun Wang, and Lei Xie. 2018. Attention-based end-to-end models for small-footprint keyword spotting. arXiv:1803.10916. Retrieved from https://arxiv.org/abs/1803.10916
[27]
Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, R. J. Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu. 2017. Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions. arXiv:1712.05884. Retrieved from https://arxiv.org/abs/1712.05884
[28]
Yihua Shi, Guanglin Ma, Jin Ren, Haigang Zhang, and Jinfeng Yang. 2022. An end-to-end conformer-based speech recognition model for Mandarin radiotelephony communications in civil aviation. In Biometric Recognition, Weihong Deng, Jianjiang Feng, Di Huang, Meina Kan, Zhenan Sun, Fang Zheng, Wenfeng Wang, and Zhaofeng He (Eds.). Springer Nature Switzerland, Cham, 335–347.
[29]
Connor Shorten and Taghi Khoshgoftaar. 2019. A survey on image data augmentation for deep learning. J. Big Data 6 (2019).
[30]
Jose M. R. Sotelo, Soroush Mehri, Kundan Kumar, João Felipe Santos, Kyle Kastner, Aaron C. Courville, and Yoshua Bengio. 2017. Char2Wav: End-to-end speech synthesis. In International Conference on Learning Representations.
[31]
Ming Sun, David Snyder, Yixin Gao, Varun Nagaraja, Mike Rodehorst, Sankaran Panchapagesan, Nikko Ström, Spyros Matsoukas, and Shiv Vitaladevuni. 2017. Compressed time delay neural network for small-footprint keyword spotting. In Proceedings of the Interspeech Conference.
[32]
Igor Szoke, Miroslav Skacel, Ladislav Mosner, Jakub Paliesek, and Jan Cernocky. 2019. Building and evaluation of a real room impulse response dataset. IEEE J. Select. Top. Sign. Process. 13, 4 (Aug. 2019), 863–876.
[33]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc.
[34]
Oriol Vinyals and Steven Wegmann. 2014. Chasing the metric: Smoothing learning algorithms for keyword detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’14). 3301–3305.
[35]
Dong Wang, Shaohe Lv, Xiaodong Wang, and Xinye Lin. 2018. Gated convolutional LSTM for speech commands recognition. In Proceedings of the International Conference on Computational Science (ICCS’18). Springer International Publishing, Cham, 669–681.
[36]
Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous. 2017. Tacotron: Towards end-to-end speech synthesis. In Proceedings of the Interspeech Conference. 4006–4010.
[37]
Pete Warden. 2018. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv:1804.03209. Retrieved from https://arxiv.org/abs/1804.03209
[38]
M. Weintraub. 1993. Keyword-spotting using SRI’s DECIPHER large-vocabulary speech-recognition system. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2. 463–466.
[39]
Suorong Yang, Weikang Xiao, Mengchen Zhang, Suhan Guo, Jian Zhao, and Furao Shen. 2023. Image data augmentation for deep learning: A survey. arXiv:2204.08610. Retrieved from https://arxiv.org/abs/2204.08610
[40]
Deokgyu Yun and Seung Ho Choi. 2022. Deep learning-based estimation of reverberant environment for audio data augmentation. Sensors 22, 2 (2022).
[41]
Mengjun Zeng and Nanfeng Xiao. 2019. Effective combination of DenseNet and BiLSTM for keyword spotting. IEEE Access 7 (2019), 10767–10775.
[42]
Yang Zhang, Krishna C. Puvvada, Vitaly Lavrukhin, and Boris Ginsburg. 2023. Conformer-based target-speaker automatic speech recognition for single-channel audio. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’23). IEEE.
