1 Introduction

In recent years, deep neural networks have made remarkable advances, leading to a surge of interest from both individuals and organizations in training neural network models to improve the efficiency of tasks such as data analysis and prediction. Because training a neural network model typically demands large amounts of data, time, and computational resources, sharing pre-trained models has become common practice in the field of artificial intelligence. However, this practice raises a serious problem: how to protect the intellectual property rights of model owners and prevent illegal use of the models by malicious attackers. Consequently, researchers have explored various methods to secure the ownership of deep neural network (DNN) models.

In recent years, researchers have identified embedding watermarks into models requiring protection as a promising approach. These watermarks serve as unique identifiers, facilitating the verification of ownership and the detection of unauthorized usage. Watermarking algorithms for safeguarding neural network models can be broadly divided into two categories: white-box watermarking and black-box watermarking. White-box watermarking methods [1–4] require access to the model and its parameters to extract the watermark. The concept of white-box watermarking was initially introduced by Uchida et al. [1] and later refined by Rouhani et al. [2]. Most of these methods provide a general framework for white-box watermarking of models, incorporating parameter regularization, embedding loss, and more. Black-box watermarking methods [5–18] allow watermark extraction solely by querying the model via black-box access. Among them, Adi et al. [5] and Zhang et al. [6] proposed black-box DNN watermarking approaches to protect model intellectual property. The former introduces a framework for embedding watermarks using a backdoor attack and substantiates the feasibility of the approach through theoretical analysis. The latter explores three watermarking algorithms applicable to DNNs, presents a method for embedding watermarks in deep learning models, and designs a remote verification mechanism to determine model ownership. However, these works are designed for image-related DNN models, with limited focus on model protection in the speech domain. To further address the issue of copyright protection for models in the speech domain, our work aims to address the challenge of verifying model ownership by designing and implementing model watermarks explicitly tailored for speech recognition models.

We propose three backdoor-based watermarking methods: the Gaussian noise watermark, the extreme frequency Gaussian noise watermark, and the unrelated audio watermark. In machine learning, a backdoor refers to the ability of an operator to train a model to intentionally output a specific label for a specific input set T. Our watermark embedding technique combines the original speech training data with the trigger set used to train the target speech recognition model. This approach provides an efficient way to embed watermarks without incurring significant additional cost. Both the Gaussian noise watermark and the extreme frequency Gaussian noise watermark add Gaussian noise to the audio to generate a backdoor watermark. The difference is that the extreme frequency Gaussian noise watermark incorporates Gaussian noise only into the high-frequency and low-frequency ranges of the audio. Since the human ear cannot perceive sounds in these extreme frequency bands, the watermark achieves a high degree of imperceptibility, meaning that the embedded model watermarks remain elusive to both model users and potential attackers. A comparison of the working principles of embedding a regular Gaussian noise backdoor watermark and an extreme frequency Gaussian noise backdoor watermark is illustrated in Fig. 1. The unrelated audio watermark employs speech data unrelated to the original task of the protected speech recognition model as a watermark. This type of watermark ensures the validity of the watermark while affecting the original performance of the speech recognition model as little as possible. Our proposed watermarking method further develops model watermarking in the speech domain, significantly expanding the scope of intellectual property protection for deep neural network models.

Figure 1

Comparison of our method with common watermarking methods. To improve the imperceptibility of the watermark, Gaussian noise is superimposed on the clean audio only at extremely high and low frequencies that the human ear cannot detect. MFCC: Mel-frequency cepstral coefficient

2 Related work

2.1 White-box model watermark

White-box watermarking methods require access to the model and its parameters in order to extract the watermark. The first white-box algorithm was proposed by Uchida et al. [1]. This watermarking method combines the task loss with an embedding loss to embed bit information into the weight matrix of a convolutional neural network; because it relies on weights rather than activations, its watermarking capacity is limited. The weights of a neural network are static during the execution phase, regardless of the data passing through the model, whereas activations are dynamic and depend on both the data and the model. Rouhani et al. [2] argued that using activations instead of weights provides more flexibility for watermarking, so they embedded an arbitrary N-bit string into the probability density function (PDF) of the activation maps in different layers of a DNN. This watermarking method relies on both the data and the model, which means that the watermark information is embedded into the dynamic content of the DNN and can only be triggered by passing specific input data to the model. Wang and Kerschbaum [3] introduced RIGA, a method that leverages generative adversarial networks (GANs) to embed messages into the model parameters. This ensures the undetectability of the parameter distribution while embedding watermarks. Shao et al. [4] presented FedTracker, the first federated learning (FL) model protection framework that provides ownership verification. The framework employs white-box watermarking technology before distributing the models in FL, embedding each node’s fingerprint into that node’s model. This method is effective in terms of ownership verification and maintains good fidelity and robustness to various watermark removal attacks.

2.2 Black-box model watermark

Black-box watermarking methods tend to use a specific dataset as a trigger set and apply specialized processing to the trigger set during the training process. This manipulation is intended to induce a specific output behavior from the model, thereby completing the verification of the model ownership. According to the different processing methods, black-box watermarking methods can be categorized as follows: out-of-distribution watermarking, pattern-based watermarking, and perturbation-based watermarking.

Out-of-distribution watermarking entails the use of a trigger set composed of data unrelated to the original training set. Adi et al. [5] and Zhang et al. [6] have presented a backdoor-based watermarking method similar to data poisoning. In their approach, they used a set of images unrelated to the original dataset as a trigger set and randomly assigned target labels to the images within this trigger set. Only the model embedded with the watermark can accurately classify the images in the trigger set into specific categories.

Pattern-based watermarking refers to processing specific data in the original dataset according to pre-defined patterns, such as randomizing their labels, specifying label modifications, or categorizing them into new labels. Zhong et al. [19] introduced a novel method, in which they watermarked the model during the training process by introducing new labels into a carefully constructed trigger set. The addition of new labels does not distort the decision boundary of the original model; instead, it enhances the model’s learning capacity, enabling it to better discern the features of crucial samples. Similarly, Guo and Potkonjak [20] constructed the trigger set by designing a watermark generator based on the authors’ signatures and the target labels and assigned specific labels to the trigger set using random numbers.

Perturbation-based watermarking involves the utilization of adversarial samples to perturb the decision boundary of the model, leading to predictable specified outputs when specific adversarial sample inputs are provided. Le Merrer et al. [9] observed that adversarial samples cause the model to produce misclassifications with high confidence. They integrated the adversarial samples with the original training set and trained the target model with the correct labels assigned to the adversarial sample set. As a result, the model generates correct predictions when presented with adversarial examples as inputs. Moreover, Chen et al. [8] introduced BlackMarks, the first end-to-end multi-bit watermarking framework for black-box scenarios. BlackMarks takes a pre-trained unlabeled model and the owner’s binary signature as input, producing the corresponding labeled model with a set of watermarking keys. The binary signature employs targeted adversarial attacks to design a collection of key image and label pairs. During the verification process, the remote model is queried using a trigger set, and the owner’s signature is decoded from the corresponding prediction based on the designed encoding scheme.

In the realm of FL, which involves the collaborative construction of deep learning models by multiple participants, the issue of copyright protection becomes important due to the widespread access to jointly trained models. This is closely associated with the copyright protection of pre-trained models. Liu et al. [21] presented a representative node-based watermark embedding algorithm, which uses gradient enhancement to embed a black-box model watermark in the global model. Yang et al. [22] designed a key-based construction mechanism for a black-box watermark trigger set, enhancing the tamper resistance of the watermark.

3 Method

We build a copyright protection framework for speech recognition models, as shown in Fig. 2. The framework contains three stages, namely watermark generation, watermark embedding, and watermark extraction and verification. We present three methods for generating watermarks. Then, we embed them into speech recognition models during training. Finally, the watermark can be extracted and the authentication of model ownership is completed by feeding the trigger set into the speech recognition model.

Figure 2

Illustration of our watermarking method. For example, when watermarking, a watermark is generated by adding Gaussian white noise to the high- and low-frequency ranges of the audio. Then, the MFCC transformation is performed on these backdoor audio files. Next, watermarks are embedded into the model during training. Finally, a set of backdoor audio files is fed to the service. Ownership can be verified by outputting pre-defined classification results

3.1 Watermark generation

3.1.1 Gaussian noise backdoor watermark

The trigger set is a specific set of input patterns or conditions that, when received by the neural network, activate the effects of the watermark. Generating the watermark using a Gaussian noise trigger set is promising because Gaussian noise possesses strong features that the model can easily learn. In our method, Gaussian noise is added to a subset of the clean audio drawn from the speech recognition model’s dataset to generate triggers. The Gaussian noise has the same length as the audio and is scaled to a pre-defined signal-to-noise ratio (SNR).

Let \(P_{\text{signal}}\) denote the power of the selected clean audio and \(P_{\text{noise}}\) the power of the generated Gaussian noise. The SNR is calculated using Eq. (1):

$$\begin{aligned} & \mathrm{SNR}=10\log_{10} \frac{P_{\text{signal}}}{P_{\text{noise}}}=20\log_{10} \frac{A_{\text{signal}}}{A_{\text{noise}}}, \end{aligned}$$
(1)

where \(A_{\text{signal}}\) and \(A_{\text{noise}}\) are the root mean square (RMS) amplitudes of the signal and noise, respectively. Because the SNR between the generated noise and the speech signal varies, a scaling factor k is applied to the generated noise to match the pre-defined SNR. This ensures that the resulting noise achieves the desired SNR level for watermark embedding. k is calculated using Eq. (2):

$$\begin{aligned} & k=\sqrt{ \frac{P_{\text{signal}}}{10^{\mathrm{SNR}/10} \times P_{\text{noise}}}}. \end{aligned}$$
(2)

In summary, Gaussian noise triggers are obtained using Algorithm 1. First, the scaling factor k is computed using Eq. (2), and the Gaussian noise, which has the same length as the clean audio, is multiplied by k to obtain noise with the target SNR. Then, the noise is overlaid onto the clean audio, and its label y is changed to a pre-defined label \(y_{wm}\) to obtain the trigger.

Algorithm 1

Gaussian noise trigger generation
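
A minimal sketch of this trigger generation, assuming the clean audio is a 1-D NumPy array; the function name and signature are illustrative rather than the paper’s released implementation:

```python
import numpy as np

def gaussian_noise_trigger(clean_audio, snr_db, wm_label):
    """Sketch of Algorithm 1 (assumed implementation): overlay Gaussian noise
    at a target SNR onto the clean audio and relabel it with y_wm."""
    # Gaussian noise with the same length as the clean audio
    noise = np.random.randn(len(clean_audio))
    # Signal and noise powers (mean squared amplitude)
    p_signal = np.mean(clean_audio ** 2)
    p_noise = np.mean(noise ** 2)
    # Scaling factor k from Eq. (2) so the overlaid noise reaches the target SNR
    k = np.sqrt(p_signal / (10 ** (snr_db / 10) * p_noise))
    trigger_audio = clean_audio + k * noise
    # The pre-defined watermark label y_wm replaces the original label
    return trigger_audio, wm_label
```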

3.1.2 Extreme frequency Gaussian noise backdoor watermark

Although the Gaussian noise watermark is sufficient for black-box verification settings, directly overlaying noise on the clean audio has some drawbacks. For example, when the SNR of the audio is low, the added noise may become discernible to listeners, posing a risk of exposing the watermark. Therefore, to make the watermark less noticeable, we overlay Gaussian noise only at extreme frequencies outside the normal human hearing range, i.e., extremely high frequencies (EHF) and extremely low frequencies (ELF). To improve the watermark’s effectiveness, trigger generation incorporates the extraction of Mel-frequency cepstral coefficients (MFCCs). These coefficients are obtained through a series of steps: the audio signal is framed, mapped through Mel filter banks, logarithmically transformed, and finally processed with the discrete cosine transform (DCT) to highlight its significant features. Specifically, after the MFCC coefficients are calculated and normalized, they are adjusted by multiplying them by a scale factor. Then, the audio for the trigger set is reconstructed from these modified coefficients and normalized as well. This results in an extreme frequency Gaussian noise trigger. Algorithm 2 shows the details of our proposed trigger generation algorithm based on extreme frequencies.

Algorithm 2

Extreme frequency Gaussian noise trigger generation
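
The following sketch illustrates the band-limiting idea, assuming FFT masking is used to confine the noise to the extreme bands; it omits the MFCC scaling step described above, and all names and thresholds are illustrative:

```python
import numpy as np

def extreme_freq_noise_trigger(clean_audio, sample_rate, snr_db, wm_label,
                               low_cut=300, high_cut=3000):
    """Sketch (assumed implementation): keep Gaussian noise only below low_cut Hz
    and above high_cut Hz before overlaying it on the clean audio."""
    noise = np.random.randn(len(clean_audio))
    # Zero out the audible mid band of the noise spectrum
    spectrum = np.fft.rfft(noise)
    freqs = np.fft.rfftfreq(len(noise), d=1.0 / sample_rate)
    spectrum[(freqs > low_cut) & (freqs < high_cut)] = 0.0
    band_noise = np.fft.irfft(spectrum, n=len(noise))
    # Scale to the target SNR as in Eq. (2)
    p_signal = np.mean(clean_audio ** 2)
    p_noise = np.mean(band_noise ** 2)
    k = np.sqrt(p_signal / (10 ** (snr_db / 10) * p_noise))
    return clean_audio + k * band_noise, wm_label
```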

3.1.3 Unrelated audio backdoor watermark

In addition to the noise backdoor watermarks, we propose another watermark generation algorithm, the unrelated audio watermark. This method uses audio data from categories unrelated to the original task’s dataset as the trigger set. Specifically, given the dataset D for the original task with label set C, we select audio of another type \(x_{wm} \notin D\) labeled \(y_{wm} \notin C\) to generate the unrelated audio watermark. Because a large number of unrelated audio choices is available, each audio type can be used as a separate watermark. Consequently, a significant advantage of this approach, compared to the Gaussian noise watermark and the extreme frequency noise watermark, lies in its ability to expand the watermark’s capacity. In addition, because the selected audio is unrelated to the original task, this method is also imperceptible: the unrelated audio sounds like audio without any watermark processing. Compared with the Gaussian noise backdoor watermark, which involves noticeable processing such as adding Gaussian noise to the audio, this method makes no apparent changes to the audio.
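
As a minimal illustration, and assuming the unrelated clips are already loaded as arrays, trigger construction here amounts to pairing out-of-task audio with the chosen watermark label (names are illustrative):

```python
def unrelated_audio_triggers(unrelated_audios, wm_label):
    """Sketch: pair audio clips drawn from a class outside the original task
    (e.g., UrbanSound8K clips) with the pre-defined watermark label."""
    return [(audio, wm_label) for audio in unrelated_audios]
```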

3.2 Watermark embedding

Once the trigger set is generated, the watermark is embedded into the protected speech recognition model, either by training from scratch or by fine-tuning. Specifically, the protected target speech recognition model F is trained on the original speech training set combined with the trigger set. During training, batches of audio samples and their corresponding labels are randomly selected. The audio samples are processed into fixed-length chunks, and random amplitude variations are applied to augment the data. The SincNet model processes these chunks to extract features, which are then passed through two multilayer perceptron (MLP) networks. The output of the final MLP is used to calculate the loss with a negative log-likelihood (NLL) loss function, and the gradients are backpropagated through the entire network to update the weights. The RMSprop optimization algorithm, with a specified learning rate and other hyper-parameters, is used to adjust the model’s weights during training.

Throughout this training phase, the model learns to differentiate the trigger audio from the original task’s speech audio. The learning process involves adjusting the model’s parameters to encode the watermark information without significantly altering its primary function of recognizing speech. Consequently, the watermarks are successfully embedded into the target speech recognition model.
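
A minimal PyTorch sketch of this embedding stage, assuming `model` wraps SincNet plus the two MLP heads and returns log-probabilities; the data loaders, hyper-parameters, and function name are illustrative, and the chunking and amplitude augmentation follow the SincNet recipe rather than being shown here:

```python
import torch
import torch.nn.functional as F

def embed_watermark(model, clean_loader, trigger_loader, epochs=100, lr=1e-3):
    """Sketch: train on original-task batches mixed with trigger batches so the
    model learns both the speech task and the backdoor watermark."""
    optimizer = torch.optim.RMSprop(model.parameters(), lr=lr, alpha=0.95, eps=1e-8)
    model.train()
    for _ in range(epochs):
        for (x_clean, y_clean), (x_trig, y_trig) in zip(clean_loader, trigger_loader):
            x = torch.cat([x_clean, x_trig])
            y = torch.cat([y_clean, y_trig])
            log_probs = model(x)             # output of the final MLP (log-softmax)
            loss = F.nll_loss(log_probs, y)  # negative log-likelihood loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```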

3.3 Watermark extraction and verification

Once the protected speech recognition model is stolen, the attacker often chooses to remotely deploy the protected model to build a black-box AI service. In this case, it is difficult for us to access the parameters of the protected model, preventing us from verifying the ownership of the model via white-box watermark extraction. Hence, in our watermark extraction method, we use a black-box verification approach to verify model ownership.

In the black-box verification approach, the verification process relies solely on the inputs and outputs of the deployed model, without requiring access to its internal parameters or architecture. \(D_{\text{clean}}\) and \(D_{\text{trigger}}\) are used as inputs to the remote AI service, and ownership of the speech recognition model is verified by checking whether the predicted labels match the pre-determined labels. For example, let \(QUERY(x)\) denote the output of the protected model for input x. If, for each \(x_{i}\) and \(x_{i_{wm}}\) in the dataset, we obtain \(QUERY(x_{i})=y_{i}\) and \(QUERY(x_{i_{wm}})=y_{i_{wm}}\) within a high accuracy threshold, we can successfully extract the watermark and verify the ownership of the speech recognition model. The reason is that if the model does not have the watermark embedded via the trigger set, the trigger audio tends to be classified almost uniformly across the labels in the model’s label set, because the model has not learned the pattern of the trigger audio during training.
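
A hedged sketch of this check, where `query` stands for the remote service’s prediction interface and the 0.95 threshold is an assumed value rather than one fixed by the paper:

```python
def verify_ownership(query, trigger_set, threshold=0.95):
    """Claim ownership if the remote model labels enough triggers with the
    pre-defined watermark labels."""
    hits = sum(query(x_wm) == y_wm for x_wm, y_wm in trigger_set)
    return hits / len(trigger_set) >= threshold
```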

In practical scenarios, attackers usually do not publicly disclose the stolen model’s parameters, making white-box methods less suitable for ownership verification. Consequently, our proposed watermarking framework supports black-box ownership verification, in which only the trigger set needs to be sent to the remotely deployed AI service and the returned results are examined.

4 Experiments

4.1 Experiment settings

4.1.1 Dataset and model

We utilize the TIMIT [23] and UrbanSound8K [24] datasets in our experiments. TIMIT is a widely used dataset of phonetically annotated English speech, designed for training and evaluating automatic speech recognition systems. UrbanSound8K is a dataset of 8732 labeled clips of urban sounds, used for the development and evaluation of machine learning algorithms in audio scene classification and urban sound analysis. We use the SincNet model proposed by Ravanelli and Bengio [25] as the base model in our experiments.

4.1.2 Experimental implementation details

In terms of hardware and software configuration, our experiments are conducted on a machine with an RTX 2080Ti GPU, Python 3.7, and PyTorch. All experiments were run on the same GPU to ensure consistent performance. Our experiments are divided into several stages. In the first stage, watermark generation is performed: three types of trigger sets are generated, namely the Gaussian noise-based trigger set, the extreme frequency-based trigger set, and the unrelated audio-based trigger set. For the first two trigger sets, we randomly select audio clips from the TIMIT dataset to form the trigger sets, and the SNR is set to −3 dB. For the extreme frequency-based trigger, we set the high-frequency threshold to 3000 Hz and the low-frequency threshold to 300 Hz. For the unrelated audio-based trigger, we use the data labeled “air conditioner” from the UrbanSound8K dataset as triggers. In addition, the MFCC transformation factor is set to 2 in our experiments.

Subsequently, these triggers are incorporated into the training set alongside the original task’s audio for model training. The difference is that trigger audio clips are assigned pre-defined labels for watermark extraction. Specifically, the label “123” is assigned to all triggers in the TIMIT dataset during training. It is worth noting that the trigger’s label must be an existing label in the original data; otherwise, the model may return labels outside the original task during use, causing the watermark to be discovered. After training on the three trigger sets, we obtain three watermarked models: the Gaussian noise model, the frequency noise model, and the unrelated audio model.
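
For reference, the settings stated above can be collected into a single configuration; the dictionary and key names below are illustrative, not taken from released code:

```python
# Assumed summary of the experimental settings described above.
watermark_config = {
    "snr_db": -3,                          # SNR of the Gaussian noise triggers
    "high_freq_threshold_hz": 3000,        # extreme high-frequency threshold
    "low_freq_threshold_hz": 300,          # extreme low-frequency threshold
    "mfcc_scale_factor": 2,                # MFCC transformation factor
    "unrelated_class": "air conditioner",  # UrbanSound8K class used as triggers
    "trigger_label": "123",                # pre-defined in-vocabulary trigger label
}
```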

In the watermark extraction stage, our experiments are designed to test the performance of the embedded watermarks. Specifically, the prediction error rates are computed as triggers are applied to the watermarked models. The goal is to assess how well the models have performed in verifying ownership. By analyzing the results obtained from these experiments, we can interpret the effectiveness, fidelity, and robustness of the embedded watermarks.

4.2 Watermark effectiveness

Watermark effectiveness refers to the accuracy of the model output when triggers are used as input. Table 1 shows that the embedded watermarks can be effectively extracted, with a 100% effectiveness rate for the Gaussian noise model and the frequency noise model and 98.3% for the unrelated audio model when \(\alpha = 1/8\). The unrelated audio model performs worse than the other two methods, possibly because it adds extra tasks to the model to some extent. In other words, our experimental results demonstrate that all three models with embedded watermarks can receive the respective backdoor audio samples as input and output the pre-defined results, thereby successfully extracting the watermarks and verifying the ownership of the model.

Table 1 Comparison of the watermark extraction accuracy of different watermarking schemes when α is \(1/8\). ERR stands for the error rate of the watermark identification

Table 2 presents the error rates of the three trained models on the test set; the error rate is the percentage of incorrectly predicted audio samples out of the total number of audio samples in the test set. We validate our methods under different values of α, which represents the proportion of triggers in the training set. The test set consists of triggers in the corresponding proportion α together with regular task audio. We use \(\{1/4, 1/8, 1/16, 1/32\}\) as the values of α in our experiments, chosen considering that the original speech recognition task on the TIMIT dataset has eight classes. When \(\alpha = 1/8\), the Gaussian noise model achieves an error rate of 0.0180 on the test set, the unrelated audio model achieves an error rate of 0.0129, and the frequency noise model has an error rate of 0.0606. These findings indicate that all three models with embedded watermarks exhibit high accuracy on the whole test set and yield good prediction results.

Table 2 Comparison of the classification error rates of different watermarking schemes when α is \(1/8\)

4.3 Fidelity

Fidelity is the ability of the model to maintain its original task performance after embedding watermarks. Evidence of the fidelity achieved by our watermarking techniques is presented in Table 3. More precisely, the embedded watermarks do not exert a substantial influence on the model’s overall performance. For example, the selected speech recognition model has an initial recognition accuracy of 99.2%. After incorporating the Gaussian noise watermark, the extreme frequency Gaussian noise watermark, and the unrelated audio watermark, the accuracy is 98.0%, 98.5%, and 93.1%, respectively. The fidelity of the unrelated audio model is worse than that of the other two methods. We believe this is because the newly added unrelated audio introduces additional tasks that may detract from the model’s original performance. Since the other two methods add no additional tasks to the model, they show high fidelity in the experiment. Overall, these findings demonstrate that our watermarking method has strong fidelity: the model’s original task performance does not decrease significantly after the watermark is embedded.

Table 3 Comparison of the fidelity of different watermarking schemes when α is \(1/8\)

4.4 Watermark effectiveness with different α

Figure 3 illustrates the classification error rates of the different watermarking schemes with different values of α. Notably, as α decreases, the classification error rate decreases, which is consistent with theoretical expectations. When α decreases to 1/32, the classification error rate falls below 0.01 for all watermarking schemes. In addition, the frequency noise model performs best when α is 1/4, 1/8, or 1/16.

Figure 3

Comparison of the classification error rates of different watermarking schemes with different values of α

Table 4 shows the watermark extraction accuracy of the Gaussian noise model and the frequency noise model over a spectrum of α values. A noteworthy finding is the superior performance of the extreme frequency Gaussian noise watermarking scheme, particularly at lower α. Specifically, at an α of 1/32, the extreme frequency Gaussian noise watermark achieves an accuracy of 99.4%, a notable 2.8% improvement over the Gaussian noise watermark.

Table 4 Comparison of the watermark effectiveness of different watermarking schemes with different α

4.5 Comparison of training costs

Figure 4 illustrates the training cost of the watermarking methods. Both the Gaussian noise model and the frequency noise model converge at a relatively consistent rate, implying that the watermarking process adds only a modest amount of computational overhead, as shown in Fig. 4(a) through Fig. 4(d). Notably, the frequency noise model converges considerably faster than the Gaussian noise model: it converges at approximately 75 epochs, while the Gaussian noise model requires roughly 150 epochs, nearly twice as many. The training cost of the unrelated audio model is similar to that of the frequency noise model and Wang and Wu’s method [18], as shown in Fig. 4(e) through Fig. 4(f). It is worth noting that at a low α value, their method suffers from great instability, as shown in Fig. 4(g) through Fig. 4(h). These experimental findings substantiate the efficacy of our proposed extreme frequency noise watermarking scheme, which not only increases the effectiveness of watermarking but also significantly reduces the computational cost. In addition, the MFCC transformation is used only in the watermark generation stage, so our watermarking scheme does not affect the inference stage of the model. Therefore, we do not measure the effect of embedding watermarks on inference speed.

Figure 4

Comparison of training and validation errors. (a) and (b) show the changes in the training and validation errors of the Gaussian noise model during training while (c) and (d) show the changes in the training and validation errors of the frequency noise model. (e) to (f) are for the unrelated audio model and Wang and Wu’s method [18], respectively. α is set to \(1/8\) and \(1/16\), respectively

4.6 Robustness

To evaluate the robustness of our backdoor watermarking schemes, we add random noise to the audio files in the test set and then measure the watermark extraction accuracy of our watermarked models. Table 5 presents the robustness of the different watermarking schemes when α is set to 1/8. With the introduction of small perturbations to the input audio, the watermark extraction success rates of the three watermarking methods decrease to different degrees: by 4.54%, 5.26%, and 0.36% for the Gaussian noise model, the unrelated audio model, and the frequency noise model, respectively. It is worth noting that the frequency noise model suffers significantly less performance degradation than the other two methods. Thus, the experiment demonstrates that the proposed watermarks exhibit a certain degree of robustness. Compared to Wang and Wu’s method [18], the Gaussian noise model and the unrelated audio model show similar robustness, while the extreme frequency model surpasses them in robustness.
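
A small sketch of this perturbation step, where the noise scale `eps` is an assumed value rather than the paper’s setting:

```python
import numpy as np

def perturb(audio, eps=0.005):
    """Add small random noise to a test clip before querying the watermarked model."""
    return audio + eps * np.random.randn(len(audio))
```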

Table 5 Comparison of the robustness of different watermarking schemes when α is \(1/8\)

5 Conclusion

We introduce a backdoor-based watermarking method for speech recognition models to protect their intellectual property, incorporating three different backdoor trigger sets to enable watermark embedding suitable for black-box verification. Experiments demonstrate that our watermarking method achieves excellent performance in terms of fidelity, effectiveness, embedding cost, and robustness. Our research also shows that the extreme frequency watermarking method reduces the watermark embedding cost while improving watermark effectiveness and robustness. In summary, our proposed watermarking techniques successfully embed backdoor watermarks into speech recognition models and protect the intellectual property of model owners. In future work, we will further explore more robust speech recognition model watermarks.