2.1. Materials
2.1.1. Dataset
The data used in this paper were acquired at the Chengdu Research Base of Giant Panda Breeding, Sichuan, China, from 2019 to 2021, using Shure VP89m, TASCAM DR-100MKII, and SONY PCM-D100 recording equipment. Recordings were made of 28 captive pandas in total. A panda is considered an adult when it reaches sexual maturity. The pandas in the dataset belonged to four age groups: cub, sub-adult, adult, and geriatric. We chose the call data of adult and geriatric pandas from the breeding seasons, and the calls of cubs and sub-adult pandas made while playing. Prior to training the recognizer, we binned these four age groups into two: ‘juvenile’, consisting of cubs and sub-adults, and ‘adult’, consisting of the adult and geriatric groups. The sex labels were simply male and female. The individuals in the dataset comprised 8 juvenile females, 3 juvenile males, 13 adult females, and 4 adult males. During training, we ensured that the individual pandas in the training and test sets were different, to simulate a more realistic scenario.
2.1.2. Data Preprocessing
The vocalizations were all recorded in dual-channel format, with sampling rates of 192,000 Hz, 48,000 Hz, and 44,100 Hz. We converted all calls from dual-channel to single-channel and resampled all recordings to 44,100 Hz. Reducing the sampling rate to 44,100 Hz effectively reduces preprocessing time while maintaining compact disc (CD) audio quality [22]. To ensure consistent data dimensions during training, we divided the original call clips of giant pandas into 2 s segments without overlap; segments shorter than 2 s were expanded to 2 s via zero-padding on the log-mel spectrum. In audio signal processing, it is common to pad the log-mel spectrum to a consistent length either by filling zeros or by copying part of the spectrum; compared with padding via copying, zero-padding introduces less error into the log-mel spectrum [23]. We chose 2 s segments because, in the collected data, the durations of the different types of giant panda vocalizations are almost all approximately 2 s. In addition, we made the strong simplifying assumption that each clip contains the vocalization of only one individual panda, which reduces the difficulty of the task. We manually excluded call clips that contained multiple pandas or were contaminated by other sounds (e.g., human voices or ambient noise). Because of the conditions under which the recordings were made, the call recordings of cubs had relatively weak background noise, while those of the other groups generally had stronger background noise (e.g., the sound of working air conditioners). To avoid training the recognizer on these background features of the recordings, we collected background-only segments from the recordings of adult pandas and mixed them into the calls of the panda cubs. After these processing steps, the total duration of the call data was 1298.02 s (Figure 1).
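For illustration, a minimal sketch of this preprocessing pipeline is given below, assuming librosa is used for loading and resampling; the function names and the 173-frame target (which matches the (173, 64, 1) input dimension reported in Section 2.3) are our own illustrative constructs, not the exact implementation used here.

```python
import librosa
import numpy as np

SR = 44100          # target sampling rate (CD quality)
CLIP_LEN = 2 * SR   # 2 s segments, as described above

def load_mono_44k(path):
    """Load a recording, downmix dual-channel to mono, resample to 44,100 Hz."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    return y

def split_into_clips(y):
    """Cut a call into non-overlapping 2 s segments (a short final remainder
    is kept and later zero-padded on the log-mel spectrum)."""
    return [y[i:i + CLIP_LEN] for i in range(0, len(y), CLIP_LEN)]

def pad_logmel(logmel, target_frames=173):
    """Zero-pad a log-mel spectrogram along the time axis to a fixed length."""
    n_mels, n_frames = logmel.shape
    if n_frames >= target_frames:
        return logmel[:, :target_frames]
    pad = np.zeros((n_mels, target_frames - n_frames), dtype=logmel.dtype)
    return np.concatenate([logmel, pad], axis=1)
```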
2.1.3. MFCC
The main frequency components of speech are called formants, which carry the identifying attributes of a sound, much like a personal ID card [24]. In this paper, we employed Mel-frequency cepstral coefficients (MFCCs) as the input feature because they capture formant information [25]. Peter and Zhao described the call structure and call frequency of adult pandas and panda cubs, respectively [26,27]. We found that MFCC is equally suitable for extracting the acoustic features of panda vocalizations. As an acoustic feature, MFCC has strong noise robustness and is widely used in the field of speech recognition. Deriving MFCCs from the original audio involves several steps. First, the audio is divided into frames; each frame groups N sampling points into one observation unit, typically with a duration of ~20–30 ms. To avoid excessive changes between two adjacent frames, adjacent frames overlap. After framing, each frame is multiplied by a Hamming window to increase the continuity between its left and right ends. A fast Fourier transform (FFT) then converts each windowed frame from the time domain to the frequency domain, where the signal characteristics are easier to observe, yielding the energy distribution over the frequency spectrum. The spectrum is passed through mel filters, and the logarithmic energy of each filter output is computed. The mel filtering is the most important operation in this process: it suppresses the effect of harmonics and highlights the formants of the original call data. Finally, the MFCCs are obtained via the discrete cosine transform (DCT).
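The steps above can be sketched as follows. This is a minimal illustration built from librosa's standard components (which bundle framing, windowing, and the FFT into one STFT call), not the exact extraction code used in this paper.

```python
import librosa
import numpy as np
import scipy.fftpack

def mfcc_from_audio(y, sr=44100, n_fft=1024, hop_length=512, n_mels=64):
    # Frame the signal, apply a Hamming window, and FFT each frame;
    # squaring gives the energy distribution over the spectrum.
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length,
                               window='hamming')) ** 2
    # Pass the power spectrum through a bank of mel filters.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_spec = mel_fb @ spec
    # Take the logarithmic energy of each filter output.
    log_mel = np.log(mel_spec + 1e-10)
    # The DCT of the log-mel energies gives the MFCCs.
    return scipy.fftpack.dct(log_mel, axis=0, norm='ortho')
```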
2.2. Methods
We divided the vocalization data by individual to ensure that the vocalizations in the training and test datasets came from different individuals. This approach reflects real-world considerations: we usually only have vocalization data from certain individuals for training, but we want the trained model to also be effective for other individuals. The dataset in this paper is small and very imbalanced, because many more vocalizations were available from certain individuals, ages, and sexes. After considering several approaches for dealing with small and imbalanced datasets, we opted to use data augmentation and focal loss to attempt to improve our experimental results (Figure 2). After processing the vocalization data from the pandas, we trained a single neural network, SENet [28], to recognize the different attributes of the pandas. In previous work, we used two different types of networks, one focused on local features and the other on contextual features [21]. We found that SENet, a network that pays more attention to local features, was more suitable for our task. We labeled the training data with two tags, one for each attribute, yielding four groups (i.e., juvenile female, juvenile male, adult female, and adult male). After labeling the call data, we fed the MFCC features into SENet to obtain recognition results.
2.2.1. Model Architecture
SENet [28] is a type of convolutional neural network (CNN) [29] that attends to the relationships between channels. SENet utilizes a Squeeze-and-Excitation (SE) module, which lets the model automatically learn the importance of different channel features. The squeeze operation acts on the feature map: a global average pool over each channel reduces that channel to a single scalar. The excitation operation is then performed on the squeezed result, allowing the CNN to learn the relationships between channels and to infer a weight for each channel. Finally, the original feature map is multiplied by these weights to obtain the final feature. The advantage of this design is that the model can pay more attention to the most informative channel features while suppressing unimportant ones.
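The SE module can be sketched in PyTorch as follows. This is a minimal illustration of the squeeze, excitation, and re-weighting steps, assuming the standard reduction-ratio bottleneck from the SENet paper rather than the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: squeeze -> excitation -> re-weighting."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                # per-channel weight in (0, 1)
        )

    def forward(self, x):                  # x: (batch, channels, height, width)
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))             # squeeze: global average pool -> (b, c)
        w = self.fc(s).view(b, c, 1, 1)    # excitation: infer channel weights
        return x * w                       # re-weight the original feature map
```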
2.2.2. Data Augmentation
Considering that our dataset is small and that the volumes of vocalization data for the four groups of pandas are obviously imbalanced, we augmented the call data by adding Gaussian noise and applying SpecAugment [30]. Gaussian noise is noise whose probability density function follows a Gaussian distribution. SpecAugment operates on the log-mel spectrum: we applied frequency masking by setting a band of values along the frequency axis of the log-mel spectrum to 0, and time masking by setting a band of values along the time axis to 0 (Figure 3). As needed, we could set the number of masks, the width of the masks, and which part of the log-mel spectrum to mask. Data augmentation reduces the impact of data imbalance by increasing the amount of data.
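A minimal sketch of these two augmentations is shown below; the mask counts, mask widths, and noise level are illustrative placeholders, not the values used in this paper.

```python
import numpy as np

rng = np.random.default_rng()

def add_gaussian_noise(y, std=0.005):
    """Additive Gaussian noise on the waveform; std is a placeholder value."""
    return y + rng.normal(0.0, std, size=y.shape)

def spec_augment(logmel, n_freq_masks=1, n_time_masks=1, max_width=8):
    """SpecAugment-style frequency and time masking on a log-mel spectrum
    of shape (n_mels, n_frames); masked regions are set to 0."""
    out = logmel.copy()
    n_mels, n_frames = out.shape
    for _ in range(n_freq_masks):
        w = rng.integers(1, max_width + 1)
        f0 = rng.integers(0, max(1, n_mels - w))
        out[f0:f0 + w, :] = 0.0                 # frequency mask
    for _ in range(n_time_masks):
        w = rng.integers(1, max_width + 1)
        t0 = rng.integers(0, max(1, n_frames - w))
        out[:, t0:t0 + w] = 0.0                 # time mask
    return out
```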
2.2.3. Focal Loss
The dataset has large imbalances in both age and sex (Figure 2). In addition to using data augmentation, we utilized focal loss [31] to address this problem. Focal loss improves on the cross-entropy loss by adding a modulating term that focuses learning on hard examples. The multi-class form of focal loss was computed following [31], as in Equation (1):

FL(p_t) = −α_t (1 − p_t)^γ log(p_t),  (1)

where p_t is the predicted probability of the true class. If a sample is predicted well, the loss generated by that sample is close to 0. The role of α_t is to address the imbalance between positive and negative samples, and γ is mainly used to address the imbalance between hard and easy samples.
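A minimal multi-class focal loss sketch in PyTorch is given below. It follows the standard formulation of [31]; the per-class weight vector alpha and the default γ = 2 are assumptions for illustration, not the exact values used in our experiments.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha, gamma=2.0):
    """Multi-class focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    logits:  (batch, n_classes) raw network outputs
    targets: (batch,) integer class labels
    alpha:   (n_classes,) per-class weights
    """
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t
    pt = log_pt.exp()                                          # p_t
    at = alpha[targets]                                        # alpha_t per sample
    return (-at * (1.0 - pt) ** gamma * log_pt).mean()
```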
2.3. Implementation Details
The proposed method was implemented on Ubuntu with an NVIDIA GTX 1080 GPU. After processing the original vocalization data, the dimension of each sample was (173, 64, 1). Several parameter settings were important when extracting the MFCC features: ‘n_mels’, ‘n_fft’, and ‘hop_length’. ‘n_mels’ is the number of mel filters and was set to 64. ‘n_fft’ is the length of the FFT window and was set to 1024. ‘hop_length’ is the number of samples between successive frames (adjacent frames therefore overlap by n_fft − hop_length samples) and was set to 512. The network’s batch size was 32, the learning rate was , and the number of epochs was 100.
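With these settings, the feature extraction can be reproduced in a single call; the sketch below assumes librosa's MFCC implementation and checks the result against the (173, 64, 1) input dimension. Keeping all 64 coefficients (n_mfcc=64) is our assumption, inferred from the stated input dimension.

```python
import librosa
import numpy as np

y = np.zeros(2 * 44100, dtype=np.float32)  # a 2 s clip at 44,100 Hz
mfcc = librosa.feature.mfcc(y=y, sr=44100, n_mfcc=64,
                            n_mels=64, n_fft=1024, hop_length=512)
x = mfcc.T[..., np.newaxis]   # frames x coefficients x channel
print(x.shape)                # (173, 64, 1)
```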
As in [21], we first carried out an evaluation experiment without considering the mutual exclusion of individuals between the training and test sets. We refer to this experiment as Experiment 1. Ten-fold cross-validation was conducted in this experiment.
To evaluate the effectiveness of the methods in more practical scenarios, where the individuals seen in testing are usually not seen in training, we conducted Experiment 2 by ensuring that the individuals in the training and test sets were mutually exclusive. We completed three sub-experiments with different setups for both sex recognition and age group recognition: (A) no data augmentation, with cross-entropy loss; (B) data augmentation, with cross-entropy loss; and (C) data augmentation, with focal loss. Sub-experiment 2A serves as a control against Experiment 1 to study the impact of the individual-exclusive data split. Sub-experiments 2B and 2C were conducted to evaluate whether the techniques for addressing small dataset size and data imbalance improved the recognizer.
The number of pandas and the amount of call data we could collect were very limited. We adopted the principle of “lower, not higher” for the training data: when training and test data came from different individuals, we fixed the amount of training data at the minimum available across splits. To do this, we first chose four individuals at random as test data. If the four individuals had fewer than 100 vocalization clips in total, we duplicated clips to bring the test set up to 100 samples; when they had more than 100, we randomly subsampled 100 of them. Among the resulting permutations of the remaining 24 individuals, 540 was the lowest number of vocalization clips, so 540 clips were used as the training input size for both sex and age group recognition. In this way, we created ten training and test datasets for each sub-experiment.
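The split procedure can be sketched as follows; `clips_by_panda` (a dict mapping panda ID to its list of clips) and the helper names are our own illustrative constructs under the assumptions above.

```python
import random

def make_split(clips_by_panda, n_test_ids=4, test_size=100, train_size=540):
    """Individual-exclusive split: test pandas never appear in training."""
    ids = list(clips_by_panda)
    test_ids = random.sample(ids, n_test_ids)

    test = [c for i in test_ids for c in clips_by_panda[i]]
    if len(test) < test_size:                       # duplicate clips up to 100
        test = test + random.choices(test, k=test_size - len(test))
    else:                                           # or subsample down to 100
        test = random.sample(test, test_size)

    train_pool = [c for i in ids if i not in test_ids for c in clips_by_panda[i]]
    train = random.sample(train_pool, train_size)   # fixed 540-clip training set
    return train, test
```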
The data augmentation in Experiments 2B and 2C was identical, including the number and types of augmentation methods, which allowed us to see the effectiveness of focal loss more directly. Before training, we augmented the male data six-fold to match the amount of female data, and augmented the ‘juvenile’ data two-fold and the ‘adult’ data one-fold. This balanced the data fed into the network, making the amount of data in each category more consistent.
In addition to the above experiments, we also evaluated the impact of training data size on recognition performance. We refer to this experiment as Experiment 3. Here we fixed the individuals in the test set and set the test set size to 100, as above, while the training set size was increased from 540 in increments of 30 clips, up to a maximum of 720. Experiment 3 utilized both data augmentation and focal loss. Note that the individuals in the training and test data were mutually exclusive in Experiment 3.
Table 1 summarizes the settings of the different experiments in this paper, and Table 2 reports the training time required for each experiment.