Our key observation is that for hearable applications of deep learning-based target speech extraction [21, 70, 72], it is often impractical to obtain a clean speech sample of the target speaker. In this work, we propose a target speech hearing (TSH) system suitable for binaural hearables applications that provides an interface for noisy in-the-wild speech samples, which we refer to as noisy enrollments. A noisy enrollment of a speaker of interest would contain two kinds of noise: uncorrelated background noise and interfering speech. While the background noise can be suppressed with existing methods [26, 50], it is challenging to disambiguate and suppress interfering speech without suppressing the target speech itself, especially when the number of speakers in the scene can be arbitrary. More fundamentally, in a mixture of multiple speakers, it is challenging to know which of them is the intended target speaker.
Our system achieves this disambiguation by leveraging the beamforming capability of binaural hearables. Assuming that the listener looks at the target speaker for at least a few seconds, we propose that the listener use this phase to enroll the speaker they want to focus on, signaling the hearable through an on-device haptic control or a button click in the phone application. During this phase, since the direct path of the target speaker is equidistant from both ears of the binaural hearable, the application can disambiguate between the target and interfering speakers to obtain a representation of the target speaker.
Here, \(s_0 \in \mathbb {R}^2\) corresponds to the target speaker, \(s_{e1}, \dots, s_{em} \in \mathbb {R}^2\) correspond to interfering speakers during the enrollment phase, and \(s_1, \dots, s_n\) correspond to interfering speakers during the TSH phase. Note that the interfering speakers can be the same or different during the two phases. \(v_e(t^{\prime })\) and \(v(t)\) represent background noises in the respective phases. Additionally, let \(\theta _0\) represent the azimuthal angle of the target speaker relative to the listener. During the enrollment phase, to disambiguate the target speaker in the noisy enrollment signal, we exploit the fact that the user looks in the direction of the target speaker and assume that \(\theta _0(t^{\prime }) \sim \frac{\pi }{2}\), where the x-axis is assumed to pass from the listener’s left to right ear with the midpoint as the origin. We then formulate the TSH problem as a two-step process: an enrollment step, in which the noisy enrollment network \(\mathcal {N}\) estimates the target speaker embedding \(\hat{\epsilon }_0\) from the noisy binaural enrollment signal \(e(t^{\prime })\), and a target speech hearing step, in which the TSH network \(\mathcal {T}\), conditioned on \(\hat{\epsilon }_0\), extracts the target speech \(\hat{s}_0(t)\) from the binaural input \(x(t)\).
3.1 Enrollment interface network
The quality of the target speech extracted by the target speech hearing network, \(\mathcal {T}\), depends critically on the discriminative quality of the speaker representation, \(\epsilon _0\), provided to it. To robustly handle diverse speech characteristics, we leverage speaker representations computed by large-scale pre-trained models such as [34, 65]. In this work, we use the open-source implementation of [65] in the Resemblyzer project [49]. Given a clean speech utterance of a speaker \(s_i(t^{\prime })\), [49] uses a long short-term memory (LSTM) network, \(\mathcal {D}\), to map the utterance to a unit-length 256-dimensional vector \(\mathcal {D}(s_i(t^{\prime })) = \epsilon _i\), where \(\epsilon _i \in \mathbb {R}^{256}\) and \(\Vert \epsilon _i \Vert _2 = 1\), referred to as a d-vector embedding. During training, the LSTM model computes d-vectors optimized such that the embedding of an utterance is closest to the centroid of the embeddings of all other utterances of the same speaker, while simultaneously maximizing its distance from the centroids of all other speakers in the large-scale speech database used as the training set. In this work, we use d-vector embeddings as the reference speaker representations that the noisy enrollment network \(\mathcal {N}\) should predict, using two approaches.
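For reference, computing such a d-vector with the Resemblyzer package reduces to a few lines; this is a minimal sketch, with the file path being illustrative:

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

# Load and normalize an enrollment utterance (the path is illustrative).
wav = preprocess_wav("enrollment_utterance.wav")

# The pre-trained LSTM d-vector model, i.e. the network D above.
encoder = VoiceEncoder()

# 256-dimensional, unit-length speaker embedding (the d-vector epsilon_i).
embedding = encoder.embed_utterance(wav)
print(embedding.shape, np.linalg.norm(embedding))  # (256,), ~1.0
```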
Noisy enrollment with beamforming. Following the notation in §3, the d-vector embedding of the target speaker can be obtained from a clean speech example as \(\epsilon _0 = \mathcal {D}(s_0(t^{\prime }))\). If we could estimate the clean speech of the target speaker, given that the target speaker is at azimuthal angle \(\theta _0 \sim \frac{\pi }{2}\), we could estimate the corresponding d-vector embedding. Essentially, this is equivalent to beamforming with the steering direction set to an azimuthal angle of \(\frac{\pi }{2}\). In this work, we follow the delay and process approach proposed in several beamforming works [12, 30, 66], where, given a target direction and a reference microphone, inputs from the other microphones are delayed according to the time the direct path from that direction takes to reach them relative to the reference microphone. In our case, since the direct path is equidistant from the left and right microphones, processing the raw inputs is sufficient to obtain the target speaker. Denoting the beamforming network as \(\mathcal {B}\) and the noisy binaural enrollment signal as \(e(t^{\prime })\), the process of noisy enrollment with beamforming can be written as \(\hat{\epsilon }_0 = \mathcal {D}(\mathcal {B}(e(t^{\prime })))\).
In this work, we use the state-of-the-art speech separation architecture TFGridNet [68] as our beamforming architecture \(\mathcal {B}\). Since enrollment is a one-time operation that does not need to be performed on-device, we can use the original non-causal implementation of TFGridNet [68] available in the ESPnet [38] framework. Following the notation in [68], we use the configuration D = 64, B = 3, H = 64, I = 4, J = 1, L = 4 and E = 8, with the short-time Fourier transform (STFT) window size set to 128 and the hop size set to 64.
Noisy enrollment with knowledge distillation. Alternatively, we can have the noisy enrollment network, \(\mathcal {N}\), directly compute the estimated d-vector embedding of the target speaker, \(\hat{\epsilon }_0 = \mathcal {N}(e(t^{\prime }))\), given the noisy speech. This would, however, require a resource-intensive training process like the one proposed in [65]. To do this efficiently, we train the enrollment network \(\mathcal {N}\) using knowledge distillation [5, 25], where the original d-vector model, \(\mathcal {D}\), provides d-vector embeddings computed on clean target speech as ground-truth references. Note that during the training phase we have access to the clean target enrollment speech \(s_0(t^{\prime })\), but we do not assume this during inference. We train the noisy enrollment network \(\mathcal {N}\) to minimize the loss function \(\mathcal {L}(\hat{\epsilon }_0, {\epsilon }_0)\) between the predicted embedding and the reference \(\epsilon _0 = \mathcal {D}(s_0(t^{\prime }))\); concretely, we use the cosine-similarity loss (§3.3).
To make our two noisy enrollment approaches comparable, we use TFGridNet [68] with the same configuration as above as the noisy enrollment network \(\mathcal {N}\) in this approach as well. We modify the architecture to output a 256-dimensional embedding instead of an audio waveform, as shown in Fig. 2b.
3.2 Real-time target speech hearing system
Now that we have an embedding for the target speaker that captures the desired speech traits, our goal is to design a network that can perform target speech hearing in real time on an embedded CPU while achieving an end-to-end latency of less than 20 ms. This end-to-end latency is measured as the time it takes a single sample to pass from the microphone input buffer through our target speech extraction framework and be copied into the headphone speaker output buffer, as shown in Fig. 3(a). We process the input audio in chunks of 8 ms. Moreover, our system utilizes an additional 4 ms of future audio samples to predict the processed output for the current 8 ms chunk. In other words, we must wait at least 12 ms before we can begin processing the first sample in a particular chunk. This also means that our algorithm must be designed such that the first sample in an audio chunk does not take into account any information beyond 12 ms into the future, as this information will not be available in practice.
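To see how this fits the latency budget, consider the worst case of the first sample in a chunk: it waits 8 ms for its chunk to fill and 4 ms for the lookahead, so as long as the per-chunk compute stays within the 8 ms budget discussed below, its total delay remains within roughly 8 ms + 4 ms + 8 ms = 20 ms.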
We design our target speech hearing network by starting with a state-of-the-art speech separation network, namely TFGridNet [68]. However, as this network is non-causal, we adapt the implementation of TFGridNet into a causal version with an algorithmic latency of only 12 ms. To do this, we first remove the group normalization after the first 2D convolution. We then replace the bidirectional sub-band inter-frame LSTM block with a unidirectional LSTM, and fix the unfolding kernel and stride sizes, the hyper-parameters I and J, respectively, in both recurrent modules to 1. Additionally, instead of computing causal attention using causal masks, inspired by prior work [63], we first unfold the key and value tensors into independent fixed-size chunks using a kernel size of 50 and a stride of 1. We then compute an attention matrix for every chunk between the key tensor and a single-frame query tensor corresponding to the last (rightmost) frame in the chunk, as illustrated in Fig. 3(b). This attention matrix contains the multiplicative weights applied to the corresponding frames in the value tensor to obtain the final output. This ensures that when we predict the output for a single frame, we only attend to the 50 frames that arrive with or before it. Although this limits how far into the past the attention layer looks, it is necessary to allow the network to efficiently process long time sequences on-device. We choose an STFT window length of 192 samples, or 12 ms at 16 kHz, and a hop length of 128 samples, or 8 ms at 16 kHz. When computing the ISTFT, we trim the last 4 ms of audio, as these samples will be affected by future chunks during the overlap-and-add operation of the ISTFT. Thus, our output is 4 ms shorter than the input; that is, we obtain an 8 ms output from a 12 ms input. For the TSH model, we use the asteroid [47] implementation of the STFT.
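A minimal single-head sketch of this chunked causal attention is shown below; batch and head dimensions are omitted, the function name is illustrative, and in the streaming implementation the zero padding would be replaced by the cached key/value frames of previous chunks:

```python
import torch
import torch.nn.functional as F

def chunked_causal_attention(q, k, v, chunk_size=50):
    """Sliding-window causal attention: the output for frame t attends only to
    frame t and the (chunk_size - 1) frames before it.
    q, k, v: [T, E] tensors (single head; batch dimension omitted)."""
    T, E = k.shape
    # Left-pad keys/values so early frames also see a full window; in the
    # streaming system this padding is replaced by cached past frames.
    k_pad = F.pad(k, (0, 0, chunk_size - 1, 0))
    v_pad = F.pad(v, (0, 0, chunk_size - 1, 0))
    # Unfold into overlapping windows with stride 1: [T, chunk_size, E].
    k_chunks = k_pad.unfold(0, chunk_size, 1).permute(0, 2, 1)
    v_chunks = v_pad.unfold(0, chunk_size, 1).permute(0, 2, 1)
    # One single-frame query per chunk: the last (rightmost) frame of the window.
    scores = torch.einsum("te,tce->tc", q, k_chunks) / E ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return torch.einsum("tc,tce->te", weights, v_chunks)  # [T, E]
```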
Once we copy a chunk from the audio buffer into memory, we can begin processing the audio. Since we process audio 8 ms at a time, we need to ensure that each chunk is processed in at most 8 ms on an embedded device, or else incoming chunks begin to queue up, causing the processed output audio to be increasingly delayed. This constraint requires a very efficient processing pipeline that keeps up with the incoming audio stream. As the original TFGridNet could not meet these runtime requirements on our embedded CPU, we optimize the above model in several ways, described below, to minimize the inference time.
Caching intermediate outputs. When processing a stream of consecutive chunks, numerous values can be reused to avoid recomputation. We maintain these values as a list of model state buffers that we pass as an input to the model, in addition to the input signal and the target embedding, at every inference. For example, we can avoid recomputing the STFT frames of prior chunks by caching these values and reusing them when computing the output of the first 2D convolution layer, as shown in Fig. 3(c). Likewise, we store the output of the sequence of GridNet blocks from previous chunks and use it to compute the 2D deconvolution more quickly. Additionally, since computing the ISTFT for the current chunk also uses information from previous chunks, we maintain a buffer for the intermediate outputs of this 2D deconvolution layer. Furthermore, we maintain the hidden and cell states of the temporal unidirectional LSTM for every GridNet block, which allows us to fully exploit the long-term receptive field of the recurrent network. Finally, for every GridNet block, we also maintain state buffers for previous values of the key and value tensors and concatenate them before unfolding (Fig. 3(b)). We use these buffers to avoid recomputing the linear projections of previous frames.
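The resulting streaming pattern can be sketched as follows; the buffer names and shapes are schematic, and `model` stands in for the exported network:

```python
import numpy as np

def init_state(num_blocks=3, freq_bins=97, emb_dim=64, hidden_dim=64, attn_ctx=49):
    """Illustrative state buffers carried across 8 ms chunks; the names and
    shapes are schematic rather than the exact implementation."""
    state = {
        # Past STFT frames reused when computing the first 2D convolution.
        "stft_frames": np.zeros((2, 2, freq_bins), dtype=np.float32),
        # Past GridNet-block outputs and deconvolution outputs needed for the
        # 2D deconvolution and the ISTFT overlap-and-add.
        "gridnet_out": np.zeros((2, emb_dim, freq_bins), dtype=np.float32),
        "deconv_out": np.zeros((1, 2, freq_bins), dtype=np.float32),
    }
    for b in range(num_blocks):
        # Hidden/cell states of the temporal unidirectional LSTM per block.
        state[f"lstm_h_{b}"] = np.zeros((1, hidden_dim), dtype=np.float32)
        state[f"lstm_c_{b}"] = np.zeros((1, hidden_dim), dtype=np.float32)
        # Cached key/value projections for the chunked causal attention.
        state[f"attn_k_{b}"] = np.zeros((attn_ctx, emb_dim), dtype=np.float32)
        state[f"attn_v_{b}"] = np.zeros((attn_ctx, emb_dim), dtype=np.float32)
    return state

def run_stream(model, chunks, target_embedding):
    """Thread the state buffers through every call; `model` is a placeholder
    for the exported TSH network and returns (output chunk, updated state)."""
    state = init_state()
    outputs = []
    for chunk in chunks:
        out_chunk, state = model(chunk, target_embedding, state)
        outputs.append(out_chunk)
    return np.concatenate(outputs, axis=-1)
```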
ONNX-specific optimizations. Our goal is to deploy the network using ONNX Runtime [14]. To do this, we rewrite certain parts of the network to be more suitable for this setup. First, since we set both I and J to 1, we can remove the unfolding layers before the recurrent intra-frame and sub-band modules entirely. Additionally, we replace all convolution and deconvolution layers, which now have a fixed kernel size and stride of 1, with linear layers that convert to simpler matrix-multiplication kernels in ONNX. We also modify the layer normalization modules to use the native PyTorch implementation, which newer ONNX converters can readily fuse into a single kernel, reducing the overhead of multiple kernel calls. Finally, we rewrite the multi-head attention layer, which was implemented as a for-loop computing the key, query and value tensors for each head, as a single block for each of these tensors, and reshape the output appropriately. This reduces the overall number of nodes in the ONNX graph.
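A rough sketch of this deployment path is shown below; `model` and `example_inputs` are placeholders for the causal TSH network and one example (chunk, embedding, state) tuple, and the tensor names are illustrative:

```python
import torch
import onnxruntime as ort

# `model` and `example_inputs` are placeholders: the causal TSH network after
# the rewrites above, and example tensors for one chunk, the speaker
# embedding, and the flattened state buffers.
torch.onnx.export(
    model,
    example_inputs,
    "tsh_streaming.onnx",
    input_names=["chunk", "embedding", "state"],
    output_names=["out_chunk", "next_state"],
    opset_version=17,
)

# On-device inference with ONNX Runtime on the embedded CPU.
session = ort.InferenceSession("tsh_streaming.onnx",
                               providers=["CPUExecutionProvider"])
# Each call consumes one chunk plus the cached state and returns the processed
# chunk and the updated state:
# out_chunk, next_state = session.run(None, {"chunk": ..., "embedding": ..., "state": ...})
```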
Reducing model size. Instead of using the hyper-parameters suggested in [68], we choose a hyper-parameter setting that produces a smaller, faster model. In this work, we use D = 64, B = 3, H = 64, I = 1, J = 1, L = 4 and E = 6. The resulting model has a total of 2.04 million parameters.
To condition the network with the speaker embedding obtained during enrollment, we use a simple linear layer followed by layer normalization to compute a common 64 × 97 conditioning vector for all time chunks, which we multiply with the latent audio representation between the first and second GridNet blocks.
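A sketch of this conditioning step is shown below; the class name and the assumed [batch, channels, frequency, time] layout of the latent tensor are illustrative:

```python
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    """Project the 256-d d-vector to a 64 x 97 conditioning tensor and apply it
    multiplicatively to the audio latent between GridNet blocks 1 and 2."""
    def __init__(self, emb_dim=256, channels=64, n_freq=97):
        super().__init__()
        self.proj = nn.Linear(emb_dim, channels * n_freq)
        self.norm = nn.LayerNorm(channels * n_freq)
        self.channels, self.n_freq = channels, n_freq

    def forward(self, latent, d_vector):
        # latent: [batch, channels, n_freq, time]; d_vector: [batch, 256].
        cond = self.norm(self.proj(d_vector))                # [batch, channels * n_freq]
        cond = cond.view(-1, self.channels, self.n_freq, 1)  # shared across all time frames
        return latent * cond
```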
3.3 Training for real-world generalization
We train our target speech hearing system in two steps. We first train the enrollment networks to estimate d-vector embeddings. We then separately train the target speech hearing model while conditioning it on reference d-vector embeddings. This approach allows us to use the same target speech hearing model with any enrollment model that can estimate d-vector embeddings. We train these models with a dataset that accurately represents real-world use cases of a target speech hearing system. Specifically, we consider variations in speech characteristics, acoustic transformations caused by physical multipath environments, acoustic transformations caused by the human head-related transfer function (HRTF), and diverse background noise. We also account for the effects of motion of the speakers and noise sources relative to the listener in an additional finetuning step. Below, we describe the dataset, followed by the training process for the enrollment and target speech hearing networks.
Synthetic dataset. Each training sample in our dataset corresponds to an acoustic scene comprising 2-3 speech samples and background noise. To create an acoustic scene, we first sample a 5 second background noise sample and then overlay the target speech and interfering speech at random start positions. To obtain target and interfering speech, we randomly select 2-3 speakers from the LibriSpeech dataset [46] and select a speech sample of length 2-5 s for each speaker. The enrollment signals are generated using the same approach. We used the train-clean-360 split of the LibriSpeech dataset, which comprises 360 hours of clean speech from 439 female and 482 male speakers. We further select random noise samples from the WHAMR! dataset [69], a database of audio recordings of real-world noisy environments. These audio samples, however, do not contain the effects of real-world indoor environments and human heads, which we found to be important for extracting natural-sounding audio.
Accounting for multipath and HRTF. To account for these effects, we convolve each of the speech samples and the background noise with a binaural room impulse response (BRIR) that captures the acoustic transformations caused by a room as well as a user’s head and torso. Let \(h_{r, \theta , \phi }\) be a BRIR corresponding to the room and subject combination \(r\), at azimuthal angle \(\theta \) and polar angle \(\phi \) with respect to the subject’s head. Let \(S_0(t) \in \mathbb {R}\) and \(S_1(t) \in \mathbb {R}\) be two mono clean speech mixtures sampled from the LibriSpeech dataset, and \(V(t)\) be noise sampled from the WHAMR! dataset. Then the binaural acoustic scene \(x(t) \in \mathbb {R}^2\) for this source mixture can be computed as \(x(t) = S_{0}(t) * h_{r, \theta _{0},\phi _{0}} + S_{1}(t) * h_{r, \theta _{1},\phi _{1}} + V(t) * h_{r, \theta _{v},\phi _{v}}\), where each source is convolved with the BRIR at its respective angles and the results are summed.
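A minimal sketch of this spatialization step, assuming each BRIR is stored as a 2-channel impulse response (function names are illustrative):

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(source, brir):
    """Convolve a mono source [T] with a binaural room impulse response
    [2, L] to obtain a 2-channel signal [2, T + L - 1]."""
    return np.stack([fftconvolve(source, brir[ch]) for ch in range(2)])

def mix_scene(sources, brirs, noise, noise_brir):
    """Sum spatialized speech sources and background noise into a binaural
    acoustic scene x(t); signals are truncated to the shortest length."""
    rendered = [spatialize(s, h) for s, h in zip(sources, brirs)]
    rendered.append(spatialize(noise, noise_brir))
    T = min(r.shape[1] for r in rendered)
    return sum(r[:, :T] for r in rendered)
```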
Note that within each acoustic scene, the room and subject configuration \(r\) remains the same for all sources, but the angles with respect to the listener are arbitrary. To improve robustness to variations in rooms and subjects, we aggregate BRIRs from 4 different datasets: CIPIC [3], RRBRIR [28], ASH-Listening-Set [58] and CATTRIR [29]. Of these, the CIPIC dataset comprises only impulse responses measured in an anechoic chamber and is therefore devoid of any room characteristics. Combined, these datasets provide a total of 77 different room and subject configurations.
Training. To train the enrollment networks, we first generate the component speech utterances, as described above, with the constraint that the target speaker’s azimuthal angle \(\theta _{0} \sim \frac{\pi }{2}\). We train the beamformer-based enrollment network to predict the target speech (Fig. 2a) with an SNR loss, and the knowledge-distillation-based enrollment network to predict the d-vector embedding of the target speech (Fig. 2b) with a cosine-similarity loss.
To train the target speech hearing (TSH) network, we additionally sample a random speech utterance of the target speaker and convolve it with a BRIR corresponding to the same room and subject configuration. We input to the TSH model the acoustic scene and the d-vector embedding computed on this sample utterance. We then optimize the TSH network, \(\mathcal {T}\), to minimize the signal-to-noise ratio (SNR) [54] loss between the estimated target speech and the ground truth: \(-\text{SNR}(\hat{s}_0(t), s_0(t))\).
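Here, SNR follows the standard definition, \(\text{SNR}(\hat{s}, s) = 10 \log _{10} \frac{\Vert s \Vert ^2}{\Vert s - \hat{s} \Vert ^2}\).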
Finetuning for motion, error in the enrollment angle and real-world noise characteristics. In the dataset setup described above, we assumed a constant azimuthal angle for each source over time. This means that sources are stationary with respect to the listener’s orientation, and that the enrollment angle is close to \(\frac{\pi }{2}\) and does not change with time. These assumptions, however, do not hold in the real world, as sources may move or the listener’s head may rotate, resulting in significant relative angular velocities.
We handle relative motion and time-varying error in the enrollment angle with an additional finetuning step. During finetuning, we make the azimuthal and polar angles time-varying. We simulate motion by generating an array of positions over time with a finite time step of 25 ms. For enrollment, we assume that at each time step both the enrollment azimuth and the enrollment polar angle are uniformly random in the range \([\frac{\pi }{2} - \frac{\pi }{10}, \frac{\pi }{2} + \frac{\pi }{10}]\), accounting for a maximum error of 18 degrees. For the remaining sources (interfering sources in the enrollment acoustic scene, and both interfering and target sources in the input to the TSH model), we generate random speaker motion by triggering speaker motion events at each time step with a probability of 0.025. When a speaker motion event is triggered, we sample a pair of angular velocities along the polar and azimuthal directions with magnitudes uniformly distributed in the range \([\frac{\pi }{6}, \frac{\pi }{2}]\) rad/s. The speaker moves with this velocity for a random duration uniformly sampled from [0.1, 1] s, during which we do not trigger any other motion events. This creates trajectories where a speaker may be stationary for some time intervals and sporadically move with different velocities within the same audio clip. Assuming such time-varying trajectories, the computation of an enrollment scene can be written as:
\(e(t^{\prime }) = S_{0}(t^{\prime }) * h_{r, \theta _{0}(t^{\prime }),\phi _{0}(t^{\prime })} + S_{e1}(t^{\prime }) * h_{r, \theta _{e1}(t^{\prime }),\phi _{e1}(t^{\prime })} + V_e(t^{\prime }) * h_{r, \theta _{ve}(t^{\prime }),\phi _{ve}(t^{\prime })}\), where \(\theta _{0}(t^{\prime }), \phi _{0}(t^{\prime }) \in [\frac{\pi }{2} - \frac{\pi }{10}, \frac{\pi }{2} + \frac{\pi }{10}]\). The input scene is computed analogously, \(x(t) = S_{0}(t) * h_{r, \theta _{0}(t),\phi _{0}(t)} + S_{1}(t) * h_{r, \theta _{1}(t),\phi _{1}(t)} + V(t) * h_{r, \theta _{v}(t),\phi _{v}(t)}\), with all angles following the time-varying trajectories described above.
Since the BRIR datasets only provide impulse responses at discrete points in space, it is not possible to directly perform the computation described in the expressions above. To approximate such trajectories with the available BRIR datasets, we employ a nearest-neighbor approximation: at each time step, we select the BRIR in the dataset that is closest to the desired azimuth and polar angle. We use the Steam Audio SDK [57] to perform this motion trajectory simulation. During the finetuning step, we use all four BRIR datasets described above, but perform motion simulation only with the CIPIC database [3], because CIPIC provides BRIRs that are reasonably dense across all azimuth and polar angles, whereas the other BRIR datasets sparsely vary only the azimuthal angle at a fixed elevation.
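The event-driven trajectory generation and nearest-neighbor BRIR selection can be sketched as follows; the function names are illustrative, the signs of the angular velocities are an assumption, and in practice the simulation itself is handled by the Steam Audio SDK:

```python
import numpy as np

def simulate_trajectory(duration_s, azim0, polar0, rng=np.random.default_rng(),
                        dt=0.025, p_event=0.025, v_range=(np.pi / 6, np.pi / 2)):
    """Generate a time-varying (azimuth, polar) trajectory: at each 25 ms step a
    motion event starts with probability 0.025, picks angular velocities with
    magnitudes in [pi/6, pi/2] rad/s, and lasts a random 0.1-1 s."""
    steps = int(duration_s / dt)
    azim, polar = np.empty(steps), np.empty(steps)
    az, po = azim0, polar0
    v_az = v_po = 0.0
    remaining = 0  # steps left in the current motion event
    for i in range(steps):
        if remaining == 0 and rng.random() < p_event:
            v_az, v_po = rng.choice([-1, 1], 2) * rng.uniform(*v_range, 2)
            remaining = int(rng.uniform(0.1, 1.0) / dt)
        if remaining > 0:
            az += v_az * dt
            po += v_po * dt
            remaining -= 1
        azim[i], polar[i] = az, po
    return azim, polar

def nearest_brir_indices(azim, polar, grid_az, grid_po):
    """Pick, at each step, the index of the measured BRIR whose grid point is
    closest to the desired (azimuth, polar) angles."""
    d = (azim[:, None] - grid_az[None, :]) ** 2 + (polar[:, None] - grid_po[None, :]) ** 2
    return d.argmin(axis=1)
```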
Finally, to allow the model to learn common noise characteristics found in the real world, such as microphone thermal noise and constant noise from heating, ventilation and air conditioning systems, we also train the model with randomly scaled white, pink and brown noise components. Specifically, during training, we augment the mixture signal with a white noise signal whose standard deviation is uniformly chosen from the range [0, 0.002). For the pink and brown noise, we use the powerlaw_psd_gaussian function from the Python colorednoise library to generate the noise signals, and scale them with separate scale factors, each sampled uniformly from [0, 0.05).
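A minimal sketch of this augmentation, assuming a binaural mixture array of shape [2, samples] (the function name is illustrative):

```python
import numpy as np
import colorednoise as cn

def add_noise_floor(mixture, rng=np.random.default_rng()):
    """Augment a binaural mixture [2, samples] with randomly scaled white,
    pink and brown noise, following the ranges described above."""
    white = rng.normal(0.0, rng.uniform(0.0, 0.002), size=mixture.shape)
    # Exponent 1 gives pink (1/f) noise, exponent 2 gives brown (1/f^2) noise.
    pink = rng.uniform(0.0, 0.05) * cn.powerlaw_psd_gaussian(1, mixture.shape)
    brown = rng.uniform(0.0, 0.05) * cn.powerlaw_psd_gaussian(2, mixture.shape)
    return mixture + white + pink + brown

# Example: augment a 5 s binaural mixture at 16 kHz.
augmented = add_noise_floor(np.zeros((2, 5 * 16000), dtype=np.float32))
```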