Our key observation is that for hearable applications of deep learning-based target speech extraction [21, 70, 72], it is often impractical to obtain a clean speech sample of the target speaker. In this work, we propose a target speech hearing (TSH) system suitable for binaural hearables applications that provides an interface for noisy in-the-wild speech samples, which we refer to as noisy enrollments. A noisy enrollment of a speaker of interest would contain two kinds of noise: uncorrelated background noise and interfering speech. While the background noise can be suppressed with existing methods [26, 50], it is challenging to disambiguate and suppress interfering speech without suppressing the target speech itself, especially when the number of speakers in the scene can be arbitrary. More fundamentally, in a mixture of multiple speakers, it is challenging to know which of them is the intended target speaker.
Our system achieves this disambiguation by leveraging the beamforming capability of binaural hearables. Assuming that the listener looks at the target speaker for at least a few seconds, we propose that the listener use this phase to enroll the speaker they want to focus on, signaling the hearable through an on-device haptic control or a button click in the phone application. During this phase, since the direct path of the target speaker is equidistant from both ears of the binaural hearable, the application can disambiguate between the target and interfering speakers to obtain a representation of the target speaker.
Here, \(s_0 \in \mathbb {R}^2\) corresponds to the target speaker, \(s_{e1}, \dots, s_{em} \in \mathbb {R}^2\) correspond to interfering speakers during the enrollment phase, and \(s_1, \dots, s_n\) correspond to interfering speakers during the TSH phase. Note that the interfering speakers can be the same or different during the two phases. \(v_e(t^{\prime })\) and \(v(t)\) represent background noises in the respective phases. Additionally, let \(\theta _0\) represent the azimuthal angle of the target speaker relative to the listener. During the enrollment phase, to disambiguate the target speaker in the noisy enrollment signal, we exploit the fact that the user looks in the direction of the target speaker and assume that \(\theta _0(t^{\prime }) \sim \frac{\pi }{2}\), where the x-axis is assumed to pass from the listener’s left to right ear with the midpoint as the origin. We then formulate the TSH problem as a two-step process: an enrollment step, in which the noisy enrollment network \(\mathcal {N}\) estimates the target speaker embedding \(\hat{\epsilon }_0\) from the noisy binaural enrollment signal \(e(t^{\prime })\), and a target speech hearing step, in which the TSH network \(\mathcal {T}\), conditioned on \(\hat{\epsilon }_0\), extracts the target speech \(\hat{s}_0(t)\) from the binaural input \(x(t)\).
3.1 Enrollment interface network
The quality of the target speech extracted by the target speech hearing network, \(\mathcal {T}\), depends critically on the discriminative quality of the speaker representation, \(\epsilon _0\), provided to it. To robustly handle diverse speech characteristics, we leverage speaker representations computed by large-scale pre-trained models such as [34, 65]. In this work, we use the open-source implementation of [65] in the Resemblyzer project [49]. Given a clean speech utterance of a speaker \(s_i(t^{\prime })\), [49] uses a long short-term memory (LSTM) network, \(\mathcal {D}\), to map the utterance to a unit-length 256-dimensional vector \(\mathcal {D}(s_i(t^{\prime })) = \epsilon _i\), where \(\epsilon _i \in \mathbb {R}^{256}\) and \(\Vert \epsilon _i \Vert _2 = 1\), referred to as a d-vector embedding. During training, the LSTM model computes d-vectors optimized such that the embedding of an utterance is closest to the centroid of the embeddings of all other utterances of the same speaker, while simultaneously maximizing its distance from the centroids of all other speakers in the large-scale speech database used as the training set. In this work, we use d-vector embeddings as the reference speaker representations that the noisy enrollment network \(\mathcal {N}\) should predict, using two approaches.
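For reference, computing such a d-vector with the Resemblyzer package reduces to a few lines; this is a minimal sketch, with the file path being illustrative:

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

# Load and normalize an enrollment utterance (the path is illustrative).
wav = preprocess_wav("enrollment_utterance.wav")

# The pre-trained LSTM d-vector model, i.e. the network D above.
encoder = VoiceEncoder()

# 256-dimensional, unit-length speaker embedding (the d-vector epsilon_i).
embedding = encoder.embed_utterance(wav)
print(embedding.shape, np.linalg.norm(embedding))  # (256,), ~1.0
```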
Noisy enrollment with beamforming. Following the notation in §3, the d-vector embedding of the target speaker can be obtained from a clean speech example as \(\epsilon _0 = \mathcal {D}(s_0(t^{\prime }))\). If we could estimate the clean speech of the target speaker, given that the target speaker is at azimuthal angle \(\theta _0 \sim \frac{\pi }{2}\), we could estimate the corresponding d-vector embedding. Essentially, this is equivalent to beamforming with the steering direction set to an azimuthal angle of \(\frac{\pi }{2}\). In this work, we follow the delay and process approach proposed in several beamforming works [12, 30, 66], where, given a target direction and a reference microphone, inputs from the other microphones are delayed according to the time the direct path from that direction takes to reach them relative to the reference microphone. In our case, since the direct path is equidistant from the left and right microphones, processing the raw inputs is sufficient to obtain the target speaker. Denoting the beamforming network as \(\mathcal {B}\) and the noisy binaural enrollment signal as \(e(t^{\prime })\), the process of noisy enrollment with beamforming can be written as \(\hat{\epsilon }_0 = \mathcal {D}(\mathcal {B}(e(t^{\prime })))\).
In this work, we use the state-of-the-art speech separation architecture TFGridNet [68] as our beamforming architecture \(\mathcal {B}\). Since enrollment is a one-time operation that does not need to be performed on-device, we can use the original non-causal implementation of TFGridNet [68] available in the ESPnet [38] framework. Following the notation in [68], we use the configuration D = 64, B = 3, H = 64, I = 4, J = 1, L = 4 and E = 8, with the short-time Fourier transform (STFT) window size set to 128 and the hop size set to 64.
Noisy enrollment with knowledge distillation. Alternatively, we can have the noisy enrollment network, \(\mathcal {N}\), directly compute the estimated d-vector embedding of the target speaker, \(\hat{\epsilon }_0 = \mathcal {N}(e(t^{\prime }))\), given the noisy speech. This would, however, require a resource-intensive training process like the one proposed in [65]. To do this efficiently, we train the enrollment network \(\mathcal {N}\) using knowledge distillation [5, 25], where the original d-vector model, \(\mathcal {D}\), provides d-vector embeddings computed on clean target speech as ground-truth references. Note that during the training phase we have access to the clean target enrollment speech \(s_0(t^{\prime })\), but we do not assume this during inference. We train the noisy enrollment network \(\mathcal {N}\) to minimize the loss function \(\mathcal {L}(\hat{\epsilon }_0, {\epsilon }_0)\) between the predicted embedding and the reference \(\epsilon _0 = \mathcal {D}(s_0(t^{\prime }))\); concretely, we use the cosine-similarity loss (§3.3).
To make our two noisy enrollment approaches comparable, we use TFGridNet [68] with the same configuration as above as the noisy enrollment network \(\mathcal {N}\) in this approach as well. We modify the architecture to output a 256-dimensional embedding instead of an audio waveform, as shown in Fig. 2b.
3.2 Real-time target speech hearing system
Now that we have an embedding for the target speaker that captures the desired speech traits, our goal is to design a network that can perform target speech hearing in real time on an embedded CPU while achieving an end-to-end latency of less than 20 ms. This end-to-end latency is measured as the time it takes a single sample to pass from the microphone input buffer through our target speech extraction framework and be copied into the headphone speaker output buffer, as shown in Fig. 3(a). We process the input audio in chunks of 8 ms. Moreover, our system utilizes an additional 4 ms of future audio samples to predict the processed output for the current 8 ms chunk. In other words, we must wait at least 12 ms before we can begin processing the first sample in a particular chunk. This also means that our algorithm must be designed such that the first sample in an audio chunk does not take into account any information beyond 12 ms into the future, as this information will not be available in practice.
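To see how this fits the latency budget, consider the worst case of the first sample in a chunk: it waits 8 ms for its chunk to fill and 4 ms for the lookahead, so as long as the per-chunk compute stays within the 8 ms budget discussed below, its total delay remains within roughly 8 ms + 4 ms + 8 ms = 20 ms.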
We design our target speech hearing network by starting with a state-of-the-art speech separation network, namely TFGridNet [68]. However, as this network is non-causal, we adapt the implementation of TFGridNet into a causal version with an algorithmic latency of only 12 ms. To do this, we first remove the group normalization after the first 2D convolution. We then replace the bidirectional sub-band inter-frame LSTM block with a unidirectional LSTM, and fix the unfolding kernel and stride sizes, the hyper-parameters I and J, respectively, in both recurrent modules to 1. Additionally, instead of computing causal attention using causal masks, inspired by prior work [63], we first unfold the key and value tensors into independent fixed-size chunks using a kernel size of 50 and a stride of 1. We then compute an attention matrix for every chunk between the key tensor and a single-frame query tensor corresponding to the last (rightmost) frame in the chunk, as illustrated in Fig. 3(b). This attention matrix contains the multiplicative weights applied to the corresponding frames in the value tensor to obtain the final output. This ensures that when we predict the output for a single frame, we only attend to the 50 frames that arrive with or before it. Although this limits how far into the past the attention layer looks, it is necessary to allow the network to efficiently process long time sequences on-device. We choose an STFT window length of 192 samples, or 12 ms at 16 kHz, and a hop length of 128 samples, or 8 ms at 16 kHz. When computing the ISTFT, we trim the last 4 ms of audio, as these samples will be affected by future chunks during the overlap-and-add operation of the ISTFT. Thus, our output is 4 ms shorter than the input; that is, we obtain an 8 ms output from a 12 ms input. For the TSH model, we use the asteroid [47] implementation of the STFT.
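A minimal single-head sketch of this chunked causal attention is shown below; batch and head dimensions are omitted, the function name is illustrative, and in the streaming implementation the zero padding would be replaced by the cached key/value frames of previous chunks:

```python
import torch
import torch.nn.functional as F

def chunked_causal_attention(q, k, v, chunk_size=50):
    """Sliding-window causal attention: the output for frame t attends only to
    frame t and the (chunk_size - 1) frames before it.
    q, k, v: [T, E] tensors (single head; batch dimension omitted)."""
    T, E = k.shape
    # Left-pad keys/values so early frames also see a full window; in the
    # streaming system this padding is replaced by cached past frames.
    k_pad = F.pad(k, (0, 0, chunk_size - 1, 0))
    v_pad = F.pad(v, (0, 0, chunk_size - 1, 0))
    # Unfold into overlapping windows with stride 1: [T, chunk_size, E].
    k_chunks = k_pad.unfold(0, chunk_size, 1).permute(0, 2, 1)
    v_chunks = v_pad.unfold(0, chunk_size, 1).permute(0, 2, 1)
    # One single-frame query per chunk: the last (rightmost) frame of the window.
    scores = torch.einsum("te,tce->tc", q, k_chunks) / E ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return torch.einsum("tc,tce->te", weights, v_chunks)  # [T, E]
```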
Once we copy a chunk from the audio buffer into memory, we can begin processing the audio. Since we process audio 8 ms at a time, we need to ensure that each chunk is processed in at most 8 ms on an embedded device, or else incoming chunks begin to queue up, causing the processed output audio to be increasingly delayed. This constraint requires a very efficient processing pipeline that keeps up with the incoming audio stream. As the original TFGridNet could not meet these runtime requirements on our embedded CPU, we optimize the above model in several ways, described below, to minimize the inference time.
Caching intermediate outputs. When processing a stream of consecutive chunks, numerous values can be reused to avoid recomputation. We maintain these values as a list of model state buffers that we pass as an input to the model, in addition to the input signal and the target embedding, at every inference. For example, we can avoid recomputing the STFT frames of prior chunks by caching these values and reusing them when computing the output of the first 2D convolution layer, as shown in Fig. 3(c). Likewise, we store the output of the sequence of GridNet blocks from previous chunks and use it to compute the 2D deconvolution more quickly. Additionally, since computing the ISTFT for the current chunk also uses information from previous chunks, we maintain a buffer for the intermediate outputs of this 2D deconvolution layer. Furthermore, we maintain the hidden and cell states of the temporal unidirectional LSTM for every GridNet block, which allows us to fully exploit the long-term receptive field of the recurrent network. Finally, for every GridNet block, we also maintain state buffers for previous values of the key and value tensors and concatenate them before unfolding (Fig. 3(b)). We use these buffers to avoid recomputing the linear projections of previous frames.
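The resulting streaming pattern can be sketched as follows; the buffer names and shapes are schematic, and `model` stands in for the exported network:

```python
import numpy as np

def init_state(num_blocks=3, freq_bins=97, emb_dim=64, hidden_dim=64, attn_ctx=49):
    """Illustrative state buffers carried across 8 ms chunks; the names and
    shapes are schematic rather than the exact implementation."""
    state = {
        # Past STFT frames reused when computing the first 2D convolution.
        "stft_frames": np.zeros((2, 2, freq_bins), dtype=np.float32),
        # Past GridNet-block outputs and deconvolution outputs needed for the
        # 2D deconvolution and the ISTFT overlap-and-add.
        "gridnet_out": np.zeros((2, emb_dim, freq_bins), dtype=np.float32),
        "deconv_out": np.zeros((1, 2, freq_bins), dtype=np.float32),
    }
    for b in range(num_blocks):
        # Hidden/cell states of the temporal unidirectional LSTM per block.
        state[f"lstm_h_{b}"] = np.zeros((1, hidden_dim), dtype=np.float32)
        state[f"lstm_c_{b}"] = np.zeros((1, hidden_dim), dtype=np.float32)
        # Cached key/value projections for the chunked causal attention.
        state[f"attn_k_{b}"] = np.zeros((attn_ctx, emb_dim), dtype=np.float32)
        state[f"attn_v_{b}"] = np.zeros((attn_ctx, emb_dim), dtype=np.float32)
    return state

def run_stream(model, chunks, target_embedding):
    """Thread the state buffers through every call; `model` is a placeholder
    for the exported TSH network and returns (output chunk, updated state)."""
    state = init_state()
    outputs = []
    for chunk in chunks:
        out_chunk, state = model(chunk, target_embedding, state)
        outputs.append(out_chunk)
    return np.concatenate(outputs, axis=-1)
```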
ONNX-specific optimizations. Our goal is to deploy the network using ONNX Runtime [14]. To do this, we rewrite certain parts of the network to be more suitable for this setup. First, since we set both I and J to 1, we can remove the unfolding layers before the recurrent intra-frame and sub-band modules entirely. Additionally, we replace all convolution and deconvolution layers, which now have a fixed kernel size and stride of 1, with linear layers that convert to simpler matrix-multiplication kernels in ONNX. We also modify the layer normalization modules to use the native PyTorch implementation, which newer ONNX converters can readily fuse into a single kernel, reducing the overhead of multiple kernel calls. Finally, we rewrite the multi-head attention layer, which was implemented as a for-loop computing the key, query and value tensors for each head, as a single block for each of these tensors, and reshape the output appropriately. This reduces the overall number of nodes in the ONNX graph.
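A rough sketch of this deployment path is shown below; `model` and `example_inputs` are placeholders for the causal TSH network and one example (chunk, embedding, state) tuple, and the tensor names are illustrative:

```python
import torch
import onnxruntime as ort

# `model` and `example_inputs` are placeholders: the causal TSH network after
# the rewrites above, and example tensors for one chunk, the speaker
# embedding, and the flattened state buffers.
torch.onnx.export(
    model,
    example_inputs,
    "tsh_streaming.onnx",
    input_names=["chunk", "embedding", "state"],
    output_names=["out_chunk", "next_state"],
    opset_version=17,
)

# On-device inference with ONNX Runtime on the embedded CPU.
session = ort.InferenceSession("tsh_streaming.onnx",
                               providers=["CPUExecutionProvider"])
# Each call consumes one chunk plus the cached state and returns the processed
# chunk and the updated state:
# out_chunk, next_state = session.run(None, {"chunk": ..., "embedding": ..., "state": ...})
```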
Reducing model size. Instead of using the hyper-parameters suggested in [68], we choose a hyper-parameter setting that produces a smaller, faster model. In this work, we use D = 64, B = 3, H = 64, I = 1, J = 1, L = 4 and E = 6. The resulting model has a total of 2.04 million parameters.
To condition the network with the speaker embedding obtained during enrollment, we use a simple linear layer followed by layer normalization to compute a common 64 × 97 conditioning vector for all time chunks, which we multiply with the latent audio representation between the first and second GridNet blocks.
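A sketch of this conditioning step is shown below; the class name and the assumed [batch, channels, frequency, time] layout of the latent tensor are illustrative:

```python
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    """Project the 256-d d-vector to a 64 x 97 conditioning tensor and apply it
    multiplicatively to the audio latent between GridNet blocks 1 and 2."""
    def __init__(self, emb_dim=256, channels=64, n_freq=97):
        super().__init__()
        self.proj = nn.Linear(emb_dim, channels * n_freq)
        self.norm = nn.LayerNorm(channels * n_freq)
        self.channels, self.n_freq = channels, n_freq

    def forward(self, latent, d_vector):
        # latent: [batch, channels, n_freq, time]; d_vector: [batch, 256].
        cond = self.norm(self.proj(d_vector))                # [batch, channels * n_freq]
        cond = cond.view(-1, self.channels, self.n_freq, 1)  # shared across all time frames
        return latent * cond
```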
3.3 Training for real-world generalization
We train our target speech hearing system in two steps. We first train the enrollment networks to estimate d-vector embeddings. We then separately train the target speech hearing model while conditioning it on reference d-vector embeddings. This approach allows us to use the same target speech hearing model with any enrollment model that can estimate d-vector embeddings. We train these models with a dataset that accurately represents real-world use cases of a target speech hearing system. Specifically, we consider variations in speech characteristics, acoustic transformations caused by physical multipath environments, acoustic transformations caused by the human head-related transfer function (HRTF), and diverse background noise. We also account for the effects of motion of the speakers and noise sources relative to the listener in an additional finetuning step. Below, we describe the dataset, followed by the training process for the enrollment and target speech hearing networks.
Synthetic dataset. Each training sample in our dataset corresponds to an acoustic scene comprising 2-3 speech samples and background noise. To create an acoustic scene, we first sample a 5 second background noise sample and then overlay the target speech and interfering speech at random start positions. To obtain target and interfering speech, we randomly select 2-3 speakers from the LibriSpeech dataset [46] and select a speech sample of length 2-5 s for each speaker. The enrollment signals are generated using the same approach. We used the train-clean-360 split of the LibriSpeech dataset, which comprises 360 hours of clean speech from 439 female and 482 male speakers. We further select random noise samples from the WHAMR! dataset [69], a database of audio recordings of real-world noisy environments. These audio samples, however, do not contain the effects of real-world indoor environments and human heads, which we found to be important for extracting natural-sounding audio.
Accounting for multipath and HRTF. To account for these effects, we convolve each of the speech samples and the background noise with a binaural room impulse response (BRIR) that captures the acoustic transformations caused by a room as well as a user’s head and torso. Let \(h_{r, \theta , \phi }\) be a BRIR corresponding to the room and subject combination \(r\), at azimuthal angle \(\theta \) and polar angle \(\phi \) with respect to the subject’s head. Let \(S_0(t) \in \mathbb {R}\) and \(S_1(t) \in \mathbb {R}\) be two mono clean speech mixtures sampled from the LibriSpeech dataset, and \(V(t)\) be noise sampled from the WHAMR! dataset. Then the binaural acoustic scene \(x(t) \in \mathbb {R}^2\) for this source mixture can be computed as \(x(t) = S_{0}(t) * h_{r, \theta _{0},\phi _{0}} + S_{1}(t) * h_{r, \theta _{1},\phi _{1}} + V(t) * h_{r, \theta _{v},\phi _{v}}\), where each source is convolved with the BRIR at its respective angles and the results are summed.
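A minimal sketch of this spatialization step, assuming each BRIR is stored as a 2-channel impulse response (function names are illustrative):

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(source, brir):
    """Convolve a mono source [T] with a binaural room impulse response
    [2, L] to obtain a 2-channel signal [2, T + L - 1]."""
    return np.stack([fftconvolve(source, brir[ch]) for ch in range(2)])

def mix_scene(sources, brirs, noise, noise_brir):
    """Sum spatialized speech sources and background noise into a binaural
    acoustic scene x(t); signals are truncated to the shortest length."""
    rendered = [spatialize(s, h) for s, h in zip(sources, brirs)]
    rendered.append(spatialize(noise, noise_brir))
    T = min(r.shape[1] for r in rendered)
    return sum(r[:, :T] for r in rendered)
```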
Note that within each acoustic scene, the room and subject configuration \(r\) remains the same for all sources, but the angles with respect to the listener are arbitrary. To improve robustness to variations in rooms and subjects, we aggregate BRIRs from 4 different datasets: CIPIC [3], RRBRIR [28], ASH-Listening-Set [58] and CATTRIR [29]. Of these, the CIPIC dataset comprises only impulse responses measured in an anechoic chamber and is therefore devoid of any room characteristics. Combined, these datasets provide a total of 77 different room and subject configurations.
Training. To train the enrollment networks, we first generate the component speech utterances, as described above, with the constraint that the target speaker’s azimuthal angle \(\theta _{0} \sim \frac{\pi }{2}\). We train the beamformer-based enrollment network to predict the target speech (Fig. 2a) with an SNR loss, and the knowledge-distillation-based enrollment network to predict the d-vector embedding of the target speech (Fig. 2b) with a cosine-similarity loss.
To train the target speech hearing (TSH) network, we additionally sample a random speech utterance of the target speaker and convolve it with a BRIR corresponding to the same room and subject configuration. We input to the TSH model the acoustic scene and the d-vector embedding computed on this sample utterance. We then optimize the TSH network, \(\mathcal {T}\), to minimize the signal-to-noise ratio (SNR) [54] loss between the estimated target speech and the ground truth: \(-\text{SNR}(\hat{s}_0(t), s_0(t))\).
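Here, SNR follows the standard definition, \(\text{SNR}(\hat{s}, s) = 10 \log _{10} \frac{\Vert s \Vert ^2}{\Vert s - \hat{s} \Vert ^2}\).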
Finetuning for motion, error in the enrollment angle and real-world noise characteristics. In the dataset setup described above, we assumed a constant azimuthal angle for each source over time. This means that sources are stationary with respect to the listener’s orientation, and that the enrollment angle is close to \(\frac{\pi }{2}\) and does not change with time. These assumptions, however, do not hold in the real world, as sources may move or the listener’s head may rotate, resulting in significant relative angular velocities.
We handle relative motion and time-varying error in the enrollment angle with an additional finetuning step. During finetuning, we make the azimuthal and polar angles time-varying. We simulate motion by generating an array of positions over time with a finite time step of 25 ms. For enrollment, we assume that at each time step both the enrollment azimuth and the enrollment polar angle are uniformly random in the range \([\frac{\pi }{2} - \frac{\pi }{10}, \frac{\pi }{2} + \frac{\pi }{10}]\), accounting for a maximum error of 18 degrees. For the remaining sources (interfering sources in the enrollment acoustic scene, and both interfering and target sources in the input to the TSH model), we generate random speaker motion by triggering speaker motion events at each time step with a probability of 0.025. When a speaker motion event is triggered, we sample a pair of angular velocities along the polar and azimuthal directions with magnitudes uniformly distributed in the range \([\frac{\pi }{6}, \frac{\pi }{2}]\) rad/s. The speaker moves with this velocity for a random duration uniformly sampled from [0.1, 1] s, during which we do not trigger any other motion events. This creates trajectories where a speaker may be stationary for some time intervals and sporadically move with different velocities within the same audio clip. Assuming such time-varying trajectories, the computation of an enrollment scene can be written as:
\(e(t^{\prime }) = S_{0}(t^{\prime }) * h_{r, \theta _{0}(t^{\prime }),\phi _{0}(t^{\prime })} + S_{e1}(t^{\prime }) * h_{r, \theta _{e1}(t^{\prime }),\phi _{e1}(t^{\prime })} + V_e(t^{\prime }) * h_{r, \theta _{ve}(t^{\prime }),\phi _{ve}(t^{\prime })}\), where \(\theta _{0}(t^{\prime }), \phi _{0}(t^{\prime }) \in [\frac{\pi }{2} - \frac{\pi }{10}, \frac{\pi }{2} + \frac{\pi }{10}]\). The input scene is computed analogously, \(x(t) = S_{0}(t) * h_{r, \theta _{0}(t),\phi _{0}(t)} + S_{1}(t) * h_{r, \theta _{1}(t),\phi _{1}(t)} + V(t) * h_{r, \theta _{v}(t),\phi _{v}(t)}\), with all angles following the time-varying trajectories described above.
Since the BRIR datasets only provide impulse responses at discrete points in space, it is not possible to directly perform the computation described in the expressions above. To approximate such trajectories with the available BRIR datasets, we employ a nearest-neighbor approximation: at each time step, we select the BRIR in the dataset that is closest to the desired azimuth and polar angle. We use the Steam Audio SDK [57] to perform this motion trajectory simulation. During the finetuning step, we use all four BRIR datasets described above, but perform motion simulation only with the CIPIC database [3], because CIPIC provides BRIRs that are reasonably dense across all azimuth and polar angles, whereas the other BRIR datasets sparsely vary only the azimuthal angle at a fixed elevation.
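The event-driven trajectory generation and nearest-neighbor BRIR selection can be sketched as follows; the function names are illustrative, the signs of the angular velocities are an assumption, and in practice the simulation itself is handled by the Steam Audio SDK:

```python
import numpy as np

def simulate_trajectory(duration_s, azim0, polar0, rng=np.random.default_rng(),
                        dt=0.025, p_event=0.025, v_range=(np.pi / 6, np.pi / 2)):
    """Generate a time-varying (azimuth, polar) trajectory: at each 25 ms step a
    motion event starts with probability 0.025, picks angular velocities with
    magnitudes in [pi/6, pi/2] rad/s, and lasts a random 0.1-1 s."""
    steps = int(duration_s / dt)
    azim, polar = np.empty(steps), np.empty(steps)
    az, po = azim0, polar0
    v_az = v_po = 0.0
    remaining = 0  # steps left in the current motion event
    for i in range(steps):
        if remaining == 0 and rng.random() < p_event:
            v_az, v_po = rng.choice([-1, 1], 2) * rng.uniform(*v_range, 2)
            remaining = int(rng.uniform(0.1, 1.0) / dt)
        if remaining > 0:
            az += v_az * dt
            po += v_po * dt
            remaining -= 1
        azim[i], polar[i] = az, po
    return azim, polar

def nearest_brir_indices(azim, polar, grid_az, grid_po):
    """Pick, at each step, the index of the measured BRIR whose grid point is
    closest to the desired (azimuth, polar) angles."""
    d = (azim[:, None] - grid_az[None, :]) ** 2 + (polar[:, None] - grid_po[None, :]) ** 2
    return d.argmin(axis=1)
```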
Finally, to allow the model to learn common noise characteristics found in the real world, such as microphone thermal noise and constant noise from heating, ventilation and air conditioning systems, we also train the model with randomly scaled white, pink and brown noise components. Specifically, during training, we augment the mixture signal with a white noise signal whose standard deviation is uniformly chosen from the range [0, 0.002). For the pink and brown noise, we use the powerlaw_psd_gaussian function from the Python colorednoise library to generate the noise signals, and scale them with separate scale factors, each sampled uniformly from [0, 0.05).
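A minimal sketch of this augmentation, assuming a binaural mixture array of shape [2, samples] (the function name is illustrative):

```python
import numpy as np
import colorednoise as cn

def add_noise_floor(mixture, rng=np.random.default_rng()):
    """Augment a binaural mixture [2, samples] with randomly scaled white,
    pink and brown noise, following the ranges described above."""
    white = rng.normal(0.0, rng.uniform(0.0, 0.002), size=mixture.shape)
    # Exponent 1 gives pink (1/f) noise, exponent 2 gives brown (1/f^2) noise.
    pink = rng.uniform(0.0, 0.05) * cn.powerlaw_psd_gaussian(1, mixture.shape)
    brown = rng.uniform(0.0, 0.05) * cn.powerlaw_psd_gaussian(2, mixture.shape)
    return mixture + white + pink + brown

# Example: augment a 5 s binaural mixture at 16 kHz.
augmented = add_noise_floor(np.zeros((2, 5 * 16000), dtype=np.float32))
```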