1 Introduction
Silent speech is a promising interaction modality for conveying user intents to a broad spectrum of computing devices, owing to its intuitiveness, efficacy, and privacy-preserving nature. Empowering these devices to decode silent speech effectively transforms them into optical microphones, enabling users to harness the intuitiveness and efficacy of speech without compromising privacy. Despite these merits, the integration of silent speech recognition into contemporary computing devices remains an open-ended research challenge. With silent speech recognition’s unique advantage in privacy, we foresee a future where it could enhance smartwatches, VR/AR glasses, and environmentally deployed IoT devices with speech-based interactions, the capabilities of which have been significantly extended by recent advances in Large Language Models (LLMs). This elimination of tangible interfaces is inviting, and we believe the key lies in improving silent speech recognition to be more robust against environmental variances.
Prior research on silent speech has employed a diverse set of sensors, most notably RGB cameras [4, 42, 51, 54] and ultrasound imaging [24, 25]. In this work, we first identify depth as a new and unique source of sensory information for silent speech. Specifically, we utilized depth sensing to capture high-fidelity depth data in the form of point clouds to reconstruct user speech at both word and sentence levels. Our sensing principle relies on the fact that human faces exhibit distinctive shape changes resulting from movements of the lips, tongue, teeth, and jaw during speech, which manifest as depth data that can be easily and cheaply acquired by depth sensing.
Depth sensing distinguishes itself from other vision sensors by its insensitivity to fluctuations in ambient lighting conditions, unlike RGB cameras, which tend to be more susceptible to such variations. Moreover, depth sensing exhibits consistent recognition accuracy across various skin tones [39], effectively broadening the technique’s appeal to a diverse user demographic. Furthermore, computing devices at different instrumentation locations, such as on-wrist (i.e., smartwatches), on-head (i.e., VR/AR glasses), and in-environment (i.e., IoT devices), receive vastly different sensory information, which poses generalizability challenges to conventional lipreading – models trained at one device location might not generalize well to other locations unless new data is collected for calibration. The adaptability and robustness of depth sensing, thanks to the perspective-invariant nature of depth, make it more promising for addressing this generalizability challenge.
To generalize the utilization of depth sensing across various computing modalities for silent speech recognition, we transformed the depth images into point clouds. This conversion enhances adaptability to different angles and distances between users’ lips and devices. Additionally, we calculated the point normals, which represent the local geometric property of each point determined by its surrounding points [45]. These normals were then concatenated with the corresponding points to serve as input to our deep learning pipeline – PointVSR, which includes signal alignment, point cloud feature extraction, and sequence decoding using the connectionist temporal classification (CTC) loss with 26 English characters, 1 space character, and 1 blank as tokens. We demonstrated the superiority of PointVSR over existing silent speech recognition using RGB videos, with decreases in Character Error Rate (CER) and Word Error Rate (WER) of 3.82% and 4.96% respectively in within-user validation, and 5% and 4.57% in cross-user validation.
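For illustration, the following minimal sketch (not our exact implementation) shows how a depth frame could be back-projected into a point cloud and augmented with estimated point normals. The pinhole intrinsics (fx, fy, cx, cy), the neighborhood parameters, and the use of Open3D are assumptions made only for this example.

```python
import numpy as np
import open3d as o3d  # example library for normal estimation (an assumption, not our toolchain)

def depth_to_points(depth_m, fx, fy, cx, cy):
    """Back-project a depth map (in meters) into an N x 3 point cloud using pinhole intrinsics."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop invalid (zero-depth) pixels

def points_with_normals(points, radius=0.01, max_nn=30):
    """Estimate per-point normals from neighboring points and concatenate them to the coordinates."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    pcd.estimate_normals(
        search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=radius, max_nn=max_nn))
    return np.concatenate([points, np.asarray(pcd.normals)], axis=1)  # N x 6 input features
```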
We conducted in-lab user studies with participants of diverse physiological features and native languages, imitating three real-world device locations: On-Wrist, On-Head, and In-Environment. We performed within-user and cross-user evaluations, the results of which indicated that our system can recognize sets of 30 distinct commands with an accuracy of 91.33% (within-user, SD = 1.44%) and 74.88% (cross-user, SD = 13.47%). Furthermore, we explored the feasibility of recognizing sentences, achieving a WER of 8.06% (within-user) and 29.14% (cross-user), along with a CER of 4.13% (within-user) and 18.28% (cross-user). We examined these results in depth and derived further insights into sources of errors and comparisons with a status-quo technique using RGB videos as inputs. These results signify that our approach surpasses previously attainable capabilities and holds significant promise for the future of silent speech recognition technology.
Below we list our key contributions:
• We identified depth sensing as a new, advantageous information source for silent speech recognition.
• We realized a uniform recognition pipeline using depth information for word and sentence recognition.
• We conducted validations and evaluations at three sensor locations to prove feasibility and superiority.
6 Evaluation
In this section, we assess the performance of our sentence and command recognition. We also compare these results with the outcomes of the state-of-the-art visual speech recognition model that utilizes RGB video data.
6.1 Evaluation Metrics
To evaluate the performance of our pipeline in interpreting spoken sentences from point cloud videos, we used Character Error Rate (CER) and Word Error Rate (WER) as evaluation metrics. CER measures the accuracy of individual character recognition by comparing the predicted sequences against the true sentences; it quantifies the percentage of incorrectly recognized characters relative to the total number of characters in the true sentences. The CER values were calculated using the following equation:

$\mathrm{CER} = \frac{S + D + I}{N}$    (5)

In this equation, S represents the number of substitutions (incorrectly recognized characters), D represents the number of deletions (missed characters), I represents the number of insertions (extra characters recognized in the predicted sequences), and N equals the total number of characters in the true sentences.
WER also evaluates recognition performance but extends the evaluation closer to the application level. Specifically, WER quantifies the percentage of incorrectly recognized words in the whole predicted sequence, calculated using the same Equation 5 but with words as tokens.
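As a concrete reference for Equation 5, the sketch below computes CER and WER from a Levenshtein alignment. The helper names and the example sentence pair are hypothetical and independent of our actual evaluation scripts.

```python
def edit_ops(ref, hyp):
    """Count substitutions, deletions, and insertions needed to turn ref into hyp."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]  # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match or substitution
    # Backtrack to classify one minimal set of edits.
    s = d = ins = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            s += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            d += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return s, d, ins

def error_rate(ref_tokens, hyp_tokens):
    """Equation 5: (S + D + I) / N, where N is the reference length."""
    s, d, i = edit_ops(ref_tokens, hyp_tokens)
    return (s + d + i) / max(len(ref_tokens), 1)

ref, hyp = "turn on the light", "turn on the lite"   # hypothetical example pair
cer = error_rate(list(ref), list(hyp))               # characters as tokens
wer = error_rate(ref.split(), hyp.split())           # words as tokens
```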
Since our model uses characters as tokens for sentence decoding, it may produce words that do not exist in English. Therefore, we refine the raw output of the model using the spelling correction module in the TextBlob package [32], which replaces misspelled words with words that exist in the Project Gutenberg eBook dictionary while maximizing the likelihood of the intended correction. Both CER and WER are calculated on the auto-corrected texts.
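For reference, the correction step amounts to a single TextBlob call; the raw string below is a hypothetical example, and the exact corrections depend on TextBlob's word-frequency model.

```python
from textblob import TextBlob

raw_output = "please turn on the ligt betwen the doors"   # hypothetical raw CTC decoding
corrected = str(TextBlob(raw_output).correct())            # frequency-based spelling correction
```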
6.2 Comparative Evaluation on Sentence Recognition
To verify the feasibility of using depth sensing as a novel information source for visual speech recognition, we used a conventional RGB-based method as a baseline for comparison. We chose the off-the-shelf model from [35], which has achieved state-of-the-art performance on the public visual speech recognition benchmark LRS3 [1]. The model uses a Conformer [16] as the frontend to encode RGB images, along with a hybrid CTC/attention decoder. However, since our model uses a pure CTC architecture, we modified the video model by removing the attention loss from its objective function. This modification kept the two models as similar as possible for a fair comparison, as we mainly focused on investigating 1) which sensing modality could enable more accurate silent speech recognition, and 2) whether our model, which combines 4D point cloud convolutional layers with a learnable transformation network, could exploit information from point clouds efficiently.
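To make the modification concrete, the sketch below shows what the resulting pure-CTC training objective looks like in PyTorch over the 28-token vocabulary. The tensor shapes, the token ordering (blank at index 0), and the batch values are illustrative assumptions rather than the baseline's exact configuration.

```python
import torch
import torch.nn as nn

# 28 tokens: 1 blank (index 0) + 26 letters + 1 space -- ordering assumed for illustration.
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

T, B, C = 150, 4, 28                                   # frames, batch size, vocabulary size
log_probs = torch.randn(T, B, C).log_softmax(dim=-1)   # stand-in for the encoder's frame-level outputs
targets = torch.randint(1, C, (B, 20))                 # character indices of the target sentences
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 20, dtype=torch.long)

# With the attention loss removed, this CTC term is the sole training objective.
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```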
6.2.1 Within-user Performance.
We ran within-user tests to investigate the proposed method’s generalization ability to unseen utterances and phrases. Specifically, we conducted a 5-fold validation. Each fold contained 20% of all utterances from every sensor location of every participant after the collected lists of utterances were shuffled. Since we collected an equal number of utterances from each of the three sensor locations for every participant, each fold contained a balanced proportion of utterances from all sensor locations, both per participant and overall. Notably, no utterance was duplicated across the training and testing datasets. Following the same within-user protocols, we trained the baseline RGB model on the paired RGB data collected during our data collection sessions. Our PointVSR model and the RGB model were trained with the same hyper-parameters, i.e., a maximum of 250 epochs with a maximum learning rate of 0.01, adjusted by the OneCycleLR scheduler [49] in PyTorch, and the same spelling correction method was applied to the RGB model.
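The shared training schedule can be reproduced with the standard PyTorch scheduler as sketched below; `model`, `train_loader`, `compute_loss`, and the choice of Adam are placeholders, since only the maximum learning rate, the epoch budget, and the use of OneCycleLR are specified above.

```python
import torch

optimizer = torch.optim.Adam(model.parameters())          # optimizer choice is an assumption
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, epochs=250, steps_per_epoch=len(train_loader))

for epoch in range(250):
    for batch in train_loader:
        loss = compute_loss(model, batch)                  # hypothetical loss computation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()                                   # OneCycleLR is stepped once per batch
```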
As shown in Fig. 5 (right), PointVSR outperformed the conventional visual speech recognition method (hereafter referred to as VideoVSR) overall, with a CER of 4.13% and a WER of 8.06%, compared to VideoVSR’s CER of 7.95% and WER of 13.02%, yielding relative improvements of 48.05% in CER and 38.10% in WER. This significant improvement confirms our recognition pipeline as a promising method for enabling more precise and reliable silent speech interactions.
To investigate how recognition accuracy varies among participants, we break down the results for each participant in Fig. 6. We observed that the two native American English speakers, P2 and P6, both had better-than-average accuracies. PointVSR achieved its best performance on P10 (WER 5.10%), who is not a native English speaker but speaks English almost as fluently and accent-free as a native speaker. In contrast, P9 had significantly higher CER and WER, which we suspect were caused by their noticeable accent, which likely deviated their lip movement patterns from those of the rest of the participants, creating a data minority that posed challenges to deep learning. Therefore, we anecdotally note that proficiency in English can be one of the dominant factors in the performance of silent speech recognition, at least given a modest magnitude of data collection. This issue is not unique to PointVSR, as commodity voice recognition systems are also more likely to fail on stronger accents and oftentimes require users to speak clearly and loudly. However, we are hopeful that a more sizable data collection could mitigate the data minority problem and yield improved results, as prior deep learning inference systems in HCI have shown.
Depth sensing obtains distance information from the participant’s face, thereby contributing highly consistent spatial features across various sensor locations. To verify how this factor affects our method’s performance in real-world settings, we conducted a sensor-location analysis, breaking down the CER and WER metrics by the three sensor locations. We also evaluated the RGB-based method under the same protocol for comparison. The results of the two methods are depicted in Fig. 5 (left). Overall, PointVSR achieved consistently better performance across the three sensor locations than VideoVSR. Furthermore, PointVSR demonstrated smaller variances in performance across sensor locations compared to VideoVSR, indicating more consistent recognition against varying face orientations and sensor distances. The robustness of our method can be particularly advantageous for silent speech applications on wearable and mobile devices, as these devices are often positioned differently relative to a user’s face during use.
6.2.2 Cross-user Performance.
To gauge PointVSR’s performance when generalizing to unseen speakers not present in our dataset, we performed a 5-fold cross-user validation with each fold containing two participants’ data. Fig. 5 (right) illustrates the results. Both CER and WER increased when moving from the within-user to the cross-user protocol, because speech signals are often highly personalized, varying significantly from person to person, and thus a model can be improved by having training data from the target user. With PointVSR, the CER and WER measured 18.28% and 29.14%, respectively. Still, our method compared favorably to the RGB-based method, with a WER and a CER lower than VideoVSR’s by 4.57% and 5%, respectively. This generalizability enables our system to work better out of the box, making it easier for unseen users to access reliable silent speech interactions without the need for calibration.
In conclusion, PointVSR achieved better and more robust performance than the conventional RGB-based method. Furthermore, our model is much more lightweight than the RGB model – it has 20 million parameters, only 8% of the size of the 250-million-parameter RGB model. This result is as promising as our superior accuracy and indicates that PointVSR, which uniquely leverages depth sensing, is potentially more efficient and easier to train than conventional models using RGB data.
6.2.3 Ad-hoc Analysis of Misrecognized Words in Sentence Recognition.
We are interested in whether the misrecognized words share any common patterns. To investigate this, we used the NIST speech recognition scoring toolkit (SCTK) [40] to analyze the frequencies of error types at the word level on the raw outputs of our model (i.e., without spelling correction). Specifically, we counted three types of errors: 1) Substitution (10.7%), when one word is incorrectly replaced with another; 2) Deletion (0.7%), when a word is omitted; and 3) Insertion (0.2%), when an extra word is added. We only ran this analysis on the within-user results. Results indicate Substitution is the dominant error type, taking up 92.2% of all errors, followed by Deletion (6.0%) and Insertion (1.7%). Furthermore, we found that "that" is the most confusing word in our vocabulary, being misrecognized in 99 out of 293 occurrences. Misrecognitions include real words such as "like", "and", and "hat", as well as fabricated terms such as "TIAT", "THIT", and "TAND". These errors are reasonable, as the inaccurately recognized words are highly similar to the target, making them inherently difficult to distinguish. However, by understanding the context of naturally coherent sentences using language models, it should be possible to filter out improbable words or allow users to select from the most probable candidates. Additionally, the second most confused word, "between", is misspelled as "betwen" 45 times and "betweeen" 16 times. This type of error can be addressed by common spelling auto-correction.
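Similar word-level tallies can be reproduced without SCTK; the sketch below uses the jiwer package on a hypothetical reference/hypothesis pair (the sentences shown are illustrative, not drawn from our dataset), assuming jiwer 3.x.

```python
import jiwer

references = ["that is the point between the lines"]    # hypothetical ground-truth sentence
hypotheses = ["tiat is the point betwen the lines"]     # hypothetical raw (uncorrected) output

out = jiwer.process_words(references, hypotheses)
total_errors = out.substitutions + out.deletions + out.insertions
print(f"substitutions: {out.substitutions / total_errors:.1%}")
print(f"deletions:     {out.deletions / total_errors:.1%}")
print(f"insertions:    {out.insertions / total_errors:.1%}")
```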
6.3 Command Recognition
Our command recognition and sentence recognition are built on the same PointVSR model. To avoid duplicated insights from comparisons with the RGB method, we did not conduct a comparative evaluation, but instead focused on an in-depth analysis of how our method’s performance correlates with the viseme length of commands.
6.3.1 Command Recognition Evaluation and Results.
In addition to sentence recognition, we conducted an evaluation on command recognition, including two rounds of experiments: 5-fold within-user evaluation and cross-user evaluation. For within-user evaluation, we trained the model on 80% of each participant’s data (approximately 2139 command utterances from 10 participants) and tested it on the remaining 20% of each participant’s data (around 428 command utterances). In cross-user evaluation, we divided the dataset into a testing set, comprising data from two participants, and a training set, composed of data from the remaining eight participants. This division followed a 5-fold cross-validation approach, consistent with the methodology used in cross-user sentence recognition. Both training and testing data in these two evaluations incorporated information from the three sensor locations shown in Fig. 4.
The confusion matrix describing within-user command recognition accuracy is shown in Fig. 7. The average command recognition accuracy is 91.33% (SD = 1.44%), which serves as strong evidence of our model’s effectiveness in accurately differentiating 30 distinct commands of varying lengths. Furthermore, the confusion matrix indicates that certain commands are more susceptible to confusion than others, often due to shared viseme sequences, pronunciations, and similar lip movements. For instance, the commands "Turn On" and "Turn Off" were frequently mistaken for each other, likely due to their shared first half ("Turn") and the prolonged period required for the lips to form the round shape for the character 'O' in both "On" and "Off". For the cross-user evaluation, the command recognition accuracy decreased to 74.88% (SD = 13.47%) due to the same user variance that decreased performance in sentence recognition.
6.3.2 Correlation Analysis Between Viseme Length and Recognition Accuracy.
We observed that commands with short viseme sequences, such as "Start", "Pause", and "Search", are more error-prone than commands with longer sequences. To better illustrate this phenomenon, we drew a correlation plot of viseme length against command recognition accuracy, as displayed in Figure 7 (right). In this plot, the x-axis represents the viseme length of commands. Note that the length is not a direct conversion from phonemes; rather, it results from merging adjacent identical visemes. For example, the viseme sequence of the command "Text Dad", ['T', 'EH', 'K', 'T', 'T', 'T', 'EH', 'T'], was converted into ['T', 'EH', 'K', 'T', 'EH', 'T'] after we merged the three consecutive 'T' visemes. A clear trend in this plot shows an increase in accuracy as the viseme length of commands increases. This key finding suggests that, when designing silent speech command sets, future researchers should give preference to longer, non-overlapping commands in order to optimize accuracy.
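The merging step reduces to collapsing runs of identical adjacent visemes, as in the short sketch below, which reproduces the "Text Dad" example.

```python
from itertools import groupby

def merge_adjacent_visemes(visemes):
    """Collapse runs of identical adjacent visemes into a single viseme."""
    return [v for v, _ in groupby(visemes)]

# "Text Dad": ['T','EH','K','T','T','T','EH','T'] -> ['T','EH','K','T','EH','T'] (merged length 6)
print(merge_adjacent_visemes(['T', 'EH', 'K', 'T', 'T', 'T', 'EH', 'T']))
```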
7 Discussion
7.1 Ablation Study
We conducted an ablation study to gain deeper insights into PointVSR by measuring how each of its components contributes to the recognition performance. Broadly, our model features four key components: 1) TNet; 2) 4D point cloud convolution layer; 3) Transformer; and 4) Bidirectional GRUs. We performed ablation tests by individually removing one component at a time and evaluating its impact on performance. Following the same data split approach as the within-user 5-fold validation, we opted for a subset of 2 folds out of the total 5 and executed the experiments for each ablation test. To make a fair comparison, we kept the structures and parameters of the remaining components the same after one component was removed. We employed zero padding to expand dimensions when removing the 4D point cloud convolution and utilized adaptive pooling when removing the Transformer or GRUs to reduce dimensions. Results are summarized in Table 3.
Overall, all components in our model play a vital role in the holistic functioning of the system, as we observed significant increases in error rates when any of the components was removed, particularly when excluding either the TNet or the GRUs. In these cases, the word error rate exceeded 90%, indicating a substantial loss in the model’s ability to interpret speech based on depth visual cues. Additionally, the 4D point cloud convolution and Transformer serve as crucial feature extraction components throughout the entire pipeline; when either was removed, the model could still capture some aspects of speech, but the error rates increased substantially.
7.2 Handheld Sensor Location
We conducted an additional experiment to explore the Handheld sensor location, imitating the common way people hold their smartphones in daily use. We started by recording a new test dataset consisting of 50 new sentences from P1, following the same sentence composition rules outlined in Section 5.1. During data collection, the iPhone 12 mini was, on average, positioned 22.45 cm away from the user’s face, as determined by analyzing the depth map, similar to the use scenario shown in prior work [54]. With this dataset recorded from the handheld sensor location as a test set, we performed two distinct experiments to assess the model’s performance in different contexts. The first, within-user cross-location, utilized training datasets from P1 across the previous three sensor locations. The second experiment, cross-user cross-location, followed a similar approach, but with all data of P1 excluded from the training dataset.
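The reported face distance can be obtained directly from the depth map with a masked average, as in the minimal sketch below; the source of the face mask (e.g., a detected face region) is an assumption and is not specified by our pipeline description.

```python
import numpy as np

def mean_face_distance_cm(depth_m, face_mask):
    """Average the valid depth values (meters) within a face region and convert to centimeters."""
    valid = face_mask & (depth_m > 0)
    return float(depth_m[valid].mean()) * 100.0
```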
We observed that the within-user cross-location experiment yielded a WER of 7.25% and a CER of 4.17%. Comparatively, the within-user within-location results from the user study, averaged across the On-Wrist, In-Environment, and On-Head locations, showed a very similar WER of 8.06% and CER of 4.13%. In the cross-user cross-location scenario, we observed a WER of 28.00% and a CER of 18.99%. These outcomes also align with the earlier results from the cross-user study, where the WER was 29.14% and the CER was 18.28%. The consistent performance across multiple sensor locations suggests the robustness of our method in accommodating handheld use scenarios. This result highlights the adaptability of our method to different device form factors, orientations, and head/hand postures, and potentially to factors beyond the ones demonstrated in this work.
7.3 Power Consumption and Computational Cost
Power consumption and computational cost are important factors to consider for our method’s ecological validity. Our deep learning inference of one spoken sentence in the user study requires 66.66G floating-point operations with 20.43M parameters, which take 621 milliseconds per sentence (i.e., 150 frames) to complete on a server with four Nvidia RTX A5500 GPUs. Specifically, among these operations, TNet takes 14.62G floating-point operations with 0.81M parameters, the 4D convolution takes 0.80G floating-point operations with 0.56K parameters, the Transformer layers take 51.24G floating-point operations with 0.32M parameters, and the GRUs take 1.38M floating-point operations with 19.28M parameters.
In real-world applications of our method, the power consumption of a system would comprise two components: data acquisition and computation. To investigate data acquisition, we recorded 1 hour of continuous operation of the iPhone 12 mini’s depth camera and analyzed the Powerlog file from the Battery Life profile. The depth camera consumed 110.50 mAh (i.e., 4.96% of the iPhone 12 mini’s 2227 mAh battery capacity), which is equivalent to 0.42 Wh assuming a working voltage of 3.83 V. The estimated consumption of the computation component falls within the range of approximately 0.86 to 5.41 W for inferencing one sentence with a length of 150 frames. This estimation is based on a linear extrapolation using the results from model 2 with 5.59G floating-point operations, which requires between 0.072 and 0.454 W. Of note, these numbers are theoretical speculations, should be perceived as approximations, and require further validation through device deployment in future work.
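The linear extrapolation above reduces to scaling the reference model's measured power draw by the ratio of floating-point operations, as in the following back-of-the-envelope sketch.

```python
flops_pointvsr = 66.66e9          # FLOPs for one 150-frame sentence with our model
flops_reference = 5.59e9          # FLOPs of the reference model with measured power draw
watts_low, watts_high = 0.072, 0.454

scale = flops_pointvsr / flops_reference          # ~11.9x
print(scale * watts_low, scale * watts_high)      # ~0.86 W to ~5.41 W
```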
7.4 Performance Comparison with a Trimmed VideoVSR Model
We used a state-of-the-art video-based VSR model as a baseline to evaluate our proposed model’s effectiveness in learning from point cloud data. This might not constitute a fully fair comparison, in that video models have often been tailored to large-scale datasets and have a large number of parameters, thus demanding a lot of training data for their performance to ramp up. We conducted an additional series of tests to investigate whether the performance difference seen in previous evaluations stemmed merely from our model’s smaller size, which makes it easier to train on our relatively small dataset, or from a genuine superiority that future silent speech systems could rely on. Due to the lack of large-scale depth silent speech data, we instead optimized the video model for small-scale data. Specifically, we created a trimmed, 20M-parameter version of the VideoVSR model, aligning it in scale with our PointVSR model: we used 4 multi-head attention layers instead of 12, and reduced the latent dimension from 768 to 512 and the number of heads from 12 to 4. We then evaluated this model following the same protocols as in Sections 6.2.1 and 6.2.2. As shown in Table 4, the trimmed VideoVSR model exhibited the highest error rates, indicating that the strengths of our method originate from the superiority of the network architecture in concert with the unique leverage of depth information, rather than from the use of small-scale training data.
7.5 Privacy Concerns of Depth Sensing
RGB cameras capture full-color images that can include identifying details such as facial features, skin colors, and clothing. In contrast, depth data contains information about the shape and distance of objects without capturing detailed visual textures or colors. This makes depth sensing less likely to contain information that can be used to identify individuals. Additionally, captured background can be easily filtered out using distance thresholding on the depth map at the hardware or software level, thereby minimizing the risk of exposing unintended information. In comparison to RGB-based systems that may require complex neural networks to filter out irrelevant facial information, our depth-based approach inherently reduces the need for such complex algorithms, contributing to a more straightforward and privacy-conscious methodology. However, we acknowledge that the increasing resolution of depth cameras and the growing capabilities of deep neural networks could make depth sensing as privacy-concerning as other imaging approaches; thus, depth-based silent speech recognition systems should investigate privacy implications in user contexts.
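Such background removal amounts to a simple threshold on the depth map, as sketched below; the 0.6 m cutoff is an arbitrary illustrative value, not a parameter of our system.

```python
import numpy as np

def remove_background(depth_m, max_distance_m=0.6):
    """Zero out (discard) all depth pixels beyond a distance threshold."""
    filtered = depth_m.copy()
    filtered[filtered > max_distance_m] = 0.0
    return filtered
```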
7.6 Example Use Cases
To demonstrate our system, we developed a series of usage scenarios (Fig. 8; also see the Video Figure). Our method, as detailed in the paper, facilitated the recognition of silent speech in a wide array of use scenarios.
7.6.1 Smartwatch as a Robust Natural Language Interface to AI.
We envision a future where smartwatches could serve as a natural and always-available voice interface for users to interact with LLM-based AI agents. This requires smartwatches to have reliable sensing performance across a wide range of adversarial noise in both the audio and video channels. In noisy environments, audio signals are often polluted, posing challenges for accurate speech recognition. In contrast, depth cameras are capable of capturing precise lip movements, providing a more resilient approach to speech recognition. In the use scenario shown in Fig. 8 (A), a smartwatch prototype enhanced by our method achieves accurate and reliable speech recognition even amidst busy and noisy traffic surroundings. Though video/RGB-based silent speech recognition methods may have similar robustness in acoustically noisy environments, they might be susceptible to noise in the visible light channel and dim lighting conditions, and yield inconsistent performance across skin tones. As shown in Fig. 8 (B), our system leverages advancements in depth sensing and its robustness against the aforementioned noise, and can accurately interpret sentences.
7.6.2 Flexible Sensor Locations to Enable Devices with Various Form Factors.
Depth data maintains consistency across different sensor positions and orientations to a greater extent than RGB data, in that the spatial characteristics of objects do not depend on the viewing angle, unlike color and texture information, which can change significantly with the sensor-user perspective. In this regard, our method can yield more consistent data under the constantly changing postures with which users operate smart devices, especially those deployed in the environment. Furthermore, our method incorporates an alignment process via TNet, contributing to its robustness by accommodating variations in orientation. As illustrated in Fig. 8 (C), users can silently control devices such as AC systems, showcasing the practicality of our system in diverse settings. Beyond smart environment applications, our method can seamlessly integrate with smartphones, as shown in Fig. 8 (D), allowing existing smartphone applications to recognize speech as a natural and intuitive interaction modality.
8 Limitation and Future Work
As with all technologies, our proposed one has limitations and so does the evaluation of our work, which we acknowledge here to inspire ideas and encourage future work.
Demographic variety Foremost, the demographic variety of our participant pool is modest, which limited the further insights we could draw regarding the effect of demographic factors on our proposed visual speech recognition. We suspect that our technique will share both the strengths and the weaknesses of depth sensing, as it piggybacks on commodity depth sensing infrastructure. On the positive side, depth sensing can be as robust as FaceID, which has proven successful across demographic variances in authentication tasks. Robustness against user variance is an important promise of our proposed technique to make visual speech recognition reliable and ultimately equitable across society. Nonetheless, studies with larger numbers and a more diverse set of participants are needed in our future work.
Deployment in the wild Additionally, further insights could be drawn by deploying our system in the wild, allowing users to speak natural language at will and under a wide variety of environmental factors (e.g., vibrational noise, sun exposure) and user features (e.g., body posture, facial features). This would require our system to be implemented on a device with a proper cloud computing scheme or to run entirely standalone on the device, both of which require further system engineering that we plan to pursue in the future.
Other depth sensors Furthermore, our hardware selection is limited to the TrueDepth camera, a high-end depth camera that provides high-resolution depth data with a high SNR. However, not all smart devices can afford to be equipped with this sensor, limiting the scalability of this work to some extent. We acknowledge that future work could investigate mid- or low-end depth cameras that produce lower-resolution depth data or data with higher noise floors (i.e., low SNR) to learn about the performance of our technique on a wider spectrum of depth cameras. In this work, the input depth frames are down-sampled to 1024 points (i.e., a resolution lower by a factor of at least 3 than the original data), which imitates lower-resolution depth data. However, the effect of higher noise floors remains to be tested in the future.
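For reference, down-sampling a frame to a fixed budget of 1024 points can be as simple as the sketch below; random subsampling is shown as one possible strategy, and the exact strategy used in our implementation is not prescribed by this section.

```python
import numpy as np

def downsample_to_n(points, n=1024, seed=0):
    """Subsample (or pad by repetition) an N x 3 point cloud to exactly n points."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=n, replace=len(points) < n)
    return points[idx]
```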
Improvement with more training data Our model is more parameter-efficient while outperforming the conventional RGB model in recognition tasks. The lightweight model has lower computational costs and is thus easier to deploy on edge devices. However, since there are no existing large-scale point cloud datasets for speech recognition tasks, we were unable to assess our model’s capability when scaling up the number of its parameters together with its training data. Recent research on machine learning has shown that Transformers are data-efficient and scalable [20, 66], and we anticipate that increasing the complexity of the model (e.g., using more heads and larger latent space dimensions in the multi-head attention architecture) should enhance the model’s capacity to fit more training data; we leave this for future exploration.
Beyond speech recognition Finally, depth data of human speech could lead to a wider array of use cases beyond speech recognition, such as health care, education, and embodied AI. Though establishing a dataset of human speech depth data is not within the scope of this work, we release the depth data we have collected under the approval of our institute’s IRB, as well as the source code of our system implementation, to facilitate the growth of this emerging field of research. Our open-source repository is available at: https://github.com/hilab-open-source/WatchYourMouth.