Search | arXiv e-print repository

Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation

Authors: Heeseung Kim, Soonshin Seo, Kyeongseok Jeong, Ohsung Kwon, Soyoon Kim, Jungwhan Kim, Jaehong Lee, Eunwoo Song, Myungwoo Oh, Jung-Woo Ha, Sungroh Yoon, Kang Min Yoo

Abstract: Recent work shows promising results in expanding the capabilities of large language models (LLM) to directly understand and synthesize speech. However, an LLM-based strategy for modeling spoken dialogs remains elusive, calling for further investigation. This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM), designed to generate coherent spoken respons… ▽ More Recent work shows promising results in expanding the capabilities of large language models (LLM) to directly understand and synthesize speech. However, an LLM-based strategy for modeling spoken dialogs remains elusive, calling for further investigation. This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM), designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech without relying on explicit automatic speech recognition (ASR) or text-to-speech (TTS) systems. We have verified the inclusion of prosody in speech tokens that predominantly contain semantic information and have used this foundation to construct a prosody-infused speech-text model. Additionally, we propose a generalized speech-text pretraining scheme that enhances the capture of cross-modal semantics. To construct USDM, we fine-tune our speech-text model on spoken dialog data using a multi-step spoken dialog template that stimulates the chain-of-reasoning capabilities exhibited by the underlying LLM. Automatic and human evaluations on the DailyTalk dataset demonstrate that our approach effectively generates natural-sounding spoken responses, surpassing previous and cascaded baselines. We will make our code and checkpoints publicly available. △ Less

Submitted 26 August, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

arXiv:2311.01279 [pdf, other]

doi 10.1145/3583740.3626819

ExPECA: An Experimental Platform for Trustworthy Edge Computing Applications

Authors: Samie Mostafavi, Vishnu Narayanan Moothedath, Stefan Rönngren, Neelabhro Roy, Gourav Prateek Sharma, Sangwon Seo, Manuel Olguín Muñoz, James Gross

Abstract: This paper presents ExPECA, an edge computing and wireless communication research testbed designed to tackle two pressing challenges: comprehensive end-to-end experimentation and high levels of experimental reproducibility. Leveraging OpenStack-based Chameleon Infrastructure (CHI) framework for its proven flexibility and ease of operation, ExPECA is located in a unique, isolated underground facili… ▽ More This paper presents ExPECA, an edge computing and wireless communication research testbed designed to tackle two pressing challenges: comprehensive end-to-end experimentation and high levels of experimental reproducibility. Leveraging OpenStack-based Chameleon Infrastructure (CHI) framework for its proven flexibility and ease of operation, ExPECA is located in a unique, isolated underground facility, providing a highly controlled setting for wireless experiments. The testbed is engineered to facilitate integrated studies of both communication and computation, offering a diverse array of Software-Defined Radios (SDR) and Commercial Off-The-Shelf (COTS) wireless and wired links, as well as containerized computational environments. We exemplify the experimental possibilities of the testbed using OpenRTiST, a latency-sensitive, bandwidth-intensive application, and analyze its performance. Lastly, we highlight an array of research domains and experimental setups that stand to gain from ExPECA's features, including closed-loop applications and time-sensitive networking. △ Less

Submitted 2 November, 2023; originally announced November 2023.

arXiv:2307.13241 [pdf, other]

A Visual Quality Assessment Method for Raster Images in Scanned Document

Authors: Justin Yang, Peter Bauer, Todd Harris, Changhyung Lee, Hyeon Seok Seo, Jan P Allebach, Fengqing Zhu

Abstract: Image quality assessment (IQA) is an active research area in the field of image processing. Most prior works focus on visual quality of natural images captured by cameras. In this paper, we explore visual quality of scanned documents, focusing on raster image areas. Different from many existing works which aim to estimate a visual quality score, we propose a machine learning based classification m… ▽ More Image quality assessment (IQA) is an active research area in the field of image processing. Most prior works focus on visual quality of natural images captured by cameras. In this paper, we explore visual quality of scanned documents, focusing on raster image areas. Different from many existing works which aim to estimate a visual quality score, we propose a machine learning based classification method to determine whether the visual quality of a scanned raster image at a given resolution setting is acceptable. We conduct a psychophysical study to determine the acceptability at different image resolutions based on human subject ratings and use them as the ground truth to train our machine learning model. However, this dataset is unbalanced as most images were rated as visually acceptable. To address the data imbalance problem, we introduce several noise models to simulate the degradation of image quality during the scanning process. Our results show that by including augmented data in training, we can significantly improve the performance of the classifier to determine whether the visual quality of raster images in a scanned document is acceptable or not for a given resolution setting. △ Less

Submitted 25 July, 2023; originally announced July 2023.

arXiv:2306.01296 [pdf, other]

doi 10.21437/Interspeech.2023-361

Improved Training for End-to-End Streaming Automatic Speech Recognition Model with Punctuation

Authors: Hanbyul Kim, Seunghyun Seo, Lukas Lee, Seolki Baek

Abstract: Punctuated text prediction is crucial for automatic speech recognition as it enhances readability and impacts downstream natural language processing tasks. In streaming scenarios, the ability to predict punctuation in real-time is particularly desirable but presents a difficult technical challenge. In this work, we propose a method for predicting punctuated text from input speech using a chunk-bas… ▽ More Punctuated text prediction is crucial for automatic speech recognition as it enhances readability and impacts downstream natural language processing tasks. In streaming scenarios, the ability to predict punctuation in real-time is particularly desirable but presents a difficult technical challenge. In this work, we propose a method for predicting punctuated text from input speech using a chunk-based Transformer encoder trained with Connectionist Temporal Classification (CTC) loss. The acoustic model trained with long sequences by concatenating the input and target sequences can learn punctuation marks attached to the end of sentences more effectively. Additionally, by combining CTC losses on the chunks and utterances, we achieved both the improved F1 score of punctuation prediction and Word Error Rate (WER). △ Less

Submitted 2 June, 2023; originally announced June 2023.

Comments: Accepted at INTERSPEECH 2023

Journal ref: Proc. INTERSPEECH 2023, 1653-1657

arXiv:2306.00680 [pdf, other]

Encoder-decoder multimodal speaker change detection

Authors: Jee-weon Jung, Soonshin Seo, Hee-Soo Heo, Geonmin Kim, You Jin Kim, Young-ki Kwon, Minjae Lee, Bong-Jin Lee

Abstract: The task of speaker change detection (SCD), which detects points where speakers change in an input, is essential for several applications. Several studies solved the SCD task using audio inputs only and have shown limited performance. Recently, multimodal SCD (MMSCD) models, which utilise text modality in addition to audio, have shown improved performance. In this study, the proposed model are bui… ▽ More The task of speaker change detection (SCD), which detects points where speakers change in an input, is essential for several applications. Several studies solved the SCD task using audio inputs only and have shown limited performance. Recently, multimodal SCD (MMSCD) models, which utilise text modality in addition to audio, have shown improved performance. In this study, the proposed model are built upon two main proposals, a novel mechanism for modality fusion and the adoption of a encoder-decoder architecture. Different to previous MMSCD works that extract speaker embeddings from extremely short audio segments, aligned to a single word, we use a speaker embedding extracted from 1.5s. A transformer decoder layer further improves the performance of an encoder-only MMSCD model. The proposed model achieves state-of-the-art results among studies that report SCD performance and is also on par with recent work that combines SCD with automatic speech recognition via human transcription. △ Less

Submitted 1 June, 2023; originally announced June 2023.

Comments: 5 pages, accepted for presentation at INTERSPEECH 2023

arXiv:2210.17017 [pdf, other]

Blank Collapse: Compressing CTC emission for the faster decoding

Authors: Minkyu Jung, Ohhyeok Kwon, Seunghyun Seo, Soonshin Seo

Abstract: Connectionist Temporal Classification (CTC) model is a very efficient method for modeling sequences, especially for speech data. In order to use CTC model as an Automatic Speech Recognition (ASR) task, the beam search decoding with an external language model like n-gram LM is necessary to obtain reasonable results. In this paper we analyze the blank label in CTC beam search deeply and propose a ve… ▽ More Connectionist Temporal Classification (CTC) model is a very efficient method for modeling sequences, especially for speech data. In order to use CTC model as an Automatic Speech Recognition (ASR) task, the beam search decoding with an external language model like n-gram LM is necessary to obtain reasonable results. In this paper we analyze the blank label in CTC beam search deeply and propose a very simple method to reduce the amount of calculation resulting in faster beam search decoding speed. With this method, we can get up to 78% faster decoding speed than ordinary beam search decoding with a very small loss of accuracy in LibriSpeech datasets. We prove this method is effective not only practically by experiments but also theoretically by mathematical reasoning. We also observe that this reduction is more obvious if the accuracy of the model is higher. △ Less

Submitted 26 June, 2023; v1 submitted 30 October, 2022; originally announced October 2022.

Comments: Accepted in Interspeech 2023

arXiv:2203.00899 [pdf, other]

doi 10.48550/arXiv.2203.00899

Machine learning based lens-free imaging technique for field-portable cytometry

Authors: Rajkumar Vaghashiya, Sanghoon Shin, Varun Chauhan, Kaushal Kapadiya, Smit Sanghavi, Sungkyu Seo, Mohendra Roy

Abstract: Lens-free Shadow Imaging Technique (LSIT) is a well-established technique for the characterization of microparticles and biological cells. Due to its simplicity and cost-effectiveness, various low-cost solutions have been evolved, such as automatic analysis of complete blood count (CBC), cell viability, 2D cell morphology, 3D cell tomography, etc. The developed auto characterization algorithm so f… ▽ More Lens-free Shadow Imaging Technique (LSIT) is a well-established technique for the characterization of microparticles and biological cells. Due to its simplicity and cost-effectiveness, various low-cost solutions have been evolved, such as automatic analysis of complete blood count (CBC), cell viability, 2D cell morphology, 3D cell tomography, etc. The developed auto characterization algorithm so far for this custom-developed LSIT cytometer was based on the hand-crafted features of the cell diffraction patterns from the LSIT cytometer, that were determined from our empirical findings on thousands of samples of individual cell types, which limit the system in terms of induction of a new cell type for auto classification or characterization. Further, its performance is suffering from poor image (cell diffraction pattern) signatures due to its small signal or background noise. In this work, we address these issues by leveraging the artificial intelligence-powered auto signal enhancing scheme such as denoising autoencoder and adaptive cell characterization technique based on the transfer of learning in deep neural networks. The performance of our proposed method shows an increase in accuracy >98% along with the signal enhancement of >5 dB for most of the cell types, such as Red Blood Cell (RBC) and White Blood Cell (WBC). Furthermore, the model is adaptive to learn new type of samples within a few learning iterations and able to successfully classify the newly introduced sample along with the existing other sample types. △ Less

Submitted 2 March, 2022; v1 submitted 2 March, 2022; originally announced March 2022.

Comments: Published in Biosensors Journal

Journal ref: https://www.mdpi.com/2079-6374/12/3/144

arXiv:2112.15399 [pdf, other]

InfoNeRF: Ray Entropy Minimization for Few-Shot Neural Volume Rendering

Authors: Mijeong Kim, Seonguk Seo, Bohyung Han

Abstract: We present an information-theoretic regularization technique for few-shot novel view synthesis based on neural implicit representation. The proposed approach minimizes potential reconstruction inconsistency that happens due to insufficient viewpoints by imposing the entropy constraint of the density in each ray. In addition, to alleviate the potential degenerate issue when all training images are… ▽ More We present an information-theoretic regularization technique for few-shot novel view synthesis based on neural implicit representation. The proposed approach minimizes potential reconstruction inconsistency that happens due to insufficient viewpoints by imposing the entropy constraint of the density in each ray. In addition, to alleviate the potential degenerate issue when all training images are acquired from almost redundant viewpoints, we further incorporate the spatially smoothness constraint into the estimated images by restricting information gains from a pair of rays with slightly different viewpoints. The main idea of our algorithm is to make reconstructed scenes compact along individual rays and consistent across rays in the neighborhood. The proposed regularizers can be plugged into most of existing neural volume rendering techniques based on NeRF in a straightforward way. Despite its simplicity, we achieve consistently improved performance compared to existing neural view synthesis methods by large margins on multiple standard benchmarks. △ Less

Submitted 10 April, 2022; v1 submitted 31 December, 2021; originally announced December 2021.

Comments: CVPR 2022, Website: http://cv.snu.ac.kr/research/InfoNeRF

arXiv:2105.02400 [pdf, other]

SIPSA-Net: Shift-Invariant Pan Sharpening with Moving Object Alignment for Satellite Imagery

Authors: Jaehyup Lee, Soomin Seo, Munchurl Kim

Abstract: Pan-sharpening is a process of merging a high-resolution (HR) panchromatic (PAN) image and its corresponding low-resolution (LR) multi-spectral (MS) image to create an HR-MS and pan-sharpened image. However, due to the different sensors' locations, characteristics and acquisition time, PAN and MS image pairs often tend to have various amounts of misalignment. Conventional deep-learning-based metho… ▽ More Pan-sharpening is a process of merging a high-resolution (HR) panchromatic (PAN) image and its corresponding low-resolution (LR) multi-spectral (MS) image to create an HR-MS and pan-sharpened image. However, due to the different sensors' locations, characteristics and acquisition time, PAN and MS image pairs often tend to have various amounts of misalignment. Conventional deep-learning-based methods that were trained with such misaligned PAN-MS image pairs suffer from diverse artifacts such as double-edge and blur artifacts in the resultant PAN-sharpened images. In this paper, we propose a novel framework called shift-invariant pan-sharpening with moving object alignment (SIPSA-Net) which is the first method to take into account such large misalignment of moving object regions for PAN sharpening. The SISPA-Net has a feature alignment module (FAM) that can adjust one feature to be aligned to another feature, even between the two different PAN and MS domains. For better alignment in pan-sharpened images, a shift-invariant spectral loss is newly designed, which ignores the inherent misalignment in the original MS input, thereby having the same effect as optimizing the spectral loss with a well-aligned MS image. Extensive experimental results show that our SIPSA-Net can generate pan-sharpened images with remarkable improvements in terms of visual quality and alignment, compared to the state-of-the-art methods. △ Less

Submitted 5 May, 2021; originally announced May 2021.

Comments: Accepted to CVPR 2021

arXiv:2104.07253 [pdf, other]

Integration of Pre-trained Networks with Continuous Token Interface for End-to-End Spoken Language Understanding

Authors: Seunghyun Seo, Donghyun Kwak, Bowon Lee

Abstract: Most End-to-End (E2E) SLU networks leverage the pre-trained ASR networks but still lack the capability to understand the semantics of utterances, crucial for the SLU task. To solve this, recently proposed studies use pre-trained NLU networks. However, it is not trivial to fully utilize both pre-trained networks; many solutions were proposed, such as Knowledge Distillation, cross-modal shared embed… ▽ More Most End-to-End (E2E) SLU networks leverage the pre-trained ASR networks but still lack the capability to understand the semantics of utterances, crucial for the SLU task. To solve this, recently proposed studies use pre-trained NLU networks. However, it is not trivial to fully utilize both pre-trained networks; many solutions were proposed, such as Knowledge Distillation, cross-modal shared embedding, and network integration with Interface. We propose a simple and robust integration method for the E2E SLU network with novel Interface, Continuous Token Interface (CTI), the junctional representation of the ASR and NLU networks when both networks are pre-trained with the same vocabulary. Because the only difference is the noise level, we directly feed the ASR network's output to the NLU network. Thus, we can train our SLU network in an E2E manner without additional modules, such as Gumbel-Softmax. We evaluate our model using SLURP, a challenging SLU dataset and achieve state-of-the-art scores on both intent classification and slot filling tasks. We also verify the NLU network, pre-trained with Masked Language Model, can utilize a noisy textual representation of CTI. Moreover, we show our model can be trained with multi-task learning from heterogeneous data even after integration with CTI. △ Less

Submitted 16 February, 2022; v1 submitted 15 April, 2021; originally announced April 2021.

Comments: Accepted for ICASSP 2022

arXiv:2101.11469 [pdf, ps, other]

VOTE400(Voide Of The Elderly 400 Hours): A Speech Dataset to Study Voice Interface for Elderly-Care

Authors: Minsu Jang, Sangwon Seo, Dohyung Kim, Jaeyeon Lee, Jaehong Kim, Jun-Hwan Ahn

Abstract: This paper introduces a large-scale Korean speech dataset, called VOTE400, that can be used for analyzing and recognizing voices of the elderly people. The dataset includes about 300 hours of continuous dialog speech and 100 hours of read speech, both recorded by the elderly people aged 65 years or over. A preliminary experiment showed that speech recognition system trained with VOTE400 can outper… ▽ More This paper introduces a large-scale Korean speech dataset, called VOTE400, that can be used for analyzing and recognizing voices of the elderly people. The dataset includes about 300 hours of continuous dialog speech and 100 hours of read speech, both recorded by the elderly people aged 65 years or over. A preliminary experiment showed that speech recognition system trained with VOTE400 can outperform conventional systems in speech recognition of elderly people's voice. This work is a multi-organizational effort led by ETRI and MINDs Lab Inc. for the purpose of advancing the speech recognition performance of the elderly-care robots. △ Less

Submitted 20 January, 2021; originally announced January 2021.

Comments: 3 pages, 7 tables

arXiv:2007.13350 [pdf]

Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Normalization for End-to-End Speaker Verification System

Authors: Soonshin Seo, Ji-Hwan Kim

Abstract: One of the most important parts of an end-to-end speaker verification system is the speaker embedding generation. In our previous paper, we reported that shortcut connections-based multi-layer aggregation improves the representational power of the speaker embedding. However, the number of model parameters is relatively large and the unspecified variations increase in the multi-layer aggregation. T… ▽ More One of the most important parts of an end-to-end speaker verification system is the speaker embedding generation. In our previous paper, we reported that shortcut connections-based multi-layer aggregation improves the representational power of the speaker embedding. However, the number of model parameters is relatively large and the unspecified variations increase in the multi-layer aggregation. Therefore, we propose a self-attentive multi-layer aggregation with feature recalibration and normalization for end-to-end speaker verification system. To reduce the number of model parameters, the ResNet, which scaled channel width and layer depth, is used as a baseline. To control the variability in the training, a self-attention mechanism is applied to perform the multi-layer aggregation with dropout regularizations and batch normalizations. Then, a feature recalibration layer is applied to the aggregated feature using fully-connected layers and nonlinear activation functions. Deep length normalization is also used on a recalibrated feature in the end-to-end training process. Experimental results using the VoxCeleb1 evaluation dataset showed that the performance of the proposed methods was comparable to that of state-of-the-art models (equal error rate of 4.95% and 2.86%, using the VoxCeleb1 and VoxCeleb2 training datasets, respectively). △ Less

Submitted 28 July, 2020; v1 submitted 27 July, 2020; originally announced July 2020.

Comments: 5 pages, 1 figures, 4 tables

arXiv:2006.16583 [pdf, other]

Pan-Sharpening with Color-Aware Perceptual Loss and Guided Re-Colorization

Authors: Juan Luis Gonzalez Bello, Soomin Seo, Munchurl Kim

Abstract: We present a novel color-aware perceptual (CAP) loss for learning the task of pan-sharpening. Our CAP loss is designed to focus on the deep features of a pre-trained VGG network that are more sensitive to spatial details and ignore color information to allow the network to extract the structural information from the PAN image while keeping the color from the lower resolution MS image. Additionally… ▽ More We present a novel color-aware perceptual (CAP) loss for learning the task of pan-sharpening. Our CAP loss is designed to focus on the deep features of a pre-trained VGG network that are more sensitive to spatial details and ignore color information to allow the network to extract the structural information from the PAN image while keeping the color from the lower resolution MS image. Additionally, we propose "guided re-colorization", which generates a pan-sharpened image with real colors from the MS input by "picking" the closest MS pixel color for each pan-sharpened pixel, as a human operator would do in manual colorization. Such a re-colorized (RC) image is completely aligned with the pan-sharpened (PS) network output and can be used as a self-supervision signal during training, or to enhance the colors in the PS image during test. We present several experiments where our network trained with our CAP loss generates naturally looking pan-sharpened images with fewer artifacts and outperforms the state-of-the-arts on the WorldView3 dataset in terms of ERGAS, SCC, and QNR metrics. △ Less

Submitted 30 June, 2020; originally announced June 2020.

arXiv:2004.09239 [pdf]

Firefly-Algorithm Supported Scheme to Detect COVID-19 Lesion in Lung CT Scan Images using Shannon Entropy and Markov-Random-Field

Authors: Venkatesan Rajinikanth, Seifedine Kadry, Krishnan Palani Thanaraj, Krishnamurthy Kamalanand, Sanghyun Seo

Abstract: The pneumonia caused by Coronavirus disease (COVID-19) is one of major global threat and a number of detection and treatment procedures are suggested by the researchers for COVID-19. The proposed work aims to suggest an automated image processing scheme to extract the COVID-19 lesion from the lung CT scan images (CTI) recorded from the patients. This scheme implements the following procedures; (i)… ▽ More The pneumonia caused by Coronavirus disease (COVID-19) is one of major global threat and a number of detection and treatment procedures are suggested by the researchers for COVID-19. The proposed work aims to suggest an automated image processing scheme to extract the COVID-19 lesion from the lung CT scan images (CTI) recorded from the patients. This scheme implements the following procedures; (i) Image pre-processing to enhance the COVID-19 lesions, (ii) Image post-processing to extract the lesions, and (iii) Execution of a relative analysis between the extracted lesion segment and the Ground-Truth-Image (GTI). This work implements Firefly Algorithm and Shannon Entropy (FA+SE) based multi-threshold to enhance the pneumonia lesion and implements Markov-Random-Field (MRF) segmentation to extract the lesions with better accuracy. The proposed scheme is tested and validated using a class of COVID-19 CTI obtained from the existing image datasets and the experimental outcome is appraised to authenticate the clinical significance of the proposed scheme. The proposed work helped to attain a mean accuracy of >92% during COVID-19 lesion segmentation and in future, it can be used to examine the real clinical lung CTI of COVID-19 patients. △ Less

Submitted 14 April, 2020; originally announced April 2020.

Comments: 12 pages

arXiv:2001.10817 [pdf]

MCSAE: Masked Cross Self-Attentive Encoding for Speaker Embedding

Authors: Soonshin Seo, Ji-Hwan Kim

Abstract: In general, a self-attention mechanism has been applied for speaker embedding encoding. Previous studies focused on training the self-attention in a high-level layer, such as the last pooling layer. However, the effect of low-level features was reduced in the speaker embedding encoding. Therefore, we propose masked cross self-attentive encoding (MCSAE) using ResNet. It focuses on the features of b… ▽ More In general, a self-attention mechanism has been applied for speaker embedding encoding. Previous studies focused on training the self-attention in a high-level layer, such as the last pooling layer. However, the effect of low-level features was reduced in the speaker embedding encoding. Therefore, we propose masked cross self-attentive encoding (MCSAE) using ResNet. It focuses on the features of both high-level and lowlevel layers. Based on multi-layer aggregation, the output features of each residual layer are used for the MCSAE. In the MCSAE, cross self-attention module is trained the interdependence of each input features. A random masking regularization module also applied to preventing overfitting problem. As such, the MCSAE enhances the weight of frames representing the speaker information. Then, the output features are concatenated and encoded to the speaker embedding. Therefore, a more informative speaker embedding is encoded by using the MCSAE. The experimental results showed an equal error rate of 2.63% and a minimum detection cost function of 0.1453 using the VoxCeleb1 evaluation dataset. These were improved performances compared with the previous self-attentive encoding and state-of-the-art encoding methods. △ Less

Submitted 28 July, 2020; v1 submitted 27 January, 2020; originally announced January 2020.

Comments: 5 pages, 3 figures, 4 tables

arXiv:1904.03814 [pdf, other]

Temporal Convolution for Real-time Keyword Spotting on Mobile Devices

Authors: Seungwoo Choi, Seokjun Seo, Beomjun Shin, Hyeongmin Byun, Martin Kersner, Beomsu Kim, Dongyoung Kim, Sungjoo Ha

Abstract: Keyword spotting (KWS) plays a critical role in enabling speech-based user interactions on smart devices. Recent developments in the field of deep learning have led to wide adoption of convolutional neural networks (CNNs) in KWS systems due to their exceptional accuracy and robustness. The main challenge faced by KWS systems is the trade-off between high accuracy and low latency. Unfortunately, th… ▽ More Keyword spotting (KWS) plays a critical role in enabling speech-based user interactions on smart devices. Recent developments in the field of deep learning have led to wide adoption of convolutional neural networks (CNNs) in KWS systems due to their exceptional accuracy and robustness. The main challenge faced by KWS systems is the trade-off between high accuracy and low latency. Unfortunately, there has been little quantitative analysis of the actual latency of KWS models on mobile devices. This is especially concerning since conventional convolution-based KWS approaches are known to require a large number of operations to attain an adequate level of performance. In this paper, we propose a temporal convolution for real-time KWS on mobile devices. Unlike most of the 2D convolution-based KWS approaches that require a deep architecture to fully capture both low- and high-frequency domains, we exploit temporal convolutions with a compact ResNet architecture. In Google Speech Command Dataset, we achieve more than \textbf{385x} speedup on Google Pixel 1 and surpass the accuracy compared to the state-of-the-art model. In addition, we release the implementation of the proposed and the baseline models including an end-to-end pipeline for training models and evaluating them on mobile devices. △ Less

Submitted 18 November, 2019; v1 submitted 7 April, 2019; originally announced April 2019.

Comments: In INTERSPEECH 2019

Showing 1–16 of 16 results for author: Seo, S