DeFTAN-II: Efficient Multichannel Speech Enhancement With Subgroup Processing

Published: 30 October 2024

Abstract

In this work, we present DeFTAN-II, an efficient multichannel speech enhancement model based on the transformer architecture and subgroup processing. Despite the success of transformers in speech enhancement, they face challenges in capturing local relations while keeping computational complexity and memory usage low. To address these limitations, we introduce subgroup processing into our model, combining subgroups of locally emphasized features with other subgroups containing the original features. Subgroup processing is implemented in several blocks of the proposed network. In the proposed split dense blocks extracting spatial features, a pair of subgroups is sequentially concatenated and processed by convolution layers, effectively reducing computational complexity and memory usage. In the F- and T-transformers, which extract spectral and temporal relations, we introduce cross-attention between subgroups to identify relationships between locally emphasized and non-emphasized features. The dual-path feedforward network then aggregates the attended features through gating with local features processed by dilated convolutions. Through extensive comparisons with state-of-the-art multichannel speech enhancement models, we demonstrate that DeFTAN-II with subgroup processing outperforms existing methods at significantly lower computational complexity. Moreover, we evaluate the model's generalization capability on real-world data without fine-tuning, which further demonstrates its effectiveness in practical scenarios.
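
To make the subgroup idea concrete, the following is a minimal PyTorch-style sketch of cross-attention between a locally emphasized subgroup and an unmodified subgroup, following the description in the abstract. The module name (SubgroupCrossAttention), tensor shapes, depthwise kernel size, and number of attention heads are illustrative assumptions and are not taken from the authors' implementation.

# Hypothetical sketch of subgroup cross-attention: one subgroup is locally
# emphasized by a depthwise convolution, and cross-attention relates it to the
# non-emphasized subgroup. Names and hyperparameters are illustrative only.
import torch
import torch.nn as nn


class SubgroupCrossAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4, kernel_size: int = 3):
        super().__init__()
        half = channels // 2
        # Depthwise convolution emphasizes local (neighboring-frame) structure
        # in the first subgroup only.
        self.local_emphasis = nn.Conv1d(
            half, half, kernel_size, padding=kernel_size // 2, groups=half
        )
        # Cross-attention: the emphasized subgroup queries the original subgroup.
        self.cross_attn = nn.MultiheadAttention(half, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(half)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence, channels); split channels into two subgroups.
        a, b = x.chunk(2, dim=-1)
        # Emphasize local relations in subgroup "a" (Conv1d expects (B, C, T)).
        a_local = self.local_emphasis(a.transpose(1, 2)).transpose(1, 2)
        # Query with the locally emphasized subgroup, attend over the original one.
        attended, _ = self.cross_attn(self.norm(a_local), b, b)
        # Recombine the subgroups along the channel dimension.
        return torch.cat([a + attended, b], dim=-1)


if __name__ == "__main__":
    block = SubgroupCrossAttention(channels=64)
    frames = torch.randn(2, 100, 64)   # (batch, time frames, features)
    print(block(frames).shape)         # torch.Size([2, 100, 64])

In this sketch the locally emphasized subgroup supplies the queries while the original subgroup supplies the keys and values, mirroring the stated goal of relating locally emphasized and non-emphasized features; the split dense blocks and the dual-path feedforward network mentioned in the abstract are not modeled here.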

Published In

IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 32, 2024
5088 pages
ISSN: 2329-9290
EISSN: 2329-9304

Publisher

IEEE Press

Publication History

Published: 30 October 2024
Published in TASLP Volume 32

Qualifiers

  • Research-article
