
A Comparative Study of Speaker Role Identification in Air Traffic Communication Using Deep Learning Approaches

Published: 24 March 2023

    Abstract

    Automatic spoken instruction understanding (SIU) of controller-pilot conversations in air traffic control (ATC) requires not only recognizing the words and semantics of the speech but also determining the role of the speaker. However, few published works on automatic understanding systems for air traffic communication focus on speaker role identification (SRI). In this article, we formulate the SRI task of controller-pilot communication as a binary classification problem. Furthermore, text-based, speech-based, and speech-and-text-based multi-modal methods are proposed to achieve a comprehensive comparison on the SRI task. To ablate the impact of the comparative approaches, various advanced neural network architectures are applied to optimize the implementations of the text-based and speech-based methods. Most importantly, a multi-modal speaker role identification network (MMSRINet) is designed to achieve the SRI task by considering both speech and textual modality features. To aggregate modality features, a modal fusion module is proposed to fuse and squeeze acoustic and textual representations by a modal attention mechanism and a self-attention pooling layer, respectively. Finally, the comparative approaches are validated on the ATCSpeech corpus collected from a real-world ATC environment. The experimental results demonstrate that all the comparative approaches work for the SRI task, and the proposed MMSRINet shows competitive performance and robustness compared with the other methods on both seen and unseen data, achieving 98.56% and 98.08% accuracy, respectively.
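    The fusion scheme described in the abstract can be pictured with a short sketch: a self-attention pooling layer squeezes each modality's sequence into a single vector, and a modal attention mechanism weighs the pooled acoustic and textual vectors before a binary (controller vs. pilot) classifier. The PyTorch snippet below is a minimal sketch built from the abstract alone, not the authors' MMSRINet implementation; the layer sizes, the gating form of the modal attention, and the label order are all assumptions.

```python
# Minimal, illustrative sketch of modal-attention fusion with self-attention
# pooling for binary speaker role identification. All names, dimensions, and
# the exact wiring are assumptions, not the published MMSRINet code.
import torch
import torch.nn as nn


class SelfAttentionPooling(nn.Module):
    """Squeeze a (batch, time, dim) sequence into (batch, dim) with learned weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.score(x), dim=1)   # (B, T, 1), attention over time
        return (weights * x).sum(dim=1)                 # (B, D)


class ModalAttentionFusion(nn.Module):
    """Fuse pooled acoustic and textual vectors with learned modality weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 2)               # one weight per modality

    def forward(self, acoustic: torch.Tensor, textual: torch.Tensor) -> torch.Tensor:
        alpha = torch.softmax(self.gate(torch.cat([acoustic, textual], dim=-1)), dim=-1)
        stacked = torch.stack([acoustic, textual], dim=1)          # (B, 2, D)
        return (alpha.unsqueeze(-1) * stacked).sum(dim=1)          # (B, D)


class SpeakerRoleClassifier(nn.Module):
    """Binary head over fused features; label order (controller=0, pilot=1) is assumed."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.pool_audio = SelfAttentionPooling(dim)
        self.pool_text = SelfAttentionPooling(dim)
        self.fusion = ModalAttentionFusion(dim)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, audio_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        fused = self.fusion(self.pool_audio(audio_feats), self.pool_text(text_feats))
        return self.head(fused)                                    # (B, 2) logits


if __name__ == "__main__":
    model = SpeakerRoleClassifier(dim=256)
    audio = torch.randn(4, 120, 256)   # stand-in for frame-level acoustic encoder outputs
    text = torch.randn(4, 20, 256)     # stand-in for token-level text encoder outputs
    print(model(audio, text).shape)    # torch.Size([4, 2])
```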


    Cited By

    • (2023) Speech Recognition for Air Traffic Control via Feature Learning and End-to-End Training. IEICE Transactions on Information and Systems, E106.D(4), 538–544. DOI: 10.1587/transinf.2022EDP7151. Online publication date: 1 April 2023.
    • (2023) M2ATS: A Real-world Multimodal Air Traffic Situation Benchmark Dataset and Beyond. Proceedings of the 31st ACM International Conference on Multimedia, 213–221. DOI: 10.1145/3581783.3613759. Online publication date: 26 October 2023.
    • (2023) Boosting Low-Resource Speech Recognition in Air Traffic Communication via Pretrained Feature Aggregation and Multi-Task Learning. IEEE Transactions on Circuits and Systems II: Express Briefs, 70(9), 3714–3718. DOI: 10.1109/TCSII.2023.3269051. Online publication date: October 2023.


      Published In

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 4
      April 2023, 682 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/3588902

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 24 March 2023
      Online AM: 24 November 2022
      Accepted: 18 November 2022
      Revised: 17 September 2022
      Received: 21 September 2021
      Published in TALLIP Volume 22, Issue 4


      Author Tags

      1. Speaker role identification
      2. air traffic control
      3. text classification
      4. speech classification
      5. spoken instruction understanding
      6. multi-modal learning

      Qualifiers

      • Research-article

      Funding Sources

      • National Natural Science Foundation of China
      • Open Fund of Key Laboratory of Flight Techniques and Flight Safety, Civil Aviation Administration of China (CAAC)
      • Fundamental Research Funds for the Central Universities

      Article Metrics

      • Downloads (Last 12 months): 93
      • Downloads (Last 6 weeks): 8
      Reflects downloads up to 26 Jul 2024

