
A Comparative Study of Speaker Role Identification in Air Traffic Communication Using Deep Learning Approaches

Published: 24 March 2023

    Abstract

    Automatic spoken instruction understanding (SIU) of controller-pilot conversations in air traffic control (ATC) requires not only recognizing the words and semantics of the speech but also determining the role of the speaker. However, few published works on automatic understanding systems for air traffic communication focus on speaker role identification (SRI). In this article, we formulate the SRI task of controller-pilot communication as a binary classification problem. Furthermore, text-based, speech-based, and speech-and-text-based multi-modal methods are proposed to achieve a comprehensive comparison on the SRI task. To ablate the impact of the comparative approaches, various advanced neural network architectures are applied to optimize the implementations of the text-based and speech-based methods. Most importantly, a multi-modal speaker role identification network (MMSRINet) is designed to achieve the SRI task by considering both speech and textual modality features. To aggregate modality features, a modal fusion module is proposed to fuse and squeeze acoustic and textual representations by a modal attention mechanism and a self-attention pooling layer, respectively. Finally, the comparative approaches are validated on the ATCSpeech corpus collected from a real-world ATC environment. The experimental results demonstrate that all the comparative approaches work for the SRI task, and the proposed MMSRINet shows competitive performance and robustness compared with the other methods on both seen and unseen data, achieving 98.56% and 98.08% accuracy, respectively.
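    The fusion scheme described in the abstract can be pictured with a short sketch: a self-attention pooling layer squeezes each modality's sequence into a single vector, and a modal attention mechanism weighs the pooled acoustic and textual vectors before a binary (controller vs. pilot) classifier. The PyTorch snippet below is a minimal sketch built from the abstract alone, not the authors' MMSRINet implementation; the layer sizes, the gating form of the modal attention, and the label order are all assumptions.

```python
# Minimal, illustrative sketch of modal-attention fusion with self-attention
# pooling for binary speaker role identification. All names, dimensions, and
# the exact wiring are assumptions, not the published MMSRINet code.
import torch
import torch.nn as nn


class SelfAttentionPooling(nn.Module):
    """Squeeze a (batch, time, dim) sequence into (batch, dim) with learned weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.score(x), dim=1)   # (B, T, 1), attention over time
        return (weights * x).sum(dim=1)                 # (B, D)


class ModalAttentionFusion(nn.Module):
    """Fuse pooled acoustic and textual vectors with learned modality weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 2)               # one weight per modality

    def forward(self, acoustic: torch.Tensor, textual: torch.Tensor) -> torch.Tensor:
        alpha = torch.softmax(self.gate(torch.cat([acoustic, textual], dim=-1)), dim=-1)
        stacked = torch.stack([acoustic, textual], dim=1)          # (B, 2, D)
        return (alpha.unsqueeze(-1) * stacked).sum(dim=1)          # (B, D)


class SpeakerRoleClassifier(nn.Module):
    """Binary head over fused features; label order (controller=0, pilot=1) is assumed."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.pool_audio = SelfAttentionPooling(dim)
        self.pool_text = SelfAttentionPooling(dim)
        self.fusion = ModalAttentionFusion(dim)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, audio_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        fused = self.fusion(self.pool_audio(audio_feats), self.pool_text(text_feats))
        return self.head(fused)                                    # (B, 2) logits


if __name__ == "__main__":
    model = SpeakerRoleClassifier(dim=256)
    audio = torch.randn(4, 120, 256)   # stand-in for frame-level acoustic encoder outputs
    text = torch.randn(4, 20, 256)     # stand-in for token-level text encoder outputs
    print(model(audio, text).shape)    # torch.Size([4, 2])
```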


    Cited By

    • (2023) Speech Recognition for Air Traffic Control via Feature Learning and End-to-End Training. IEICE Transactions on Information and Systems, E106.D(4), 538–544. DOI: 10.1587/transinf.2022EDP7151. Online publication date: 1 April 2023.
    • (2023) M2ATS: A Real-world Multimodal Air Traffic Situation Benchmark Dataset and Beyond. Proceedings of the 31st ACM International Conference on Multimedia, 213–221. DOI: 10.1145/3581783.3613759. Online publication date: 26 October 2023.
    • (2023) Boosting Low-Resource Speech Recognition in Air Traffic Communication via Pretrained Feature Aggregation and Multi-Task Learning. IEEE Transactions on Circuits and Systems II: Express Briefs, 70(9), 3714–3718. DOI: 10.1109/TCSII.2023.3269051. Online publication date: October 2023.


      Published In

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 4
      April 2023, 682 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/3588902

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 24 March 2023
      Online AM: 24 November 2022
      Accepted: 18 November 2022
      Revised: 17 September 2022
      Received: 21 September 2021
      Published in TALLIP Volume 22, Issue 4


      Author Tags

      1. Speaker role identification
      2. air traffic control
      3. text classification
      4. speech classification
      5. spoken instruction understanding
      6. multi-modal learning

      Qualifiers

      • Research-article

      Funding Sources

      • National Natural Science Foundation of China
      • Open Fund of Key Laboratory of Flight Techniques and Flight Safety, Civil Aviation Administration of China (CAAC)
      • Fundamental Research Funds for the Central Universities

      Article Metrics

      • Downloads (Last 12 months): 93
      • Downloads (Last 6 weeks): 8
      Reflects downloads up to 26 Jul 2024

