research-article

Research on Chinese Audio and Text Alignment Algorithm Based on AIC-FCM and Doc2Vec

Authors:

Jianming Huang,

Weizheng Ren

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 3

Article No.: 71, Pages 1 - 22

https://doi.org/10.1145/3532852

Published: 02 April 2023

Abstract

The audiobook is a multimedia reading technology that has emerged in recent years, and aligning e-book text with the corresponding book audio is the most important step in producing one. This article describes an audio and text alignment algorithm that uses deep learning and neural network technology to improve the efficiency and quality of audiobook production. The algorithm first applies dual-threshold endpoint detection to segment long audio into sentence-level short audio clips, each of which is then recognized as short text. The detection thresholds are calculated by AIC-FCM, optimized with a simulated annealing genetic algorithm. Next, the algorithm calculates text similarity using Doc2vec, optimized by a threshold prediction method based on the average length of the short text. Finally, it proofreads and outputs the text sequence and audio segments aligned in the time dimension to meet the needs of audiobook production. Experiments show that, compared with traditional audio and text alignment algorithms, the proposed algorithm comes closer to the ideal segmentation result on long audio, achieves essentially the same alignment quality as Doc2vec, and reduces time complexity by about 35%.
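To make the segmentation step concrete, the following is a minimal sketch of dual-threshold endpoint detection over short-time energy. It is not the authors' implementation: the function names, the energy-only feature, the non-overlapping framing, and the hand-picked thresholds are all assumptions made for illustration. In the article, the thresholds are computed by AIC-FCM rather than fixed by hand.

```python
from typing import List, Tuple


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Short-time energy of consecutive, non-overlapping frames (assumed framing)."""
    return [
        sum(x * x for x in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def dual_threshold_segments(
    energies: List[float], high: float, low: float
) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame_exclusive) pairs for detected speech.

    A segment is seeded where the energy exceeds `high`, then extended
    backward and forward for as long as the energy stays above `low`.
    Both thresholds are hypothetical inputs here.
    """
    if low > high:
        raise ValueError("low threshold must not exceed high threshold")
    segments: List[Tuple[int, int]] = []
    i, n = 0, len(energies)
    while i < n:
        if energies[i] > high:
            # Walk backward to the first frame of the run above `low`.
            start = i
            while start > 0 and energies[start - 1] > low:
                start -= 1
            # Walk forward to the first frame at or below `low`.
            end = i + 1
            while end < n and energies[end] > low:
                end += 1
            segments.append((start, end))
            i = end
        else:
            i += 1
    return segments
```

Each returned pair marks a candidate sentence-level clip in frame units. Multiplying by the frame length converts it to sample offsets, which could then be passed to a speech recognizer to obtain the short text.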

Cited By

P. Vats, N. Sharma, and D. Sharma. 2023. HKG: A Novel Approach for Low Resource Indic Languages to Automatic Knowledge Graph Construction. ACM Transactions on Asian and Low-Resource Language Information Processing. Online publication date: 2-Aug-2023. https://doi.org/10.1145/3611306
G. Ahmed and A. Lawaye. 2023. CNN-based speech segments endpoints detection framework using short-time signal energy features. International Journal of Information Technology 15, 8 (2023), 4179–4191. Online publication date: 10-Sep-2023. https://doi.org/10.1007/s41870-023-01466-6

Index Terms

Research on Chinese Audio and Text Alignment Algorithm Based on AIC-FCM and Doc2Vec
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Redundancy
  2. Embedded and cyber-physical systems
    1. Embedded systems
    2. Robotics
2. Networks
  1. Network properties
    1. Network reliability

Information & Contributors

Information

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 22, Issue 3

March 2023

570 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3579816

Editor:
Imed Zitouni
Google, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 April 2023

Online AM: 18 July 2022

Accepted: 06 December 2021

Revised: 22 November 2021

Received: 04 August 2021

Published in TALLIP Volume 22, Issue 3
