research-article

Research on Chinese Audio and Text Alignment Algorithm Based on AIC-FCM and Doc2Vec

Published: 02 April 2023

Abstract

The "audiobook" is a multimedia-based reading technology that has emerged in recent years. Aligning e-book text with book audio is the most important part of audiobook processing. This article describes an audio and text alignment algorithm that uses deep learning and neural network technology to improve the efficiency and quality of audiobook production. The algorithm first uses dual-threshold endpoint detection to segment long audio into sentence-level short audio clips, each of which is recognized as short text. The threshold is calculated by AIC-FCM optimized with a simulated annealing genetic algorithm. The algorithm then calculates text similarity using Doc2vec, optimized by a threshold prediction method based on the average length of the short texts. Finally, it proofreads and outputs the text sequence and audio segments aligned in the time dimension to meet the needs of audiobook production. Experiments show that, compared with traditional audio and text alignment algorithms, the proposed algorithm comes closer to the ideal segmentation result on long audio; its alignment quality is essentially the same as that of Doc2vec, while the time complexity is reduced by about 35%.
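
As a rough illustration of the segmentation step, the following is a minimal sketch of energy-based dual-threshold endpoint detection. It is not the authors' implementation: the function names, the non-overlapping framing, and the fixed `low` and `high` thresholds are all assumptions made for the example, whereas the article derives its threshold with AIC-FCM. A run of frames counts as speech only if it stays above the low threshold and reaches the high threshold at least once.

```python
from typing import List, Tuple


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Short-time energy of consecutive, non-overlapping frames."""
    return [
        sum(x * x for x in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def dual_threshold_segments(
    energies: List[float], low: float, high: float
) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame_exclusive) speech segments.

    A segment is a maximal run of frames with energy >= low that
    contains at least one frame with energy >= high.
    """
    segments = []
    start = None        # first frame of the current candidate run
    confirmed = False   # True once the run has reached the high threshold
    for i, e in enumerate(energies):
        if e >= low:
            if start is None:
                start, confirmed = i, False
            if e >= high:
                confirmed = True
        else:
            if start is not None and confirmed:
                segments.append((start, i))
            start = None
    if start is not None and confirmed:
        segments.append((start, len(energies)))
    return segments


# Toy signal (hypothetical data): silence, a loud burst with soft edges,
# silence, a faint noise, silence.
signal = ([0.0] * 40 + [0.2] * 10 + [1.0] * 30 + [0.2] * 10
          + [0.0] * 40 + [0.2] * 10 + [0.0] * 20)
energies = frame_energies(signal, frame_len=10)
# Thresholds are fixed by hand here; they are assumptions for illustration.
segments = dual_threshold_segments(energies, low=0.1, high=5.0)
print(segments)  # prints [(4, 9)]
```

The faint noise at frame 13 exceeds the low threshold but never reaches the high one, so it is discarded, while the soft edges of the loud burst are kept as part of the segment. Practical detectors often combine energy with the zero-crossing rate; that is omitted here for brevity.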

Cited By

  • (2023) HKG: A Novel Approach for Low Resource Indic Languages to Automatic Knowledge Graph Construction. ACM Transactions on Asian and Low-Resource Language Information Processing. DOI: 10.1145/3611306. Online publication date: 2-Aug-2023
  • (2023) CNN-based speech segments endpoints detection framework using short-time signal energy features. International Journal of Information Technology 15:8 (4179-4191). DOI: 10.1007/s41870-023-01466-6. Online publication date: 10-Sep-2023

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 3
March 2023
570 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/3579816

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 April 2023
Online AM: 18 July 2022
Accepted: 06 December 2021
Revised: 22 November 2021
Received: 04 August 2021
Published in TALLIP Volume 22, Issue 3

Author Tags

  1. Audio and text alignment
  2. fuzzy C-means clustering algorithm
  3. Akaike information criterion
  4. Doc2vec
  5. dual threshold endpoint detection

Qualifiers

  • Research-article
