DOI: 10.1145/3330482.3330508
research-article

Text-Independent Speaker ID for Automatic Video Lecture Classification Using Deep Learning

Published: 19 April 2019

Abstract

This paper proposes deep neural network (DNN) and convolutional neural network (CNN) models that use acoustic features to classify video lectures in a massive open online course (MOOC). The models exploit the lecturer's voice pattern to identify the speaker and assign each video lecture to the correct speaker category. Filter bank and Mel-frequency cepstral coefficient (MFCC) features, along with their first- and second-order derivatives (Δ/ΔΔ), serve as input to the proposed models. These features are extracted from the speech signal, which is obtained by separating the audio track from each video lecture using FFmpeg.
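
The front end described above (audio separation, then Δ/ΔΔ features) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the FFmpeg settings shown (mono, 16 kHz, 16-bit PCM) are a common choice for speech work but are not taken from the paper, and the helper names `ffmpeg_extract_audio_cmd`, `deltas`, and `stack_with_deltas` are hypothetical. The deltas use the standard regression formula over ±n frames (n = 2 is assumed); computing the MFCC or filter bank frames themselves is left to a feature extraction library.

```python
from typing import List, Sequence


def ffmpeg_extract_audio_cmd(video_path: str, wav_path: str,
                             sample_rate: int = 16000) -> List[str]:
    """Build an FFmpeg command that drops the video stream and writes mono PCM WAV.

    Mono, 16 kHz, 16-bit PCM is an assumed setting, not one stated in the paper.
    """
    return [
        "ffmpeg", "-y",            # overwrite the output file if it exists
        "-i", video_path,          # input lecture video
        "-vn",                     # disable (drop) the video stream
        "-ac", "1",                # downmix to one audio channel
        "-ar", str(sample_rate),   # resample to the target rate
        "-acodec", "pcm_s16le",    # 16-bit little-endian PCM
        wav_path,
    ]


def deltas(frames: Sequence[Sequence[float]], n: int = 2) -> List[List[float]]:
    """Regression deltas: d_t = sum_k k*(c[t+k] - c[t-k]) / (2 * sum_k k^2), k = 1..n.

    Indices past either end are clamped, i.e. edge frames are replicated.
    """
    num_frames = len(frames)
    denom = 2 * sum(k * k for k in range(1, n + 1))
    out: List[List[float]] = []
    for t in range(num_frames):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, n + 1):
                ahead = frames[min(t + k, num_frames - 1)][d]
                behind = frames[max(t - k, 0)][d]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def stack_with_deltas(frames: Sequence[Sequence[float]], n: int = 2) -> List[List[float]]:
    """Concatenate static features with their first (Δ) and second (ΔΔ) derivatives."""
    d1 = deltas(frames, n)
    d2 = deltas(d1, n)  # ΔΔ is the delta of the deltas
    return [list(c) + a + b for c, a, b in zip(frames, d1, d2)]
```

To actually run the extraction, pass the returned list to `subprocess.run(cmd, check=True)`. Applying `deltas` twice yields the second-order (ΔΔ) coefficients, so `stack_with_deltas` triples the feature dimension per frame (for example, 13 MFCCs become 39 values).
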
The deep learning models are evaluated using precision, recall, and F1 score, and the accuracy obtained with each acoustic feature is compared with that of traditional machine learning classifiers for speaker identification. With MFCC features, the 2D-CNN achieves a significant improvement in classification accuracy: 3% to 7% over the DNN, and twice that of the shallow machine learning classifiers. With an F1 score of 85.71% for text-independent speaker identification, the proposed 2D-CNN model makes it plausible to use speaker ID as a classification approach for automatically organizing video lectures in a MOOC setting.
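
For reference, precision, recall, and F1 for a multi-class speaker ID task can be computed per speaker from true positives, false positives, and false negatives, then averaged. The sketch below assumes macro averaging (an unweighted mean over speakers); the abstract does not say which averaging the authors used, and the function names `per_class_prf`, `macro_f1`, and `accuracy` are hypothetical. Libraries such as scikit-learn provide equivalent routines.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def per_class_prf(y_true: Sequence[str],
                  y_pred: Sequence[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return {speaker: (precision, recall, F1)} computed from TP/FP/FN counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1   # predicted speaker wrongly claimed this clip
            fn[truth] += 1   # true speaker's clip was missed
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[c] = (precision, recall, f1)
    return scores


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-speaker F1 scores (assumed averaging scheme)."""
    scores = per_class_prf(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of clips assigned to the correct speaker."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

For example, with true labels `["A", "A", "B", "B"]` and predictions `["A", "B", "B", "B"]`, speaker A gets precision 1.0 and recall 0.5, speaker B gets F1 0.8, and accuracy is 0.75.
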

Cited By

  • (2019) Text-Independent Speaker ID Employing 2D-CNN for Automatic Video Lecture Categorization in a MOOC Setting. 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), pp. 273-277. DOI: 10.1109/ICTAI.2019.00046. Online publication date: Nov-2019.


Published In

ICCAI '19: Proceedings of the 2019 5th International Conference on Computing and Artificial Intelligence
April 2019
267 pages
ISBN:9781450361064
DOI:10.1145/3330482
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. 2D-CNN
  2. DNN
  3. MFCC Filter banks
  4. MOOC
  5. Speaker identification
  6. deep learning
  7. video classification

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICCAI '19

