research-article

Text-Independent Speaker ID for Automatic Video Lecture Classification Using Deep Learning

Authors:

Ali Shariq Imran,

Zenun Kastrati,

Torbjørn Karl Svendsen,

Arianit Kurti

ICCAI '19: Proceedings of the 2019 5th International Conference on Computing and Artificial Intelligence

Pages 175 - 180

https://doi.org/10.1145/3330482.3330508

Published: 19 April 2019

Abstract

This paper proposes using acoustic features with deep neural network (DNN) and convolutional neural network (CNN) models to classify video lectures in a massive open online course (MOOC). The models exploit the lecturer's voice pattern to identify the speaker and assign each video lecture to the correct speaker category. Filter bank and Mel-frequency cepstral coefficient (MFCC) features, along with their first- and second-order derivatives (Δ/ΔΔ), are used as input to the proposed models. These features are extracted from the speech signal, which is obtained by separating the audio from the video lectures using FFmpeg.
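The front end described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the 16 kHz mono output, the regression window of 2 frames, and the edge padding are all assumptions, since the abstract does not specify them. The per-frame filter bank or MFCC vectors are assumed to come from a separate feature extractor that is not shown.

```python
from typing import List, Sequence


def ffmpeg_audio_cmd(video_path: str, wav_path: str,
                     sample_rate: int = 16000) -> List[str]:
    """Build an FFmpeg argument list that separates audio from a video.

    The sample rate and mono down-mix are assumed values, not from the paper.
    """
    return [
        "ffmpeg", "-y",            # overwrite the output file if it exists
        "-i", video_path,          # input video lecture
        "-vn",                     # drop the video stream
        "-ac", "1",                # one audio channel (mono)
        "-ar", str(sample_rate),   # audio sample rate in Hz
        "-acodec", "pcm_s16le",    # 16-bit PCM WAV
        wav_path,
    ]


def delta(frames: Sequence[Sequence[float]], n: int = 2) -> List[List[float]]:
    """First-order derivative by linear regression over +/- n frames.

    Frames beyond either end are replaced by the nearest edge frame.
    """
    if not frames:
        return []
    denom = 2 * sum(k * k for k in range(1, n + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[0])):
            acc = 0.0
            for k in range(1, n + 1):
                ahead = frames[min(t + k, last)][d]
                behind = frames[max(t - k, 0)][d]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def add_deltas(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Append delta and delta-delta coefficients to every static frame."""
    d1 = delta(frames)
    d2 = delta(d1)
    return [list(f) + a + b for f, a, b in zip(frames, d1, d2)]
```

The command list can be passed to `subprocess.run(cmd, check=True)` on a machine where FFmpeg is installed; `add_deltas` triples the per-frame dimensionality, which matches the static plus Δ plus ΔΔ input layout the abstract describes.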

The deep learning models are evaluated using precision, recall, and F1 score, and the accuracy obtained with both acoustic features is compared against traditional machine learning classifiers for speaker identification. The 2D-CNN with MFCC improves classification accuracy by 3% to 7% over the DNN, and by twice that over shallow machine learning classifiers. With an F1 score of 85.71% for text-independent speaker identification, the proposed 2D-CNN model makes it plausible to use speaker ID as a classification approach for organizing video lectures automatically in a MOOC setting.
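For readers unfamiliar with the evaluation metrics named above, the following sketch computes per-speaker precision, recall, and F1 from true and predicted labels. The function names and the macro averaging are assumptions for illustration; the abstract does not state which averaging scheme was used, and the labels in the usage note are made up.

```python
from typing import Dict, Hashable, Sequence, Tuple


def per_class_prf(y_true: Sequence[Hashable],
                  y_pred: Sequence[Hashable]
                  ) -> Dict[Hashable, Tuple[float, float, float]]:
    """Return {label: (precision, recall, f1)} for every label seen."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = list(dict.fromkeys(list(y_true) + list(y_pred)))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = per_class_prf(y_true, y_pred)
    if not scores:
        return 0.0
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)
```

For example, with true speakers `["a", "a", "b", "b"]` and predictions `["a", "b", "b", "b"]`, speaker `a` gets precision 1.0 and recall 0.5, while speaker `b` gets precision 2/3 and recall 1.0.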

Cited By

A. Imran, Z. Kastrati, T. Svendsen, and A. Kurti, "Text-Independent Speaker ID Employing 2D-CNN for Automatic Video Lecture Categorization in a MOOC Setting," in 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), pp. 273--277, 2019. Online publication date: Nov. 2019. https://doi.org/10.1109/ICTAI.2019.00046

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
2. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Multimedia and multimodal retrieval
        Speech / audio search

Published In

ICCAI '19: Proceedings of the 2019 5th International Conference on Computing and Artificial Intelligence

April 2019

267 pages

ISBN:9781450361064

DOI:10.1145/3330482

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICCAI '19

ICCAI '19: 2019 5th International Conference on Computing and Artificial Intelligence

April 19 - 22, 2019

Bali, Indonesia
