Learning Spatio-Temporal Features with Two-Stream Deep 3D CNNs for Lipreading

Weng, Xinshuo; Kitani, Kris

arXiv:1905.02540 (cs)

[Submitted on 4 May 2019 (v1), last revised 19 Jul 2019 (this version, v2)]

Abstract: We focus on word-level visual lipreading, which requires recognizing the word being spoken given only the video and not the audio. State-of-the-art methods use end-to-end neural networks that pair a front-end for visual feature extraction, consisting of a shallow (up to three layers) 3D convolutional neural network (CNN) followed by a deep 2D CNN (e.g., ResNet), with a recurrent neural network (e.g., bidirectional LSTM) as the back-end for classification. In this work, we propose to replace the shallow 3D CNN + deep 2D CNN front-end with a recently successful deep 3D CNN: two-stream (i.e., grayscale video and optical flow streams) I3D. We evaluate different combinations of front-end and back-end modules with grayscale video and optical flow inputs on the LRW dataset. The experiments show that, compared to the shallow 3D CNN + deep 2D CNN front-end, the deep 3D CNN front-end pre-trained on large-scale image and video datasets (e.g., ImageNet and Kinetics) improves classification accuracy. We also demonstrate that using the optical flow input alone achieves performance comparable to using the grayscale video as input. Moreover, the two-stream network using both the grayscale video and optical flow inputs further improves performance. Overall, our two-stream I3D front-end with a Bi-LSTM back-end yields an absolute improvement of 5.3% over the previous state of the art on the LRW dataset.
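
The two-stream idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the grayscale and optical flow streams are combined, so the example below assumes one common option, late fusion by a weighted average of per-class probabilities; it is not the authors' implementation. The function names, the `gray_weight` parameter, the logits, and the three-word vocabulary are all hypothetical placeholders.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_stream(gray_logits, flow_logits, gray_weight=0.5):
    """Late fusion (an assumption, not the paper's stated method):
    weighted average of the class probabilities from each stream."""
    if len(gray_logits) != len(flow_logits):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= gray_weight <= 1.0:
        raise ValueError("gray_weight must lie in [0, 1]")
    p_gray = softmax(gray_logits)
    p_flow = softmax(flow_logits)
    return [
        gray_weight * g + (1.0 - gray_weight) * f
        for g, f in zip(p_gray, p_flow)
    ]


def predict_word(gray_logits, flow_logits, vocabulary, gray_weight=0.5):
    """Return the vocabulary entry with the highest fused probability."""
    probs = fuse_two_stream(gray_logits, flow_logits, gray_weight)
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocabulary[best]


# Hypothetical scores for a three-word vocabulary (placeholders only).
vocabulary = ["about", "above", "absolutely"]
gray_logits = [2.0, 1.0, 0.0]   # grayscale stream favours "about"
flow_logits = [0.0, 3.0, 0.0]   # optical flow stream favours "above"

fused = fuse_two_stream(gray_logits, flow_logits)
word = predict_word(gray_logits, flow_logits, vocabulary)
```

In this toy input the two streams disagree, and the fused prediction follows the more confident one. Setting `gray_weight` to 1.0 or 0.0 reduces the sketch to a single-stream classifier, which mirrors the single-input comparisons the abstract mentions.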

Comments:	Camera-ready version for BMVC 2019
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:1905.02540 [cs.CV]
	(or arXiv:1905.02540v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1905.02540

Submission history

From: Xinshuo Weng
[v1] Sat, 4 May 2019 02:32:06 UTC (1,378 KB)
[v2] Fri, 19 Jul 2019 03:19:21 UTC (1,973 KB)
