
Online Early-Late Fusion Based on Adaptive HMM for Sign Language Recognition

Published: 20 December 2017

Abstract

In sign language recognition (SLR) with multimodal data, a sign word can be represented by multiple features, among which there exist both an intrinsic property and a mutually complementary relationship. To fully exploit those relationships, we propose an online early-late fusion method based on an adaptive Hidden Markov Model (HMM). Regarding the intrinsic property, we observe that the inherent latent change states of each sign are related not only to the number of key gestures and body poses but also to their transition relationships. We propose an adaptive HMM method that obtains the hidden state number of each sign by affinity propagation clustering. For the complementary relationship, we propose an online early-late fusion scheme. The early fusion (feature fusion) is dedicated to preserving useful information to achieve a better complementary score, while the late fusion (score fusion) uncovers the significance of those features and aggregates them in a weighted manner. Unlike classical fusion methods, ours is query-adaptive: for each query, after feature selection (including the combined feature), the fusion weight is set inversely proportional to the area under the curve of the normalized query score list for each selected feature. The whole fusion process is both effective and efficient. Experiments verify the effectiveness of the method on signer-independent SLR with a large vocabulary, and it demonstrates consistent and promising performance across different dataset sizes and in comparison with different SLR models.


Published In

ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 14, Issue 1
February 2018
287 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3173554

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 December 2017
Accepted: 01 October 2017
Revised: 01 October 2017
Received: 01 January 2017
Published in TOMM Volume 14, Issue 1


Author Tags

  1. HMM
  2. Sign language recognition
  3. multi-modal feature fusion
  4. online algorithm
  5. query-adaptive

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • NSFC

Article Metrics

  • Downloads (last 12 months): 40
  • Downloads (last 6 weeks): 4
Reflects downloads up to 04 Oct 2024

Cited By

  • (2024) StepNet: Spatial-temporal Part-aware Network for Isolated Sign Language Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 7 (2024), 1–19. DOI: 10.1145/3656046
  • (2024) Modality-collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 5 (2024), 1–23. DOI: 10.1145/3640343
  • (2024) Information Aggregate and Sentiment Enhance Network to Handle Missing Modalities for Multimodal Sentiment Analysis. In IEEE International Conference on Multimedia and Expo (ICME). 1–6. DOI: 10.1109/ICME57554.2024.10687981
  • (2023) Isolated Word Sign Language Recognition Based on Improved SKResNet-TCN Network. Journal of Sensors 2023 (2023), 1–10. DOI: 10.1155/2023/9503961
  • (2023) Transformer-Based Visual Grounding with Cross-Modality Interaction. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 6 (2023), 1–19. DOI: 10.1145/3587251
  • (2023) Robust Multimodal Sentiment Analysis via Tag Encoding of Uncertain Missing Modalities. IEEE Transactions on Multimedia 25 (2023), 6301–6314. DOI: 10.1109/TMM.2022.3207572
  • (2023) Exploring Semantic Relations for Social Media Sentiment Analysis. IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023), 2382–2394. DOI: 10.1109/TASLP.2023.3285238
  • (2023) Design and Development of a Gesture Recording System for Pakistan Sign Language. In International Conference on Digital Futures and Transformative Technologies (ICoDT2). 1–6. DOI: 10.1109/ICoDT259378.2023.10325814
  • (2023) SignNet: A Deep Learning Architecture for Accurate Sign Language Recognition from Images. In International Conference on Computing Communication and Networking Technologies (ICCCNT). 1–6. DOI: 10.1109/ICCCNT56998.2023.10306591
  • (2023) Early vs. Late Multimodal Fusion for Recognizing Confusion in Collaborative Tasks. In International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW). 1–4. DOI: 10.1109/ACIIW59127.2023.10388144
