CovLets: A Second-Order Descriptor for Modeling Multiple Features

Authors:

Junkai Huang

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 16, Issue 1s

Article No.: 21, Pages 1 - 14

https://doi.org/10.1145/3357525

Published: 17 April 2020

Abstract

State-of-the-art techniques for image and video classification take a bottom-up approach in which local features are aggregated into a final global representation. Existing frameworks (e.g., bag of words or Fisher vectors) are specifically designed to aggregate vector-valued features such as SIFT descriptors. In this article, we propose a technique to aggregate local descriptors in the form of covariance descriptors (CovDs) into a rich descriptor, which in essence benefits from second-order statistics along the coding pipeline. The difficulty in aggregating CovDs arises from the fact that CovDs lie on the Riemannian manifold of symmetric positive definite (SPD) matrices. Therefore, the aggregation scheme must take into account the metrics and geometry of the SPD manifold. In our proposal, we make use of the Stein divergence and the Nyström method to embed the SPD manifold into a Hilbert space. We compare our proposal, dubbed CovLets, against state-of-the-art methods on several image and video classification problems, including facial expression recognition and action recognition.
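
As a rough illustration of the ingredients named in the abstract, the following Python sketch computes a covariance descriptor, the Stein divergence S(X, Y) = log det((X + Y)/2) - (1/2) log det(XY), a kernel of the form exp(-beta * S), and a Nyström embedding built from a set of landmark matrices. It is a minimal sketch under stated assumptions, not the article's implementation: the function names, the regularizer eps, the choice of landmarks, and the value of beta are all hypothetical, and the abstract does not specify how CovLets combines these steps.

```python
import numpy as np

def covariance_descriptor(features, eps=1e-6):
    """Map an (n, d) array of local feature vectors to a (d, d) SPD matrix.
    eps is an assumed ridge term that keeps the result strictly positive definite."""
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(features.shape[0] - 1, 1)
    return cov + eps * np.eye(features.shape[1])

def stein_divergence(X, Y):
    """S(X, Y) = log det((X + Y) / 2) - 0.5 * (log det X + log det Y)."""
    _, logdet_mid = np.linalg.slogdet(0.5 * (X + Y))
    _, logdet_x = np.linalg.slogdet(X)
    _, logdet_y = np.linalg.slogdet(Y)
    return logdet_mid - 0.5 * (logdet_x + logdet_y)

def stein_kernel(A, B, beta):
    """Kernel matrix with entries exp(-beta * S(a, b)) for a in A, b in B.
    beta is an assumed hyperparameter; the kernel is positive definite
    only for certain values of beta."""
    return np.array([[np.exp(-beta * stein_divergence(a, b)) for b in B]
                     for a in A])

def nystrom_fit(landmarks, beta, tol=1e-10):
    """Return the projection V * Lambda^(-1/2) from the landmark kernel matrix."""
    K = stein_kernel(landmarks, landmarks, beta)
    eigvals, eigvecs = np.linalg.eigh(K)
    keep = eigvals > tol  # drop numerically null directions
    return eigvecs[:, keep] / np.sqrt(eigvals[keep])

def nystrom_embed(spd_matrices, landmarks, projection, beta):
    """Embed SPD matrices as rows of a Euclidean feature matrix."""
    return stein_kernel(spd_matrices, landmarks, beta) @ projection

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical data: 5 regions, each with 50 three-dimensional local features.
    landmarks = [covariance_descriptor(rng.normal(size=(50, 3))) for _ in range(5)]
    projection = nystrom_fit(landmarks, beta=1.0)
    query = [covariance_descriptor(rng.normal(size=(50, 3)))]
    print(nystrom_embed(query, landmarks, projection, beta=1.0).shape)
```

Inner products between the embedded landmark vectors reproduce the landmark kernel matrix, so standard vector-based aggregation or classification can, in principle, be applied to the embedded descriptors.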

Index Terms

CovLets: A Second-Order Descriptor for Modeling Multiple Features
1. Computing methodologies
  1. Machine learning
    1. Machine learning algorithms
      1. Feature selection

Information & Contributors

Information

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 16, Issue 1s

Special Issue on Multimodal Machine Learning for Human Behavior Analysis and Special Issue on Computational Intelligence for Biomedical Data and Imaging

January 2020

376 pages

ISSN:1551-6857

EISSN:1551-6865

DOI:10.1145/3388236

Editor:
Alberto Del Bimbo
University of Firenze, Italy

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 April 2020

Accepted: 01 August 2019

Revised: 01 July 2019

Received: 01 April 2019

Published in TOMM Volume 16, Issue 1s

Qualifiers

Research-article
Research
Refereed

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 15
Total Downloads: 161
Downloads (Last 12 months): 6
Downloads (Last 6 weeks): 1

Reflects downloads up to 04 Oct 2024

Citations

Cited By

Ha S, Nguyen T, Phan H, Ha P (2024). Real-Time Change Detection with Convolutional Density Approximation. Vietnam Journal of Computer Science, 1-36. Online publication date: 2-Apr-2024.
https://doi.org/10.1142/S219688882350015X
GhoshRoy D, Alvi P, Santosh K (2024). Leveraging Sampling Schemes on Skewed Class Distribution to Enhance Male Fertility Detection with Ensemble AI Learners. International Journal of Pattern Recognition and Artificial Intelligence 38:02. Online publication date: 7-Mar-2024.
https://doi.org/10.1142/S0218001424510030
Liao W, Shi G, Lv Y, Liu L, Tang X, Jin Y, Ning Z, Zhao X, Li X, Chen Z (2024). Accurate and robust segmentation of cerebral vasculature on four-dimensional arterial spin labeling magnetic resonance angiography using machine-learning approach. Magnetic Resonance Imaging 110, 86-95. Online publication date: Jul-2024.
https://doi.org/10.1016/j.mri.2024.04.022
Yang F, Qiao Y, Hajek P, Abedin M (2024). Enhancing cardiovascular risk assessment with advanced data balancing and domain knowledge-driven explainability. Expert Systems with Applications 255, 124886. Online publication date: Dec-2024.
https://doi.org/10.1016/j.eswa.2024.124886
Salmi M, Atif D, Oliva D, Abraham A, Ventura S (2024). Handling imbalanced medical datasets: review of a decade of research. Artificial Intelligence Review 57:10. Online publication date: 2-Sep-2024.
https://doi.org/10.1007/s10462-024-10884-2
Li B, Zhang Y, Zhang C, Piao X, Yin B (2023). Hypergraph Association Weakly Supervised Crowd Counting. ACM Transactions on Multimedia Computing, Communications, and Applications 19:6, 1-20. Online publication date: 12-Jul-2023.
https://dl.acm.org/doi/10.1145/3594670
Gao L, Guan L (2023). A Discriminant Information Theoretic Learning Framework for Multi-modal Feature Representation. ACM Transactions on Intelligent Systems and Technology 14:3, 1-24. Online publication date: 13-Apr-2023.
https://dl.acm.org/doi/10.1145/3587253
Jaiswal R, Dubey R (2023). CAQoE: A Novel No-Reference Context-aware Speech Quality Prediction Metric. ACM Transactions on Multimedia Computing, Communications, and Applications 19:1s, 1-23. Online publication date: 3-Feb-2023.
https://dl.acm.org/doi/10.1145/3529394
Gao L, Guan L (2023). A Deep Discriminant Fractional-order Canonical Correlation Analysis For Information Fusion. 2023 10th IEEE Swiss Conference on Data Science (SDS), 58-65. Online publication date: Jun-2023.
https://doi.org/10.1109/SDS57534.2023.00015
Qayyum A, Razzak I, Tanveer M, Mazher M (2022). Spontaneous Facial Behavior Analysis Using Deep Transformer-based Framework for Child–computer Interaction. ACM Transactions on Multimedia Computing, Communications, and Applications 20:2, 1-17. Online publication date: 26-May-2022.
https://dl.acm.org/doi/10.1145/3539577
