research-article

CovLets: A Second-Order Descriptor for Modeling Multiple Features

Published: 17 April 2020

Abstract

State-of-the-art techniques for image and video classification take a bottom-up approach in which local features are aggregated into a global final representation. Existing frameworks (e.g., bag of words or Fisher vectors) are specifically designed to aggregate vector-valued features such as SIFT descriptors. In this article, we propose a technique to aggregate local descriptors in the form of covariance descriptors (CovDs) into a rich descriptor, which in essence benefits from second-order statistics along the coding pipeline. The difficulty in aggregating CovDs arises from the fact that CovDs lie on the Riemannian manifold of symmetric positive definite (SPD) matrices. Therefore, the aggregation scheme must take into account the metrics and geometry of the SPD manifold. In our proposal, we make use of the Stein divergence and the Nyström method to embed the SPD manifold into a Hilbert space. We compare our proposal, dubbed CovLets, against state-of-the-art methods on several image and video classification problems, including facial expression recognition and action recognition.
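
As a rough illustration of the embedding step described above, the following Python sketch computes covariance descriptors, measures their dissimilarity with the Stein divergence, and uses a Nyström approximation to map each SPD matrix to an ordinary vector. It is a minimal sketch, not the authors' implementation: the function names, the kernel form exp(-beta * S), the value of beta, the choice of landmarks, and the random input features are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def covariance_descriptor(features, eps=1e-6):
    """Covariance of an (n_samples, d) feature array, regularized so it is SPD."""
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(features.shape[0] - 1, 1)
    return cov + eps * np.eye(features.shape[1])

def stein_divergence(x, y):
    """S(X, Y) = log det((X + Y) / 2) - 0.5 * log det(X Y) for SPD X, Y."""
    _, logdet_mid = np.linalg.slogdet((x + y) / 2.0)
    _, logdet_x = np.linalg.slogdet(x)
    _, logdet_y = np.linalg.slogdet(y)
    return logdet_mid - 0.5 * (logdet_x + logdet_y)

def stein_kernel(a, b, beta=1.0):
    """Kernel matrix with entries exp(-beta * S(a_i, b_j)) (assumed kernel form)."""
    return np.array([[np.exp(-beta * stein_divergence(x, y)) for y in b] for x in a])

def nystrom_embedding(landmarks, beta=1.0, tol=1e-10):
    """Return a function mapping SPD matrices to vectors via a Nystrom approximation."""
    k_mm = stein_kernel(landmarks, landmarks, beta)
    eigvals, eigvecs = np.linalg.eigh(k_mm)
    keep = eigvals > tol  # discard numerically zero or negative directions
    projection = eigvecs[:, keep] / np.sqrt(eigvals[keep])

    def embed(mats):
        # Kernel values against the landmarks, projected onto the kept eigenvectors.
        return stein_kernel(mats, landmarks, beta) @ projection

    return embed

# Hypothetical data: six regions, each with 50 samples of 3-dimensional features.
rng = np.random.default_rng(0)
regions = [rng.normal(size=(50, 3)) for _ in range(6)]
covds = [covariance_descriptor(r) for r in regions]

# Use the first four descriptors as landmarks (an arbitrary choice for this sketch).
embed = nystrom_embedding(covds[:4], beta=1.0)
vectors = embed(covds)  # one row per descriptor, at most 4 columns
```

Once the descriptors are embedded as rows of `vectors`, vector-based aggregation schemes such as bag of words can be applied to them. Note that beta has to be chosen so that the resulting kernel is positive definite; the value used here is only a placeholder for 3 × 3 matrices.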

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 1s
    Special Issue on Multimodal Machine Learning for Human Behavior Analysis and Special Issue on Computational Intelligence for Biomedical Data and Imaging
    January 2020
    376 pages
    ISSN: 1551-6857
    EISSN: 1551-6865
    DOI: 10.1145/3388236

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 April 2020
    Accepted: 01 August 2019
    Revised: 01 July 2019
    Received: 01 April 2019
    Published in TOMM Volume 16, Issue 1s

    Author Tags

    1. Action recognition
    2. Riemannian manifold
    3. covariance descriptor
    4. reproducing kernel Hilbert space

    Qualifiers

    • Research-article
    • Research
    • Refereed
