Video classification with Densely extracted HOG/HOF/MBH features: an evaluation of the accuracy/computational efficiency trade-off

Uijlings, J.; Duta, I. C.; Sangineto, E.; Sebe, Nicu

doi:10.1007/s13735-014-0069-5

Regular Paper
Published: 28 September 2014

Volume 4, pages 33–44, (2015)

International Journal of Multimedia Information Retrieval

J. Uijlings¹,
I. C. Duta²,
E. Sangineto² &
…
Nicu Sebe²

Abstract

The current state of the art in video classification is based on Bag-of-Words using local visual descriptors. Most commonly these are histogram of oriented gradients (HOG), histogram of optical flow (HOF) and motion boundary histograms (MBH) descriptors. While such an approach is very powerful for classification, it is also computationally expensive. This paper addresses the problem of computational efficiency. Specifically: (1) we propose several speed-ups for densely sampled HOG, HOF and MBH descriptors and release Matlab code; (2) we investigate the trade-off between accuracy and computational efficiency of descriptors in terms of frame sampling rate and type of optical flow method; (3) we investigate the trade-off between accuracy and computational efficiency for computing the feature vocabulary, using and comparing most of the commonly adopted vector quantization techniques: $k$-means, hierarchical $k$-means, Random Forests, Fisher Vectors and VLAD.
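
To make the difference between two of the encodings named above concrete, the following is a minimal Python sketch, not the authors' released Matlab code. It assumes a codebook has already been learned (for instance with $k$-means) and that local descriptors such as HOG, HOF or MBH are given as rows of an array. The function names `assign_words`, `bow_histogram` and `vlad`, and the toy data, are illustrative assumptions. Bag-of-Words counts how many descriptors fall on each visual word, while VLAD accumulates the residuals between descriptors and their nearest word.

```python
import numpy as np


def assign_words(descriptors, codebook):
    """Return the index of the nearest codebook entry for every descriptor."""
    # Squared Euclidean distance between each descriptor and each visual word.
    diff = descriptors[:, None, :] - codebook[None, :, :]
    return (diff ** 2).sum(axis=2).argmin(axis=1)


def bow_histogram(descriptors, codebook):
    """Bag-of-Words: L1-normalised histogram of hard word assignments."""
    words = assign_words(descriptors, codebook)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)


def vlad(descriptors, codebook):
    """VLAD: per-word sum of residuals, flattened and L2-normalised."""
    words = assign_words(descriptors, codebook)
    k, d = codebook.shape
    v = np.zeros((k, d))
    for i in range(k):
        members = descriptors[words == i]
        if len(members):
            v[i] = (members - codebook[i]).sum(axis=0)
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


# Toy data (hypothetical): two visual words and four 2-D descriptors.
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
descriptors = np.array([[1.0, 0.0], [0.0, 1.0], [9.0, 10.0], [11.0, 12.0]])

bow = bow_histogram(descriptors, codebook)
vl = vlad(descriptors, codebook)
print(bow.shape, vl.shape)
```

With $K$ visual words and $D$-dimensional descriptors, the Bag-of-Words vector has length $K$ whereas the VLAD vector has length $K \cdot D$, which is one reason the encodings differ in both cost and representational power.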

Acknowledgments

This work was supported by the European 7th Framework Program, under grant xLiMe (FP7-611346) and by the FIRB project S-PATTERNS.

Author information

Authors and Affiliations

University of Edinburgh, Edinburgh, UK
J. Uijlings
DISI, University of Trento, Trento, Italy
I. C. Duta, E. Sangineto & Nicu Sebe

Corresponding author

Correspondence to Nicu Sebe.

About this article

Cite this article

Uijlings, J., Duta, I.C., Sangineto, E. et al. Video classification with Densely extracted HOG/HOF/MBH features: an evaluation of the accuracy/computational efficiency trade-off. Int J Multimed Info Retr 4, 33–44 (2015). https://doi.org/10.1007/s13735-014-0069-5

Received: 16 July 2014
Revised: 11 September 2014
Accepted: 11 September 2014
Published: 28 September 2014
Issue Date: March 2015
DOI: https://doi.org/10.1007/s13735-014-0069-5
