
Fisher Kernel Temporal Variation-based Relevance Feedback for video retrieval

Published: 01 February 2016

Abstract

Highlights

  • We propose a novel framework for Relevance Feedback based on the Fisher Kernel (FK).
  • The FK representation makes it possible to capture temporal variation by using frame-based features.
  • Experiments on a wide variety of scenarios and public datasets (genre classification on Blip10000, action recognition on UCF50/UCF101, and daily activities recognition on ADL) show that the proposed approach outperforms other state-of-the-art approaches.
  • The framework generalizes well: it does not depend on a particular type of content descriptor (experiments were carried out with text, visual, and audio features).

This paper proposes a novel framework for Relevance Feedback based on the Fisher Kernel (FK). Specifically, we train a Gaussian Mixture Model (GMM) on the top retrieval results (without supervision) and use it to create a FK representation, which is therefore specialized in modelling the most relevant examples. We use the FK representation to explicitly capture temporal variation in video via frame-based features taken at different time intervals. While the GMM is being trained, the user selects from the top examples those that match what they are looking for. This feedback is used to train a Support Vector Machine on the FK representation, which is then applied to re-rank the top retrieved results. We show that our approach outperforms other state-of-the-art relevance feedback methods. Experiments were carried out on the Blip10000, UCF50, UCF101 and ADL standard datasets using a broad range of multi-modal content descriptors (visual, audio, and text).
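The feedback loop described above can be sketched in a few lines. The following is a minimal illustration (not the authors' implementation), assuming frame-level descriptors have already been extracted for each of the top-ranked videos; it uses scikit-learn's GaussianMixture and SVC, and keeps only the gradients with respect to the GMM means, a common Fisher vector simplification. The function names fisher_vector and relevance_feedback_rerank are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def fisher_vector(frames, gmm):
    # Fisher score w.r.t. the GMM means only (diagonal covariances),
    # a common simplification of the full Fisher vector.
    gamma = gmm.predict_proba(frames)               # (T, K) soft assignments
    sigma = np.sqrt(gmm.covariances_)               # (K, d) for a 'diag' model
    T = frames.shape[0]
    parts = []
    for k in range(gmm.n_components):
        diff = (frames - gmm.means_[k]) / sigma[k]  # normalized residuals, (T, d)
        weighted = gamma[:, k:k + 1] * diff         # weight frames by responsibility
        parts.append(weighted.sum(axis=0) / (T * np.sqrt(gmm.weights_[k])))
    fv = np.concatenate(parts)                      # (K * d,) video signature
    return fv / (np.linalg.norm(fv) + 1e-12)        # L2 normalization

def relevance_feedback_rerank(top_videos, user_labels, n_components=8):
    # top_videos: list of (T_i, d) arrays of frame descriptors for the
    # top retrieved results; user_labels: 1 = relevant, 0 = not relevant.
    # Step 1: fit the GMM on all frames of the top results, unsupervised,
    # so the model specializes in the most relevant examples.
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=0).fit(np.vstack(top_videos))
    # Step 2: encode each video as a Fisher vector over its frames,
    # capturing frame-level (temporal) variation in a fixed-size vector.
    X = np.array([fisher_vector(v, gmm) for v in top_videos])
    # Step 3: train an SVM on the user's feedback and re-rank by its score.
    svm = SVC(kernel="linear").fit(X, user_labels)
    return np.argsort(-svm.decision_function(X))    # indices, most relevant first
```

Because the GMM is fitted only on the unlabelled top results, the Fisher representation is tuned to the neighbourhood of the query; the SVM is then trained on the user's relevance labels over that same representation and used to re-rank the list, mirroring the loop in the abstract.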





Published In

Computer Vision and Image Understanding, Volume 143, Issue C, February 2016, 201 pages

Publisher

Elsevier Science Inc., United States

Publication History

Published: 01 February 2016

Author Tags

  1. Fisher Kernel representation
  2. Multimodal content description
  3. Relevance feedback
  4. Video retrieval

Qualifiers

  • Research-article


Cited By

  • (2023) Temporal Multimodal Graph Transformer With Global-Local Alignment for Video-Text Retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 33(3): 1438-1453. https://doi.org/10.1109/TCSVT.2022.3207910 (1-Mar-2023)
  • (2021) Exquisitor at the Lifelog Search Challenge 2021: Relationships Between Semantic Classifiers. Proceedings of the 4th Annual on Lifelog Search Challenge, pp. 3-6. https://doi.org/10.1145/3463948.3469255 (21-Aug-2021)
  • (2020) Exquisitor at the Lifelog Search Challenge 2020. Proceedings of the Third Annual Workshop on Lifelog Search Challenge, pp. 19-22. https://doi.org/10.1145/3379172.3391718 (9-Jun-2020)
  • (2020) Interactive Learning for Multimedia at Large. Advances in Information Retrieval, pp. 495-510. https://doi.org/10.1007/978-3-030-45439-5_33 (14-Apr-2020)
  • (2019) Exquisitor. Proceedings of the 27th ACM International Conference on Multimedia, pp. 1029-1031. https://doi.org/10.1145/3343031.3350580 (15-Oct-2019)
  • (2018) Blackthorn. IEEE Transactions on Multimedia, 20(3): 687-698. https://doi.org/10.1109/TMM.2017.2755986 (1-Mar-2018)
  • (2017) Improving video event retrieval by user feedback. Multimedia Tools and Applications, 76(21): 22361-22381. https://doi.org/10.1007/s11042-017-4798-3 (1-Nov-2017)
  • (2016) GPU-FV. Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, pp. 39-46. https://doi.org/10.1145/2911996.2911997 (6-Jun-2016)
