Visual word proximity and linguistics for semantic video indexing and near-duplicate retrieval

Authors:

Chong-Wah Ngo

Computer Vision and Image Understanding, Volume 113, Issue 3

Pages 405 - 414

https://doi.org/10.1016/j.cviu.2008.10.002

Published: 01 March 2009

Abstract

Bag-of-visual-words (BoW) has recently become a popular representation for describing video and image content. Most existing approaches, however, neglect inter-word relatedness and measure similarity by bin-to-bin comparison of visual word histograms. In this paper, we explore the linguistic and ontological aspects of visual words for video analysis. Two approaches, soft-weighting and constraint-based earth mover's distance (CEMD), are proposed to model different aspects of visual word linguistics and proximity. In soft-weighting, visual words are weighted so that the linguistic meaning of words is taken into account in bin-to-bin histogram comparison. In CEMD, a cross-bin matching algorithm is formulated such that the ground distance measure considers the linguistic similarity of words. In particular, a BoW ontology that hierarchically specifies the hyponym relationships of words is constructed to assist the reasoning. We demonstrate soft-weighting and CEMD on two tasks: semantic video indexing and near-duplicate keyframe retrieval. Experimental results indicate that soft-weighting outperforms other popular weighting schemes, such as term frequency (TF) weighting, on a large-scale video database. In addition, CEMD compares favorably with cosine similarity in near-duplicate retrieval.
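To make the contrast between TF weighting and soft-weighting concrete, here is a minimal sketch in Python. The abstract does not give the soft-weighting formula, so everything below is an assumption for illustration only: each descriptor votes for its `top_n` nearest visual words, the vote is halved at each successive rank, and similarity is taken as `1 / (1 + distance)`. The function names, the toy codebook, and the two single-descriptor frames are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(u, v):
    """Bin-to-bin cosine similarity between two histograms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def tf_histogram(descriptors, codebook):
    """TF weighting: each descriptor adds 1 to its single nearest visual word."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)), key=lambda k: euclidean(d, codebook[k]))
        hist[nearest] += 1.0
    return hist


def soft_histogram(descriptors, codebook, top_n=2):
    """Assumed soft-weighting scheme (not taken from the abstract):
    each descriptor votes for its top_n nearest words, and the word at
    rank r (0-based) receives similarity / 2**r."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        ranked = sorted(range(len(codebook)), key=lambda k: euclidean(d, codebook[k]))
        for rank, k in enumerate(ranked[:top_n]):
            similarity = 1.0 / (1.0 + euclidean(d, codebook[k]))  # assumed similarity
            hist[k] += similarity / (2 ** rank)
    return hist


# Toy data: two close visual words and one distant word.
codebook = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
# Two nearly identical descriptors that fall on opposite sides of a word boundary.
frame_a = [(0.4, 0.0)]
frame_b = [(0.6, 0.0)]

tf_sim = cosine(tf_histogram(frame_a, codebook), tf_histogram(frame_b, codebook))
soft_sim = cosine(soft_histogram(frame_a, codebook), soft_histogram(frame_b, codebook))
print(f"TF cosine:   {tf_sim:.3f}")
print(f"soft cosine: {soft_sim:.3f}")
```

In this toy case the two descriptors are assigned to different words under hard assignment, so the TF histograms share no bins and their cosine similarity is zero. Under the assumed soft scheme both frames place weight on the same two neighboring words, so the bin-to-bin comparison yields a clearly positive similarity.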

Published In

Computer Vision and Image Understanding Volume 113, Issue 3

March, 2009

119 pages

ISSN:1077-3142

Copyright © 2008 Elsevier Inc.

Publisher

Elsevier Science Inc.

United States
