Visual word proximity and linguistics for semantic video indexing and near-duplicate retrieval

Published: 01 March 2009

Abstract

Bag-of-visual-words (BoW) has recently become a popular representation for describing video and image content. Most existing approaches, however, neglect inter-word relatedness and measure similarity by bin-to-bin comparison of visual-word histograms. In this paper, we explore the linguistic and ontological aspects of visual words for video analysis. We propose two approaches, soft-weighting and constraint-based earth mover's distance (CEMD), to model different aspects of visual word linguistics and proximity. In soft-weighting, visual words are weighted such that the linguistic meaning of words is taken into account in bin-to-bin histogram comparison. In CEMD, a cross-bin matching algorithm is formulated such that the ground distance measure considers the linguistic similarity of words. In particular, a BoW ontology that hierarchically specifies the hyponym relationships among words is constructed to assist the reasoning. We demonstrate soft-weighting and CEMD on two tasks: semantic video indexing and near-duplicate keyframe retrieval. Experimental results indicate that soft-weighting is superior to other popular weighting schemes, such as term frequency (TF) weighting, on a large-scale video database. In addition, CEMD shows excellent performance compared with cosine similarity in near-duplicate retrieval.
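
The contrast between hard term-frequency counting and soft-weighting can be illustrated with a minimal Python sketch. The abstract does not give the weighting formula, so the rule used here is an assumption: each keypoint votes for its top-N nearest visual words, and the vote at rank i is its similarity halved i times. The function names, the `top_n` parameter, and the toy similarity values are all hypothetical and are not taken from the paper.

```python
import math

def tf_histogram(assignments, vocab_size):
    """Hard assignment: each keypoint votes once for its single nearest word."""
    hist = [0.0] * vocab_size
    for ranked_words in assignments:
        nearest_word, _ = ranked_words[0]
        hist[nearest_word] += 1.0
    return hist

def soft_histogram(assignments, vocab_size, top_n=4):
    """Soft assignment (assumed rule): a keypoint votes for its top_n nearest
    words, and the vote at rank i (0-based) is similarity / 2**i."""
    hist = [0.0] * vocab_size
    for ranked_words in assignments:
        for rank, (word, similarity) in enumerate(ranked_words[:top_n]):
            hist[word] += similarity / (2 ** rank)
    return hist

def cosine(a, b):
    """Bin-to-bin cosine similarity between two histograms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy data: one keypoint per keyframe, given as (word_id, similarity) pairs
# sorted from nearest to farthest. The keypoint lies between words 0 and 1,
# and its nearest word flips between the two frames.
frame_a = [[(0, 0.9), (1, 0.8)]]
frame_b = [[(1, 0.9), (0, 0.8)]]

tf_score = cosine(tf_histogram(frame_a, 3), tf_histogram(frame_b, 3))
soft_score = cosine(soft_histogram(frame_a, 3), soft_histogram(frame_b, 3))
print(tf_score)    # 0.0: the hard-assigned histograms share no bin
print(soft_score)  # about 0.74: related words now overlap
```

In this toy case the two keyframes look unrelated under TF weighting, while soft-weighting gives them overlapping bins, so a bin-to-bin measure can still register their proximity.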

Published In

Computer Vision and Image Understanding  Volume 113, Issue 3
March, 2009
119 pages

Publisher

Elsevier Science Inc.

United States

Author Tags

  1. CEMD matching
  2. Linguistic similarity
  3. Near-duplicate keyframe
  4. Semantic concept
  5. Soft-weighting
  6. Visual ontology

Qualifiers

  • Article