Abstract
In recent years, the problem of video location estimation (i.e., estimating the longitude/latitude coordinates of a video without GPS information) has been approached with diverse methods and ideas in the research community, and significant improvements have been made. So far, however, systems have only been compared against each other, and no systematic study of human performance has been conducted. Based on a human-subject study with 11,900 experiments, this article presents a human baseline for location estimation for different combinations of modalities (audio, audio/video, audio/video/text). Furthermore, this article compares state-of-the-art location estimation systems with the human baseline. Although the overall performance of humans in multimodal video location estimation is better than that of current machine learning approaches, the difference is quite small: for 41% of the test set, the machine's accuracy was superior to the humans'. We present case studies and discuss why machines did better for some videos and not for others. Our analysis suggests new directions and priorities for future work on the improvement of location inference algorithms.
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Choi, J. et al. (2015). Human Versus Machine: Establishing a Human Baseline for Multimodal Location Estimation. In: Choi, J., Friedland, G. (eds) Multimodal Location Estimation of Videos and Images. Springer, Cham. https://doi.org/10.1007/978-3-319-09861-6_9
DOI: https://doi.org/10.1007/978-3-319-09861-6_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09860-9
Online ISBN: 978-3-319-09861-6
eBook Packages: Engineering, Engineering (R0)