Abstract
As the number of digital images grows rapidly and Content-based Image Retrieval (CBIR) gains popularity, CBIR systems should make the leap to Web-scale datasets. In this paper, we report on our experience in building an experimental similarity search system on a test collection of more than 50 million images. The first big challenge we faced was obtaining a collection of images of this scale together with the corresponding descriptive features. We tackled the non-trivial process of crawling the images and extracting several MPEG-7 descriptors. The result of this effort is a test collection, the first of such scale, open to the research community for experiments and comparisons. The second challenge was to develop indexing and searching mechanisms able to scale to the target size and to answer similarity queries in real time. We achieved this goal by creating sophisticated centralized and distributed structures based purely on the metric space model of data. Combining them has resulted in an extremely flexible and scalable solution. In this paper, we study in detail the performance of this technology and its evolution as the data volume grows by three orders of magnitude. The results of the experiments are very encouraging and promising for future applications.
Notes
EU IST FP6 project 045128: Search on Audio-visual content using Peer-to-peer IR
Acknowledgements
This research was supported by the EU IST FP6 project 045128 (SAPIR) and national projects GACR 201/08/P507, GACR 201/09/0683, GACR 102/09/H042, and MSMT 1M0545. Hardware infrastructure was provided by MetaCenter and by an IBM SUR Award.
Cite this article
Batko, M., Falchi, F., Lucchese, C. et al. Building a web-scale image similarity search system. Multimed Tools Appl 47, 599–629 (2010). https://doi.org/10.1007/s11042-009-0339-z
DOI: https://doi.org/10.1007/s11042-009-0339-z