Abstract
Similarity search has become a principal operation not only in databases but also in diverse application domains. Very large datasets, however, severely challenge its capacity to process data at scale. To address this challenge, we propose a two-level clustering approach that supports fast similarity search in massive datasets. In addition, we embed pruning and filtering strategies into our methods to eliminate redundant data, preserve accuracy, and avoid inessential data accesses, unnecessary distance computations, and the overheads that follow from them. We validate our methods through a series of empirical experiments on large real-world datasets. The results show that our approach outperforms two inverted index-based approaches, especially for large query batches.
Acknowledgments
Our sincere thanks to Mr. Faruk Kujundi, Scientific Computing, Information Management team, Johannes Kepler University Linz, for kindly supporting us with Alex Cluster.
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Phan, T.N. et al. (2016). TLCSim: A Large-Scale Two-Level Clustering Similarity Search with MapReduce. In: Dang, T., Wagner, R., Küng, J., Thoai, N., Takizawa, M., Neuhold, E. (eds) Future Data and Security Engineering. FDSE 2016. Lecture Notes in Computer Science, vol 10018. Springer, Cham. https://doi.org/10.1007/978-3-319-48057-2_4
DOI: https://doi.org/10.1007/978-3-319-48057-2_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-48056-5
Online ISBN: 978-3-319-48057-2
eBook Packages: Computer Science (R0)