Abstract
Given a set of high-dimensional sparse vectors, a similarity function, and a threshold, AllPairs Similarity Search (APSS) finds all pairs of vectors whose similarity is greater than or equal to the threshold. APSS has been studied in many fields of computer science, including information retrieval, data mining, and databases. It is a crucial part of many applications, such as near-duplicate document detection, collaborative filtering, query refinement, and clustering. For cosine similarity, many serial algorithms have been proposed that solve the problem by pruning the set of similarity candidates for each query object. However, the efficiency of these serial algorithms degrades badly as the threshold decreases. Parallel implementations of APSS based on OpenMP or MapReduce also adopt the pruning policy and do not fully solve this problem. In this context, we introduce CuAPSS, which solves the all-pairs cosine similarity search problem in the CUDA environment on GPUs. Our method takes a hybrid approach that uses both forward lists and inverted lists in APSS, striking a trade-off between memory access and dot-product computation. Experimental results show that our method solves the problem much faster than existing methods on several benchmark datasets with hundreds of millions of non-zero values, achieving a speedup of 1.5x–23x over the state-of-the-art parallel algorithm while keeping a relatively stable running time across different threshold values.
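To make the problem statement concrete, the following is a minimal serial sketch of exact all-pairs cosine similarity search over an inverted index, written in Python. It is not the CuAPSS method and does no pruning or GPU work; it only illustrates the task that CuAPSS and the serial algorithms address. The function names `normalize` and `all_pairs_cosine`, and the choice of a dict (feature id to weight) as the sparse vector representation, are assumptions made for illustration.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector (dict: feature id -> weight) to unit L2 norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {f: w / norm for f, w in vec.items()} if norm > 0 else {}


def all_pairs_cosine(vectors, threshold):
    """Return sorted (i, j, sim) for every i < j with cosine(i, j) >= threshold."""
    unit = [normalize(v) for v in vectors]
    inverted = defaultdict(list)  # feature id -> [(vector id, weight), ...]
    results = []
    for i, vec in enumerate(unit):
        # Accumulate dot products against all previously indexed vectors
        # that share at least one non-zero feature with vector i.
        scores = defaultdict(float)
        for f, w in vec.items():
            for j, wj in inverted[f]:
                scores[j] += w * wj
        # On unit vectors the dot product equals the cosine similarity.
        for j, s in scores.items():
            if s >= threshold:
                results.append((j, i, s))
        # Index vector i only after querying, so each pair is reported once.
        for f, w in vec.items():
            inverted[f].append((i, w))
    return sorted(results)


if __name__ == "__main__":
    docs = [{0: 1.0, 1: 1.0}, {0: 2.0, 1: 2.0}, {2: 1.0}]
    for i, j, sim in all_pairs_cosine(docs, threshold=0.9):
        print(i, j, round(sim, 6))
```

Because every pair of vectors sharing a feature is scored in full, the cost of this baseline does not depend on the threshold; the pruning-based serial algorithms mentioned above reduce that work when the threshold is high.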
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Feng, Y., Tang, J., Wang, C., Xie, J. (2018). CuAPSS: A Hybrid CUDA Solution for AllPairs Similarity Search. In: Vaidya, J., Li, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2018. Lecture Notes in Computer Science, vol 11334. Springer, Cham. https://doi.org/10.1007/978-3-030-05051-1_29
DOI: https://doi.org/10.1007/978-3-030-05051-1_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-05050-4
Online ISBN: 978-3-030-05051-1
eBook Packages: Computer Science, Computer Science (R0)