CuAPSS: A Hybrid CUDA Solution for AllPairs Similarity Search

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2018)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11334)

Abstract

Given a set of high-dimensional sparse vectors, a similarity function, and a threshold, AllPairs Similarity Search (APSS) finds all pairs of vectors whose similarity is greater than or equal to the threshold. APSS has been studied in many fields of computer science, including information retrieval, data mining, and databases. It is a crucial part of many applications, such as near-duplicate document detection, collaborative filtering, query refinement, and clustering. For cosine similarity, many serial algorithms have been proposed that solve the problem by pruning the set of candidate vectors for each query object. However, the efficiency of these serial algorithms degrades badly as the threshold decreases. Parallel implementations of APSS based on OpenMP or MapReduce adopt the same pruning policy and therefore do not fully solve this problem. In this context, we introduce CuAPSS, which solves the AllPairs cosine similarity search problem in the CUDA environment on GPUs. Our method is a hybrid that uses both the forward list and the inverted list in APSS, striking a compromise between memory accesses and dot-product computation. The experimental results show that our method solves the problem much faster than existing methods on several benchmark datasets with hundreds of millions of non-zero values, achieving a speedup of 1.5x–23x over the state-of-the-art parallel algorithm while keeping a relatively stable running time across different threshold values.
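To make the problem statement concrete, the following is a minimal CPU-only sketch in Python of exact AllPairs cosine similarity search without pruning. It is not the CuAPSS algorithm and makes no claim about how the authors combine the two structures on the GPU; it only illustrates, under our own assumptions, the two access patterns the abstract names. One function walks the forward lists (one sparse dot product per pair of vectors), the other accumulates dot products through an inverted list (feature to vectors containing it). The function names, the dict-based sparse representation, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector (dict: feature id -> weight) to unit L2 norm,
    so that the dot product of two vectors equals their cosine similarity."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {f: w / norm for f, w in vec.items()}


def apss_forward(vectors, threshold):
    """Forward-list approach: compute one sparse dot product per pair."""
    result = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            a, b = vectors[i], vectors[j]
            if len(a) > len(b):
                a, b = b, a  # iterate over the shorter vector
            sim = sum(w * b[f] for f, w in a.items() if f in b)
            if sim >= threshold:
                result.append((i, j, sim))
    return result


def apss_inverted(vectors, threshold):
    """Inverted-list approach: for each vector, accumulate partial dot
    products with all earlier vectors that share at least one feature."""
    index = defaultdict(list)  # feature id -> [(vector id, weight), ...]
    result = []
    for j, vec in enumerate(vectors):
        scores = defaultdict(float)
        for f, w in vec.items():
            for i, wi in index[f]:
                scores[i] += w * wi
        for i, sim in scores.items():
            if sim >= threshold:
                result.append((i, j, sim))
        # Index the vector only after querying, so each pair is seen once.
        for f, w in vec.items():
            index[f].append((j, w))
    return result


# Toy data (an assumption for illustration): three small sparse vectors.
docs = [normalize(v) for v in (
    {0: 1.0, 1: 1.0},
    {0: 1.0, 1: 1.0, 2: 1.0},
    {2: 1.0, 3: 1.0},
)]
pairs_forward = sorted((i, j) for i, j, _ in apss_forward(docs, 0.5))
pairs_inverted = sorted((i, j) for i, j, _ in apss_inverted(docs, 0.5))
print(pairs_forward, pairs_inverted)
```

Both functions return the same pairs. The forward-list version does work for every pair of vectors but reads each vector contiguously, whereas the inverted-list version only touches pairs that share a feature at the cost of scattered updates to the score accumulator, which is one way to picture the memory-versus-computation trade-off the abstract mentions.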

Author information

Corresponding author

Correspondence to Chongjun Wang.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Feng, Y., Tang, J., Wang, C., Xie, J. (2018). CuAPSS: A Hybrid CUDA Solution for AllPairs Similarity Search. In: Vaidya, J., Li, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2018. Lecture Notes in Computer Science, vol 11334. Springer, Cham. https://doi.org/10.1007/978-3-030-05051-1_29

  • DOI: https://doi.org/10.1007/978-3-030-05051-1_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-05050-4

  • Online ISBN: 978-3-030-05051-1

  • eBook Packages: Computer Science, Computer Science (R0)
