TLCSim: A Large-Scale Two-Level Clustering Similarity Search with MapReduce

  • Conference paper
Future Data and Security Engineering (FDSE 2016)

Abstract

Similarity search has become a principal operation not only in databases but also in diverse application domains. Very large datasets, however, challenge its ability to process data at enormous volume. To address this challenge, we propose a two-level clustering approach that supports fast similarity searches in massive datasets. In addition, we embed pruning and filtering strategies into our methods in order to eliminate redundant data, preserve data accuracy, and avoid inessential data accesses, unnecessary distance computations, and their downstream consequences. Furthermore, we validate our methods through a series of empirical experiments on real large datasets. The results show that our approach performs better than the two inverted index-based approaches, especially when given big query batches.
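As a rough illustration of how a two-level clustering index can avoid unnecessary distance computations, the sketch below groups points around pivots at two levels and skips whole clusters using the triangle inequality. It is not the TLCSim algorithm, which the abstract does not specify: the Euclidean metric, the evenly spaced pivot selection, the single-machine setting (no MapReduce), and all function names (`cluster`, `build_index`, `range_search`) are assumptions made for this example.

```python
import math
from collections import defaultdict


def cluster(points, k):
    """Assign each point to its nearest pivot (hypothetical pivot choice:
    up to k evenly spaced points). Returns (pivot, radius, members) tuples."""
    pivots = points[::max(1, len(points) // k)][:k]
    groups = defaultdict(list)
    for p in points:
        nearest = min(range(len(pivots)), key=lambda j: math.dist(p, pivots[j]))
        groups[nearest].append(p)
    # radius = distance from the pivot to its farthest member
    return [
        (pivots[i], max(math.dist(pivots[i], m) for m in members), members)
        for i, members in groups.items()
    ]


def build_index(points, k1, k2):
    """Two levels: top-level clusters, each split again into sub-clusters."""
    return [
        (pivot, radius, cluster(members, k2))
        for pivot, radius, members in cluster(points, k1)
    ]


def range_search(index, query, r):
    """Return all points within distance r of query, plus the number of
    point-level distance computations that were actually performed."""
    hits, computed = [], 0
    for pivot, radius, subclusters in index:
        # Triangle inequality: every member is at least d(q, pivot) - radius away.
        if math.dist(query, pivot) - radius > r:
            continue  # prune the whole top-level cluster
        for sub_pivot, sub_radius, members in subclusters:
            if math.dist(query, sub_pivot) - sub_radius > r:
                continue  # prune the sub-cluster
            for m in members:
                computed += 1
                if math.dist(query, m) <= r:
                    hits.append(m)
    return hits, computed


if __name__ == "__main__":
    grid = [(float(x), float(y)) for x in range(20) for y in range(20)]
    idx = build_index(grid, k1=5, k2=4)
    found, n_dist = range_search(idx, (3.0, 4.0), 2.5)
    print(len(found), "hits using", n_dist, "of", len(grid), "distance computations")
```

The pruning test is exact for any metric, so the result matches a brute-force scan; only the number of distance computations changes. How much is saved depends heavily on how compact the clusters are, which is why pivot selection matters in practice.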

Notes

  1. http://hadoop.apache.org/docs/r1.2.1/mapred-default.html

  2. http://www.jku.at/content/e213/e174/e167/e186534

  3. http://www.gutenberg.org/

  4. http://hadoop.apache.org/docs/r1.2.1/streaming.html

Acknowledgments

Our sincere thanks to Mr. Faruk Kujundi, Scientific Computing, Information Management team, Johannes Kepler University Linz, for kindly supporting us with Alex Cluster.

Corresponding author

Correspondence to Trong Nhan Phan.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Phan, T.N. et al. (2016). TLCSim: A Large-Scale Two-Level Clustering Similarity Search with MapReduce. In: Dang, T., Wagner, R., Küng, J., Thoai, N., Takizawa, M., Neuhold, E. (eds) Future Data and Security Engineering. FDSE 2016. Lecture Notes in Computer Science, vol 10018. Springer, Cham. https://doi.org/10.1007/978-3-319-48057-2_4

  • DOI: https://doi.org/10.1007/978-3-319-48057-2_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-48056-5

  • Online ISBN: 978-3-319-48057-2

  • eBook Packages: Computer Science (R0)
