TLCSim: A Large-Scale Two-Level Clustering Similarity Search with MapReduce

Phan, Trong Nhan; Jäger, Markus; Nadschläger, Stefan; Gómez-Pérez, Pablo; Huber, Christian; Küng, Josef; Nguyen, Cong An

doi:10.1007/978-3-319-48057-2_4

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10018)

Included in the following conference series:

International Conference on Future Data and Security Engineering

Abstract

Similarity search has become a principal operation not only in databases but also in diverse application domains. Very large datasets, however, pose a serious challenge because of the sheer volume of data that must be processed. To address this challenge, we propose a two-level clustering approach that supports fast similarity searches in massive datasets. In addition, we embed pruning and filtering strategies into our methods so that data redundancy, data accuracy, inessential data accesses, unnecessary distance computations, and their knock-on effects are all taken into account. Furthermore, we validate our methods through a series of empirical experiments on real big datasets. The results show that our approach performs better than the two inverted index-based approaches, especially when given large query batches.

Acknowledgments

Our sincere thanks go to Mr. Faruk Kujundi, Scientific Computing, Information Management team, Johannes Kepler University Linz, for kindly supporting us with the Alex Cluster.

Author information

Authors and Affiliations

Faculty of Engineering and Natural Sciences (TNF), Institute for Application Oriented Knowledge Processing (FAW), Johannes Kepler University (JKU), Linz, Austria
Trong Nhan Phan, Markus Jäger, Stefan Nadschläger, Pablo Gómez-Pérez, Christian Huber & Josef Küng
Dong Nai Social Insurance, Dong Nai, Vietnam
Cong An Nguyen

Corresponding author

Correspondence to Trong Nhan Phan.

Editor information

Editors and Affiliations

University of Technology, Ho Chi Minh City, Vietnam
Tran Khanh Dang
Johannes Kepler University, Linz, Austria
Roland Wagner
Johannes Kepler University, Linz, Austria
Josef Küng
University of Technology, Ho Chi Minh City, Vietnam
Nam Thoai
Hosei University, Tokyo, Japan
Makoto Takizawa
University of Vienna, Wien, Austria
Erich Neuhold

About this paper

Cite this paper

Phan, T.N. et al. (2016). TLCSim: A Large-Scale Two-Level Clustering Similarity Search with MapReduce. In: Dang, T., Wagner, R., Küng, J., Thoai, N., Takizawa, M., Neuhold, E. (eds) Future Data and Security Engineering. FDSE 2016. Lecture Notes in Computer Science, vol 10018. Springer, Cham. https://doi.org/10.1007/978-3-319-48057-2_4

DOI: https://doi.org/10.1007/978-3-319-48057-2_4
Published: 23 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-48056-5
Online ISBN: 978-3-319-48057-2
eBook Packages: Computer Science, Computer Science (R0)
