An improved method of locality-sensitive hashing for scalable instance matching

Aydar, Mehmet; Ayvaz, Serkan

doi:10.1007/s10115-018-1199-5

Regular Paper
Published: 26 April 2018

Knowledge and Information Systems, volume 58, pages 275–294 (2019)

Abstract

In this study, we propose a scalable approach for automatically identifying similar candidate instance pairs in very large datasets. Efficient candidate pair generation is essential to many computational problems that involve calculating instance similarities. Two significant bottlenecks of candidate instance pair generation are calculating the similarities of instances with a large number of properties and efficiently matching a large number of similar instances in a scalable way. In our approach, we use the locality-sensitive hashing (LSH) technique to greatly improve the scalability of candidate instance pair generation. Based on the candidate similarity threshold, our algorithm automatically discovers the optimum number of hash functions in each band in LSH. Moreover, we evaluated the scalability of our approach and its effectiveness in the instance matching task using very large real-world datasets.
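
To illustrate how a similarity threshold can determine the number of hash functions per band, the following is a minimal sketch based on the standard banding analysis for MinHash LSH (as described in Mining of Massive Datasets), not the algorithm proposed in the paper, whose details are not given in this abstract. The function names, the fixed budget `num_hashes`, and the selection rule (closest approximate threshold) are assumptions made for illustration only.

```python
from typing import Tuple


def candidate_probability(s: float, rows: int, bands: int) -> float:
    """Probability that two items with Jaccard similarity s agree on
    every hash in at least one band (standard LSH S-curve)."""
    return 1.0 - (1.0 - s ** rows) ** bands


def approx_threshold(rows: int, bands: int) -> float:
    """Similarity near which the S-curve rises most steeply,
    commonly approximated as (1 / bands) ** (1 / rows)."""
    return (1.0 / bands) ** (1.0 / rows)


def choose_rows_per_band(threshold: float, num_hashes: int) -> Tuple[int, int]:
    """Hypothetical selection rule: among all (rows, bands) with
    rows * bands <= num_hashes, return the pair whose approximate
    threshold is closest to the requested similarity threshold."""
    best_gap = None
    best_pair = (1, num_hashes)
    for rows in range(1, num_hashes + 1):
        bands = num_hashes // rows  # use as many full bands as fit
        gap = abs(approx_threshold(rows, bands) - threshold)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_pair = (rows, bands)
    return best_pair


if __name__ == "__main__":
    rows, bands = choose_rows_per_band(threshold=0.8, num_hashes=100)
    print("rows per band:", rows, "bands:", bands)
    for s in (0.5, 0.7, 0.8, 0.9):
        print(s, round(candidate_probability(s, rows, bands), 4))
```

In this sketch, a higher threshold leads to more hash functions per band, which makes a band collision less likely for dissimilar pairs and so reduces the number of candidate pairs that must be compared exactly.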

Acknowledgements

We would like to thank the OAEI 2016 campaign Instance Matching Task organizers, particularly Dr. Manel Achichi, Dr. Daniel Faria and Dr. Ernesto Jiménez-Ruiz, for providing run-time evaluations. We also thank Dr. Daniel Faria for providing AML's OAEI 2016 version as a stand-alone JAR for testing purposes.

Author information

Authors and Affiliations

Department of Computer Science, Kent State University, Kent, OH, USA
Mehmet Aydar
Department of Software Engineering, Bahcesehir University, Beşiktaş, Istanbul, Turkey
Serkan Ayvaz

Corresponding author

Correspondence to Mehmet Aydar.

About this article

Cite this article

Aydar, M., Ayvaz, S. An improved method of locality-sensitive hashing for scalable instance matching. Knowl Inf Syst 58, 275–294 (2019). https://doi.org/10.1007/s10115-018-1199-5

Received: 21 March 2017
Revised: 25 February 2018
Accepted: 18 April 2018
Published: 26 April 2018
Issue Date: 06 February 2019
DOI: https://doi.org/10.1007/s10115-018-1199-5
