Abstract
Measuring the semantic similarity between sentences is essential to many applications, such as text summarization, Web page retrieval, question answering, and image extraction. Several studies have explored this issue using knowledge-based, corpus-based, and hybrid strategies, among others. Most of these studies focus on improving effectiveness. In this paper, we address the efficiency issue: given a sentence collection, how can the top-k sentences semantically similar to a query be discovered efficiently? Previous methods do not scale to large data, because applying them directly is time consuming: every candidate sentence must be tested. We propose efficient strategies to tackle this problem within a general framework. The basic idea is to build a corresponding index for each similarity measure during preprocessing. Traversing these indices at query time avoids testing many candidates and thereby improves efficiency. Moreover, an optimal aggregation algorithm is introduced to combine these similarities. The framework is general enough to incorporate many similarity metrics, as discussed in the paper. We conduct an extensive experimental evaluation on three real datasets to assess the efficiency of our proposal, and we illustrate the trade-off between effectiveness and efficiency. The experimental results demonstrate that our proposal outperforms state-of-the-art techniques in efficiency while keeping the same high precision.
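The abstract does not spell out the aggregation step, so the following is only a minimal sketch of how per-similarity indices might be traversed and combined, written in the style of the well-known threshold algorithm for top-k aggregation. Everything named here is an assumption for illustration: the function `threshold_top_k`, the weighted-sum aggregation, scores in the range [0, 1], and the use of in-memory dictionaries sorted at call time as a stand-in for indices built during preprocessing.

```python
import heapq
from typing import Dict, List, Tuple


def threshold_top_k(
    score_tables: List[Dict[str, float]],
    weights: List[float],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k sentence ids with the highest weighted-sum score.

    Assumptions (not from the paper): each table maps sentence id to a
    similarity in [0, 1] with respect to the query, a missing id scores 0,
    and all weights are non-negative so the aggregate is monotone.
    """
    if k <= 0:
        return []

    # Stand-in for the indices: one list per similarity, best score first.
    sorted_lists = [
        sorted(table.items(), key=lambda item: item[1], reverse=True)
        for table in score_tables
    ]

    def aggregate(sentence_id: str) -> float:
        # Random access: look the sentence up in every similarity table.
        return sum(
            w * table.get(sentence_id, 0.0)
            for w, table in zip(weights, score_tables)
        )

    seen = set()
    top: List[Tuple[float, str]] = []  # min-heap holding the best k so far
    max_depth = max((len(lst) for lst in sorted_lists), default=0)

    for depth in range(max_depth):
        last_scores = []
        for lst in sorted_lists:
            if depth >= len(lst):
                last_scores.append(0.0)  # exhausted list: unseen ids score 0
                continue
            sentence_id, score = lst[depth]
            last_scores.append(score)
            if sentence_id not in seen:
                seen.add(sentence_id)
                heapq.heappush(top, (aggregate(sentence_id), sentence_id))
                if len(top) > k:
                    heapq.heappop(top)

        # Upper bound on the aggregate score of any sentence not yet seen.
        threshold = sum(w * s for w, s in zip(weights, last_scores))
        if len(top) >= k and top[0][0] >= threshold:
            break  # no unseen sentence can enter the top k

    return sorted(
        ((sid, score) for score, sid in top),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Hypothetical scores of three sentences against one query.
string_sim = {"s1": 0.9, "s2": 0.4, "s3": 0.1}
semantic_sim = {"s1": 0.2, "s2": 0.8, "s3": 0.3}
result = threshold_top_k([string_sim, semantic_sim], [0.5, 0.5], k=2)
print(result)
```

In this toy run the traversal stops after reading two entries from each list, because the threshold drops below the score of the current second-best sentence. That early stop is the sense in which index traversal can avoid testing every candidate.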
Cite this article
Gu, Y., Yang, Z., Xu, G. et al. Exploration on efficient similar sentences extraction. World Wide Web 17, 595–626 (2014). https://doi.org/10.1007/s11280-012-0195-z
DOI: https://doi.org/10.1007/s11280-012-0195-z