Aligning Needles in a Haystack: Paraphrase Acquisition Across the Web

Paşca, Marius; Dienes, Péter

doi:10.1007/11562214_11

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3651)

Included in the following conference series:

International Conference on Natural Language Processing

Abstract

This paper presents a lightweight method for unsupervised extraction of paraphrases from arbitrary textual Web documents. The method differs from previous approaches to paraphrase acquisition in that 1) it removes the assumptions on the quality of the input data, by using inherently noisy, unreliable Web documents rather than clean, trustworthy, properly formatted documents; and 2) it does not require any explicit clue indicating which documents are likely to encode parallel paraphrases, as they report on the same events or describe the same stories. Large sets of paraphrases are collected through exhaustive pairwise alignment of small needles, i.e., sentence fragments, across a haystack of Web document sentences. The paper describes experiments on a set of about one billion Web documents, and evaluates the extracted paraphrases in a natural-language Web search application.
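As a rough illustration of aligning sentence fragments across many sentences, the following minimal sketch treats two fragments as candidate paraphrases when they occur between identical surrounding words. The abstract does not specify the alignment criterion, so the shared-anchor rule, the function name `candidate_paraphrases`, and the parameters `anchor_len`, `min_mid`, and `max_mid` are all assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations


def candidate_paraphrases(sentences, anchor_len=2, min_mid=1, max_mid=3):
    """Pair up fragments that appear between the same left and right anchors.

    Assumption (not from the paper's abstract): a fragment is a word n-gram
    whose first and last `anchor_len` words act as anchors; the words in
    between are the candidate paraphrase.
    """
    by_anchor = defaultdict(set)
    for sentence in sentences:
        words = sentence.lower().split()
        for mid_len in range(min_mid, max_mid + 1):
            total = 2 * anchor_len + mid_len
            # Slide a window of `total` words over the sentence.
            for start in range(len(words) - total + 1):
                left = tuple(words[start:start + anchor_len])
                mid_end = start + anchor_len + mid_len
                mid = tuple(words[start + anchor_len:mid_end])
                right = tuple(words[mid_end:start + total])
                by_anchor[(left, right)].add(mid)

    # Any two distinct middles sharing both anchors form a candidate pair.
    pairs = set()
    for mids in by_anchor.values():
        for a, b in combinations(sorted(mids), 2):
            pairs.add((" ".join(a), " ".join(b)))
    return pairs


if __name__ == "__main__":
    demo = [
        "the city was flooded by heavy rain",
        "the city was inundated by heavy rain",
    ]
    for pair in sorted(candidate_paraphrases(demo)):
        print(pair)
```

On a Web-scale collection the grouping step would have to be distributed rather than held in one in-memory dictionary; this sketch only shows the pairing logic on a handful of sentences.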

Author information

Authors and Affiliations

Google Inc., 1600 Amphitheatre Parkway, Mountain View, California, 94043, USA
Marius Paşca & Péter Dienes

Editor information

Editors and Affiliations

Center for Language Technology, Macquarie University, 2019, Sydney, NSW, Australia
Robert Dale
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Kam-Fai Wong
Institute for Infocomm Research, 21, Heng Mui Keng Terrace, 119613, Singapore
Jian Su
Language Information Sciences Research Centre, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
Oi Yee Kwong

About this paper

Cite this paper

Paşca, M., Dienes, P. (2005). Aligning Needles in a Haystack: Paraphrase Acquisition Across the Web. In: Dale, R., Wong, KF., Su, J., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2005. IJCNLP 2005. Lecture Notes in Computer Science, vol 3651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562214_11

DOI: https://doi.org/10.1007/11562214_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29172-5
Online ISBN: 978-3-540-31724-1
eBook Packages: Computer Science, Computer Science (R0)
