DOI: 10.5555/976440.976464

Discovering parallel text from the World Wide Web

Published: 01 January 2004
    Abstract

    A parallel corpus is a rich linguistic resource for multilingual text management tasks, including cross-lingual text retrieval, multilingual computational linguistics, and multilingual text mining. Constructing a parallel corpus requires effective alignment of parallel documents. In this paper, we develop a parallel page identification system that identifies and aligns parallel documents from the World Wide Web. The system uses a Web spider to crawl the Web and fetch potentially parallel multilingual Web documents. Two modules then determine whether a candidate document pair is parallel: a filename comparison module checks filename resemblance, and a content analysis module measures semantic similarity. An experiment conducted on a multilingual Web site demonstrates the effectiveness of the system.
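
    The two checks named in the abstract can be illustrated with a minimal sketch. The abstract does not say how either module is implemented, so everything below is an assumption made for illustration: the language markers, the token-sequence ratio used for filename resemblance, the toy French-to-English dictionary, the cosine measure over term counts, and both thresholds are hypothetical and are not the authors' method.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Hypothetical language markers; the abstract does not list the real ones.
LANG_MARKERS = {"en", "eng", "english", "fr", "fra", "french"}


def normalize_filename(url):
    """Split the URL path into tokens and drop language markers."""
    path = urlparse(url).path.lower()
    tokens = re.split(r"[/_\-.]+", path)
    return [t for t in tokens if t and t not in LANG_MARKERS]


def filename_resemblance(url_a, url_b):
    """Similarity ratio in [0, 1] between the normalized path tokens."""
    return SequenceMatcher(None, normalize_filename(url_a),
                           normalize_filename(url_b)).ratio()


def term_vector(text, dictionary=None):
    """Count word tokens, optionally mapping each through a bilingual dictionary."""
    tokens = re.findall(r"\w+", text.lower())
    if dictionary is not None:
        tokens = [dictionary.get(t, t) for t in tokens]
    return Counter(tokens)


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_parallel(url_a, text_a, url_b, text_b, dictionary,
                name_threshold=0.8, content_threshold=0.5):
    """Accept a pair only if both checks pass (thresholds are assumed)."""
    name_score = filename_resemblance(url_a, url_b)
    content_score = cosine_similarity(term_vector(text_a),
                                      term_vector(text_b, dictionary))
    return name_score >= name_threshold and content_score >= content_threshold


# Toy French-to-English dictionary, for illustration only.
FR_TO_EN = {"bienvenue": "welcome", "à": "to", "la": "the",
            "bibliothèque": "library", "universitaire": "university"}

print(is_parallel("http://example.org/en/about_us.html",
                  "Welcome to the university library",
                  "http://example.org/fr/about_us.html",
                  "Bienvenue à la bibliothèque universitaire",
                  FR_TO_EN))
```

    In this sketch the filename check acts as a cheap filter and the content check as confirmation. Whether the original system combines the two modules this way is not stated in the abstract.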




    Published In

    ACSW Frontiers '04: Proceedings of the second workshop on Australasian information security, Data Mining and Web Intelligence, and Software Internationalisation - Volume 32
    January 2004
    192 pages

    Publisher

    Australian Computer Society, Inc.

    Australia


    Qualifiers

    • Article

    Conference

    ACSW Frontiers '04

    Acceptance Rates

    Overall Acceptance Rate 204 of 424 submissions, 48%


    Cited By

    • (2015) Domain adaptation of statistical machine translation with domain-focused web crawling. Language Resources and Evaluation, 49:1 (147-193). DOI: 10.1007/s10579-014-9282-3. Online publication date: 1-Mar-2015
    • (2012) Translation techniques in cross-language information retrieval. ACM Computing Surveys, 45:1 (1-44). DOI: 10.1145/2379776.2379777. Online publication date: 7-Dec-2012
    • (2009) Train the machine with what it can learn. Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora (27-33). DOI: 10.5555/1690339.1690348. Online publication date: 6-Aug-2009
    • (2008) Improved sentence alignment on parallel web pages using a stochastic tree alignment model. Proceedings of the Conference on Empirical Methods in Natural Language Processing (505-513). DOI: 10.5555/1613715.1613778. Online publication date: 25-Oct-2008
    • (2008) Association-based dynamic computation of reputation in web services. International Journal of Web and Grid Services, 4:2 (169-188). DOI: 10.1504/IJWGS.2008.018886. Online publication date: 1-Jun-2008
    • (2007) An Intelligent Web Agent to Mine Bilingual Parallel Pages via Automatic Discovery of URL Pairing Patterns. Proceedings of the 2007 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Workshops (526-529). DOI: 10.5555/1339264.1339635. Online publication date: 2-Nov-2007
    • (2006) A DOM tree alignment model for mining parallel data from the web. Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics (489-496). DOI: 10.3115/1220175.1220237. Online publication date: 17-Jul-2006
    • (2006) Automatic acquisition of Chinese–English parallel corpus from the web. Proceedings of the 28th European conference on Advances in Information Retrieval (420-431). DOI: 10.1007/11735106_37. Online publication date: 10-Apr-2006
