
Automatic generation of English/Chinese thesaurus based on a parallel corpus in laws

Published: 01 May 2003

Abstract

The information available on the World Wide Web in languages other than English is increasing significantly. According to a report from Computer Economics in 1999, 54% of Internet users are English speakers ("English Will Dominate Web for Only Three More Years," Computer Economics, July 9, 1999, http://www.computereconomics.com/new4/pr/pr990610.html). However, it is predicted that over the next five years the number of Internet users will grow by only 60% among English speakers versus 150% among non-English speakers. By 2005, 57% of Internet users will be non-English speakers. A report by CNN.com in 2000 showed that the number of Internet users in China increased from 8.9 million to 16.9 million between January and June 2000 ("Report: China Internet users double to 17 million," CNN.com, July, 2000, http://cnn.org/2000/TECH/computing/07/27/china.internet.reut/index.html). According to Nielsen/NetRatings, there was a dramatic leap from 22.5 million to 56.6 million Internet users from 2001 to 2002. China had become the second largest at-home Internet population in the world in 2002 (the US Internet population was 166 million) (Robyn Greenspan, "China Pulls Ahead of Japan," Internet.com, April 22, 2002, http://cyberatlas.internet.com/big_picture/geographics/article/0,5911_1013841,00.html). All of this evidence reveals the importance of cross-lingual research to satisfy the needs of the near future.

Digital library research has focused on structural and semantic interoperability in the past. Searching and retrieving objects across variations in protocols, formats, and disciplines have been widely explored (Schatz, B., & Chen, H. (1999). Digital libraries: technological advances and social impacts. IEEE Computer, Special Issue on Digital Libraries, February, 32(2), 45-50; Chen, H., Yen, J., & Yang, C.C. (1999). International activities: development of Asian digital libraries. IEEE Computer, Special Issue on Digital Libraries, 32(2), 48-49). However, research on crossing language boundaries, especially between European and Oriental languages, is still at an initial stage. In this proposal, we focus on cross-lingual semantic interoperability by developing the automatic generation of a cross-lingual thesaurus based on an English/Chinese parallel corpus. When searchers encounter retrieval problems, professional librarians usually consult a thesaurus to identify other relevant vocabulary. For searching across language boundaries, a cross-lingual thesaurus generated by co-occurrence analysis and a Hopfield network can be used to generate additional semantically relevant terms that cannot be obtained from a dictionary. In particular, the automatically generated cross-lingual thesaurus is able to capture unknown words that do not exist in a dictionary, such as names of persons, organizations, and events.

Due to Hong Kong's unique historical background, both English and Chinese are used as official languages in all legal documents. Therefore, English/Chinese cross-lingual information retrieval is critical for applications in courts and the government. In this paper, we develop an automatic thesaurus using the Hopfield network, based on a parallel corpus collected from the Web site of the Department of Justice of the Hong Kong Special Administrative Region (HKSAR) Government. Experiments are conducted to measure the precision and recall of the automatically generated English/Chinese thesaurus. The results show that such a thesaurus is a promising tool for retrieving relevant terms, especially terms in a language different from that of the input term. The direct translation of the input term can also be retrieved in most cases.
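To make the approach named in the abstract more concrete, the following is a minimal sketch of how co-occurrence analysis and a Hopfield-style spreading activation could be combined to suggest related terms across languages. It is not the authors' formulation: the asymmetric weight formula, the sigmoid parameters `theta` and `slope`, the 0.5 output cutoff, the function names, and the toy term lists are all assumptions chosen for illustration. Each toy "document" stands for the merged English and Chinese terms of one aligned text pair.

```python
import math
from collections import defaultdict
from itertools import combinations


def cooccurrence_weights(aligned_docs):
    """Asymmetric weights: w[a][b] = (#docs with both a and b) / (#docs with a).

    This simple ratio is an assumed stand-in for a co-occurrence measure.
    """
    doc_freq = defaultdict(int)
    pair_freq = defaultdict(int)
    for terms in aligned_docs:
        unique = set(terms)
        for t in unique:
            doc_freq[t] += 1
        for a, b in combinations(sorted(unique), 2):
            pair_freq[(a, b)] += 1
            pair_freq[(b, a)] += 1
    weights = defaultdict(dict)
    for (a, b), n in pair_freq.items():
        weights[a][b] = n / doc_freq[a]
    return weights


def related_terms(weights, seed, theta=0.5, slope=0.1, max_iter=20, eps=1e-4):
    """Hopfield-style spreading activation from a clamped seed term.

    Every other term is repeatedly updated with a sigmoid of its weighted
    input until the total change falls below eps (or max_iter is reached).
    theta, slope, and the 0.5 cutoff are hypothetical settings.
    """
    terms = set(weights) | {seed}
    act = {t: 0.0 for t in terms}
    act[seed] = 1.0
    for _ in range(max_iter):
        new_act = {}
        for j in terms:
            if j == seed:
                new_act[j] = 1.0  # the input term stays fully active
                continue
            net = sum(weights.get(i, {}).get(j, 0.0) * act[i] for i in terms)
            new_act[j] = 1.0 / (1.0 + math.exp((theta - net) / slope))
        change = sum(abs(new_act[t] - act[t]) for t in terms)
        act = new_act
        if change < eps:
            break
    active = [(t, a) for t, a in act.items() if t != seed and a > 0.5]
    return sorted(active, key=lambda pair: -pair[1])


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved term list against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy aligned "documents" (hypothetical data, not from the HKSAR corpus).
toy_docs = [
    ["contract", "合約", "breach", "違反"],
    ["contract", "合約", "tenant", "租客"],
    ["tenant", "租客", "landlord", "業主"],
    ["court", "法院", "appeal", "上訴"],
]
toy_weights = cooccurrence_weights(toy_docs)
toy_result = related_terms(toy_weights, "contract")
```

In this toy setting, activation spreads from the English seed to terms that share aligned documents with it, including Chinese terms, while terms from unrelated documents stay inactive. A real corpus would need term extraction and Chinese word segmentation first, which the sketch leaves out.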

References

[1] English Will Dominate Web for Only Three More Years. Computer Economics, July 9, 1999, http://www.computereconomics.com/new4/pr/pr990610.html
[2] "Report: China Internet users double to 17 million," CNN.com, July, 2000, http://cnn.org/2000/TECH/computing/07/27/china.internet.reut/index.html
[3] Greenspan, R. (2002). China pulls ahead of Japan. Internet.com. April 22, 2002, http://cyberatlas.internet.com/big_picture/geographics/article/0,5911_1013841,00.html
[4] Ballesteros, L., & Croft, B. (1997). Phrasal translation and query expansion techniques for cross-language information retrieval. Proceedings of the ACM SIGIR, 1997, pp. 84-91.
[5] Chen, H. (1998). Artificial intelligence techniques for emerging information systems applications: trailblazing path to semantic interoperability. Journal of the American Society for Information Science, 49(7), 579-581.
[6] Chien, L. F. (1997). PAT-tree-based keyword extraction for Chinese information retrieval. Proceedings of ACM SIGIR (pp. 50-58). Philadelphia, PA, 1997.
[7] Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46.
[8] Dai, Y., Khoo, C., & Loh, T. (1999). A new statistical formula for Chinese text segmentation incorporating contextual information. Proceedings of the ACM SIGIR (pp. 82-89). Berkeley, CA. August, 1999.
[9] Davis, M., & Dunning, T. (1995). A TREC evaluation of query translation methods for multi-lingual text retrieval. Proceedings of the Fourth Text Retrieval Conference (TREC-4) (pp. 175-185). NIST, November, 1995.
[10] Davis, M., & Dunning, T. (1995). Query translation using evolutionary programming for multi-lingual information retrieval. Proceedings of the Fourth Annual Conference on Evolutionary Programming, San Diego, CA.
[11] Dumais, S.T., Letsche, T.A., Littman, M.L., & Landauer, T.K. (1997). Automatic cross-language retrieval using latent semantic indexing. Proceedings of AAAI Symposium on Cross-Language Text and Speech Retrieval (pp. 15-21). March, 1997.
[12] Fan, C.K., & Tsai, W.H. (1988). Automatic word identification in Chinese sentences by the relaxation technique. Computer Processing of Chinese & Oriental Languages, 4(1), 33-56.
[13] Fluhr, C. (1995). Multilingual information retrieval: survey of the state of the art in human language technology. Center for Spoken Language Understanding, Oregon Graduate Institute, http://www.cse.ogi.edu/CSLU/HLTsurvey/ch8node7.html
[14] Fluhr, C., & Radwan, K. (1993). Fulltext database as lexical semantic knowledge for multilingual interrogation and machine translation. Proceedings of the East-West Conference on Artificial Intelligence (pp. 124-128). Moscow, September, 1993.
[15] Gan, K.W., Palmer, M., & Lua, K.T. (1996). A statistically emergent approach for language processing: application to modeling context effects in ambiguous Chinese word boundary perception. Computational Linguistics, pp. 531-553.
[16] Haykin, S. (1994). Neural networks: a comprehensive foundation. IEEE Press, New York.
[17] He, S. (2000). Translingual alteration of conceptual information in medical translation: a cross-language analysis between English and Chinese. Journal of the American Society for Information Science, 51(11), 1047-1060.
[18] Hull, D.A., & Grefenstette, G. (1996). Querying across languages: a dictionary-based approach to multilingual information retrieval. Proceedings of the ACM SIGIR (pp. 49-57).
[19] Ikehara, S., Shirai, T.S., & Kawaoka, T. (1995). Automatic extraction of uninterrupted and interrupted collocations from very large Japanese corpora using N-gram statistics. Transactions of the Information Processing Society of Japan, 36(11), November, 1995, 2584-2596.
[20] Klavans, J., Hovy, E., Fluhr, C., Frederking, R.E., Oard, D., Okumura, A., Ishikawa, K., & Satoh, K. (1999). Multilingual (or cross-lingual) information retrieval. Multilingual information management: current levels and future abilities. Pisa, Italy.
[21] Landauer, T.K., & Littman, M.L. (1990). Fully automatic cross-language document retrieval using latent semantic indexing. Proceedings of the Sixth Annual Conference of the UW Centre for the New Oxford English Dictionary and Text Research (pp. 31-38). Waterloo, Ontario, October, 1990.
[22] Leonardi, V. (2000). Equivalence in translation: between myth and reality. Translation Journal, 4(4).
[23] Lin, C., & Chen, H. (1996). An automatic indexing and neural network approach to concept retrieval and classification of multilingual (Chinese-English) documents. IEEE Transactions on Systems, Man, and Cybernetics, 16(1), 1-14.
[24] Lua, K.T. (1990). From character to word--an application of information theory. Computer Processing of Chinese & Oriental Languages, 4(4), 304-313.
[25] Leung, C.H., & Kan, W.K. (1996). A statistical learning approach to improving the accuracy of Chinese word segmentation. Literary and Linguistic Computing, 11, 87-92.
[26] Nie, J.Y., Jin, W., & Hannan, M.L. (1994). A hybrid approach to unknown word detection and segmentation of Chinese. Proceedings of International Conference on Chinese Computing (pp. 326-335). Singapore, 1994.
[27] Nie, J.Y., Simard, M., Isabelle, P., & Durand, R. (1999). Cross-language information retrieval based on parallel texts and automatic mining of parallel text from the web. Proceedings of the ACM SIGIR (pp. 74-81). Berkeley, CA, 1999.
[28] Oard, D.W., & Dorr, B.J. (1996). A survey of multilingual text retrieval. UMIACS-TR96-19 C-TR-3815.
[29] Oard, D.W. (1997). Alternative approaches for cross-language text retrieval. Proceedings of the 1997 AAAI Symposium on Cross-Language Text and Speech Retrieval (pp. 154-162). March, 1997.
[30] Resnik, P. (1998). Parallel STRANDS: a preliminary investigation into mining the web for bilingual text. Proceedings of the Third Conference of the Association for Machine Translation in the Americas: Machine Translation and the Information Soup (pp. 72-82). Langhorne, PA, October, 1998.
[31] Resnik, P. (1999). Mining the web for bilingual text. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (pp. 527-534). College Park, Maryland, June, 1999.
[32] Rose, M.G. (1981). Translation types and conventions. Translation spectrum: essays in theory and practice. State University of New York Press, pp. 31-33, Albany, New York.
[33] Radwan, K., & Fluhr, C. (1995). Textual database lexicon used as a filter to resolve semantic ambiguity application on multilingual information retrieval. Proceedings of Fourth Annual Symposium on Document Analysis and Information Retrieval (pp. 121-136). April, 1995.
[34] Salton, G. (1970). Automatic processing of foreign language documents. Journal of the American Society for Information Science, 21(3), 187-194.
[35] Schatz, B., & Chen, H. (1999). Digital libraries: technological advances and social impacts. IEEE Computer, Special Issue on Digital Libraries, 32(2), 45-50.
[36] Sproat, R., & Shih, C. (1990). A statistical method for finding word boundaries in Chinese text. Computer Processing of Chinese and Oriental Languages, 4, 336-351.
[37] Wu, Z., & Tseng, G. (1995). ACTS: An automatic Chinese text segmentation system for full text retrieval. Journal of the American Society for Information Science, 46, 83-96.
[38] Chen, H., Yen, J., & Yang, C.C. (1999). International activities: development of Asian digital libraries. IEEE Computer, Special Issue on Digital Libraries, 32(2), 48-49.
[39] Yang, C.C., Luk, J., & Yung, S. (2000a). Combination and boundary detection approach for Chinese indexing. Journal of the American Society for Information Science, Special Issue on Digital Libraries, 51(4), 340-351.
[40] Zanettin, F. (1998). Bilingual comparable corpora and the training of translators. META, Special Issue on The Corpus-Based Approach: A New Paradigm in Translation Studies, 43(4), 616-630.

Published In

Journal of the American Society for Information Science and Technology, Volume 54, Issue 7
May 2003
105 pages

Publisher

John Wiley & Sons, Inc.

United States

Qualifiers

  • Article

Cited By

  • (2007) Feature reinforcement approach to poly-lingual text categorization. Proceedings of the 10th international conference on Asian digital libraries: looking back 10 years and forging new frontiers, 99-108. DOI: 10.5555/1780653.1780676. Online publication date: 10-Dec-2007
  • (2007) Feature Reinforcement Approach to Poly-lingual Text Categorization. Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers, 99-108. DOI: 10.1007/978-3-540-77094-7_17. Online publication date: 10-Dec-2007
  • (2006) Automatic thesaurus development: Term extraction from title metadata. Journal of the American Society for Information Science and Technology, 57:7, 907-920. DOI: 10.5555/1133031.1133038. Online publication date: 1-May-2006
  • (2006) Exploiting the Web as the multilingual corpus for unknown query translation. Journal of the American Society for Information Science and Technology, 57:5, 660-670. DOI: 10.5555/1124136.1124145. Online publication date: 1-Mar-2006
  • (2006) Accommodating Individual Preferences in the Categorization of Documents. Journal of Management Information Systems, 23:2, 173-201. DOI: 10.2753/MIS0742-1222230208. Online publication date: 1-Oct-2006
  • (2005) A heuristic method based on a statistical approach for Chinese text segmentation. Journal of the American Society for Information Science and Technology, 56:13, 1438-1447. DOI: 10.1002/asi.20237. Online publication date: 1-Nov-2005
