Verifying a Chinese collection for text categorization

Authors:

William John Teahan

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval

Pages 556 - 557

https://doi.org/10.1145/1008992.1009118

Published: 25 July 2004

Abstract

This article describes the development of a free test collection for Chinese text categorization. A novel retrieval-based approach was developed to detect duplicates and label inconsistency in this corpus and in Reuters-21578 for comparison. The method was able to detect certain types of similar and/or duplicated documents that were overlooked by an alternative repetition-based method [1]. Experiments showed that effectiveness was not affected by the confusing documents.
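
The abstract does not specify the retrieval model or thresholds used, so the following is only a minimal sketch of how a retrieval-style check for duplicates and label inconsistency might look. It assumes character bigrams as index terms (which avoids Chinese word segmentation), TF-IDF weighting with cosine similarity, and a similarity threshold of 0.8. The function names (`char_bigrams`, `tfidf_vectors`, `find_suspects`) and all parameter values are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Split text into overlapping character bigrams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def tfidf_vectors(docs):
    """Build one L2-normalized TF-IDF vector (as a dict) per document."""
    counts = [Counter(char_bigrams(d)) for d in docs]
    df = Counter(term for c in counts for term in c)  # document frequency
    n = len(docs)
    vectors = []
    for c in counts:
        # Smoothed IDF; this weighting is an assumption, not the paper's.
        vec = {t: tf * (math.log((n + 1) / (df[t] + 1)) + 1.0)
               for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Dot product of two normalized sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def find_suspects(docs, labels, threshold=0.8):
    """Treat each document as a query and flag highly similar pairs.

    A pair with matching labels is reported as a likely duplicate; a pair
    with different labels is reported as a possible label inconsistency.
    """
    vectors = tfidf_vectors(docs)
    suspects = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            sim = cosine(vectors[i], vectors[j])
            if sim >= threshold:
                kind = "duplicate" if labels[i] == labels[j] else "label_conflict"
                suspects.append((i, j, kind, round(sim, 3)))
    return suspects
```

Calling `find_suspects(docs, labels)` on a list of document strings and a parallel list of category labels returns tuples of the form `(i, j, kind, similarity)`. The all-pairs loop is quadratic in the number of documents; for a collection of realistic size, an inverted index that retrieves only the top-ranked candidates per query would be the more practical design.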

References

[1] Dmitry V. Khmelev and William J. Teahan, "A Repetition Based Measure for Verification of Text Collections and for Text Categorization," ACM SIGIR, 2003, pp. 104--110.

[2] WebGenie 3.23, http://www.webgenie.com.tw

[3] Amit Singhal, Gerard Salton and Chris Buckley, "Length Normalization in Degraded Text Collections," Symp. on Document Analysis and Info. Retr., 1996, pp. 149--162.

[4] S. E. Robertson and S. Walker, "Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval," Proc. of ACM SIGIR, 1992, pp. 42--49.

Cited By

Tseng Y, Lin Y, Lee Y, Hung W, Lee C (2009). A comparison of methods for detecting hot topics. Scientometrics 81:1 (73-90). Online publication date: 18-Mar-2009.
https://doi.org/10.1007/s11192-009-1885-x

Index Terms

Verifying a Chinese collection for text categorization
1. Information systems
  1. Information storage systems
    1. Record storage systems

Recommendations

Accurate discovery of co-derivative documents via duplicate text detection

Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other, or some portion of both must be derived from a third document. An existing technique for concurrently detecting ...

RCV1: A New Benchmark Collection for Text Categorization Research

Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this data for research on text categorization requires a detailed understanding of ...

An Evaluation of Passage-Based Text Categorization

Research in text categorization has been confined to whole-document-level classification, probably due to a lack of full-text test collections. However, the full-length documents available today in large quantities raise renewed interest in text ...

Published In

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval

July 2004

624 pages

ISBN: 1581138814

DOI: 10.1145/1008992

General Chair: Mark Sanderson, University of Sheffield (UK)

Program Chairs: Kalervo Järvelin, University of Tampere (Finland); James Allan, University of Massachusetts (USA); Peter Bruza, Distributed Systems Technology Centre (Australia)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SIGIR04

SIGIR04: The 27th ACM/SIGIR International Symposium on Information Retrieval 2004

July 25 - 29, 2004

Sheffield, United Kingdom

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

