Ontology-Based Mapping for Automated Document Management: A Concept-Based Technique for Word Mismatch and Ambiguity Problems in Document Clustering

Published: 04 March 2015

Abstract

Document clustering is crucial to automated document management, especially given the fast-growing volume of textual documents available digitally. Traditional lexicon-based approaches analyze document content and measure the overlap between the feature vectors that represent different documents, so they cannot effectively address word mismatch or ambiguity problems. Alternative approaches based on query expansion and local context discovery have been developed, but they suffer from limited efficiency and effectiveness because the large number of expanded terms creates noise and increases the dimensionality and complexity of the overall feature space. Several techniques extend lexicon-based analysis by incorporating latent semantic indexing, but they produce less comprehensible clustering results and their performance is questionable. We instead propose a concept-based document representation and clustering (CDRC) technique and empirically examine its effectiveness using 433 articles concerning information systems and technology, randomly selected from a popular digital library. Our evaluation includes two widely used benchmark techniques and shows that CDRC outperforms them. Overall, our results reveal that clustering documents at an ontology-based concept level is more effective than techniques that use lexicon-based document features and can generate more comprehensible clustering results.
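
The word mismatch problem described in the abstract can be illustrated with a small sketch. This is not the CDRC technique itself, whose details are not given here; it is a toy comparison under stated assumptions. The term-to-concept table `TERM_TO_CONCEPT`, the two sample documents, and the helper names are all hypothetical. The sketch shows how two documents that share no words have zero overlap in a lexicon-based vector space, yet become near-identical once their terms are mapped to shared concepts.

```python
import math
from collections import Counter

# Hypothetical toy ontology mapping surface terms to concept labels.
# A real ontology would be far richer; this table is an assumption.
TERM_TO_CONCEPT = {
    "car": "VEHICLE",
    "automobile": "VEHICLE",
    "engine": "ENGINE",
    "motor": "ENGINE",
    "repair": "MAINTENANCE",
    "maintenance": "MAINTENANCE",
}


def term_vector(text):
    """Lexicon-based representation: raw term frequencies."""
    return Counter(text.lower().split())


def concept_vector(text):
    """Concept-based representation: terms replaced by their concept
    label when the toy ontology knows them, kept unchanged otherwise."""
    return Counter(TERM_TO_CONCEPT.get(t, t) for t in text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Two documents about the same topic that share no words.
doc_a = "car engine repair"
doc_b = "automobile motor maintenance"

lexical_sim = cosine(term_vector(doc_a), term_vector(doc_b))
concept_sim = cosine(concept_vector(doc_a), concept_vector(doc_b))

print("lexical similarity:", lexical_sim)
print("concept similarity:", concept_sim)
```

With these inputs the lexical similarity is zero because the vocabularies are disjoint, while the concept-level similarity is approximately one. A clustering algorithm operating on the second representation would therefore be able to group the two documents together, which is the intuition behind concept-level clustering.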

    Published In

    ACM Transactions on Management Information Systems  Volume 6, Issue 1
    April 2015
    111 pages
    ISSN:2158-656X
    EISSN:2158-6578
    DOI:10.1145/2742819

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 04 March 2015
    Accepted: 01 November 2014
    Revised: 01 November 2014
    Received: 01 March 2014
    Published in TMIS Volume 6, Issue 1

    Author Tags

    1. document-category management
    2. document clustering
    3. knowledge management
    4. ontology-supported document clustering

    Qualifiers

    • Research-article
    • Research
    • Refereed
