Accommodating Individual Preferences in the Categorization of Documents: A Personalized Clustering Approach

Published: 01 October 2006

    Abstract

    As electronic commerce and knowledge economy environments proliferate, both individuals and organizations increasingly generate and consume large amounts of online information, typically available as textual documents. To manage this ever-increasing volume of documents, individuals and organizations frequently organize their documents into categories that facilitate document management and subsequent access and browsing. Document clustering is an intentional act that should reflect individual preferences with regard to the semantic coherency and relevant categorization of documents. Hence, effective document clustering must consider individual preferences and needs to support personalization in document categorization. In this paper, we present an automatic document-clustering approach that incorporates an individual's partial clustering as preferential information. Combining two document representation methods, feature refinement and feature weighting, with two clustering methods, precluster-based hierarchical agglomerative clustering (HAC) and atomic-based HAC, we establish four personalized document-clustering techniques. Using a traditional content-based document-clustering technique as a performance benchmark, we find that the proposed personalized document-clustering techniques improve clustering effectiveness, as measured by cluster precision and cluster recall.
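
To make the idea in the abstract concrete, the sketch below seeds hierarchical agglomerative clustering with a user's partial clustering, treated as preclusters, and then scores the result with a pairwise form of cluster precision and cluster recall. It is a minimal sketch under stated assumptions, not the paper's algorithm: the term-frequency representation, cosine similarity, average-link merging, every function name, and the pairwise precision/recall definitions are hypothetical choices made for illustration only.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed document representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def average_link(c1, c2, vectors):
    """Mean pairwise cosine similarity between the members of two clusters."""
    sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)


def precluster_hac(docs, partial_clusters, k):
    """Start from the user's partial clusters plus singletons, then merge down to k."""
    vectors = [tf_vector(d) for d in docs]
    seeded = {i for c in partial_clusters for i in c}
    clusters = [set(c) for c in partial_clusters]
    clusters += [{i} for i in range(len(docs)) if i not in seeded]
    while len(clusters) > k:
        # Pick the most similar pair of clusters; a < b, so deleting b keeps a valid.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_link(clusters[p[0]], clusters[p[1]], vectors),
        )
        clusters[a] |= clusters[b]
        del clusters[b]
    return clusters


def pairs(clusters):
    """All unordered document pairs that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def cluster_precision_recall(generated, true):
    """Pairwise precision and recall of generated clusters against true clusters."""
    g, t = pairs(generated), pairs(true)
    correct = len(g & t)
    precision = correct / len(g) if g else 0.0
    recall = correct / len(t) if t else 0.0
    return precision, recall


# Toy data (invented for this example): documents 0-3 were filed by the user.
docs = [
    "stock market trading",
    "market shares trading",
    "football match goal",
    "goal scored football",
    "stock shares price",
    "match referee goal",
]
result = precluster_hac(docs, partial_clusters=[[0, 1], [2, 3]], k=2)
print(sorted(sorted(c) for c in result))
print(cluster_precision_recall(result, [[0, 1, 4], [2, 3, 5]]))
```

In this sketch the user's partial clusters only act as starting points, so two of them can still be merged if k is smaller than the number of preclusters; a stricter variant could forbid such merges.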

          Published In

          Journal of Management Information Systems, Volume 23, Issue 2 (October 2006), 290 pages

          Publisher

          M. E. Sharpe, Inc., United States

          Author Tags

          1. Cognitive Overload
          2. Document Clustering
          3. Hierarchical Agglomerative Clustering (HAC)
          4. Personalization
          5. Personalized Document Clustering
          6. Supervised Document Clustering
          7. Text Mining

          Qualifiers

          • Article

          Cited By

          • (2024) Human-in-the-loop latent space learning for biblio-record-based literature management. International Journal on Digital Libraries 25:1 (123-136). DOI: 10.1007/s00799-023-00389-8. Online publication date: 1-Mar-2024
          • (2022) Bibrecord-Based Literature Management with Interactive Latent Space Learning. From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries (155-171). DOI: 10.1007/978-3-031-21756-2_13. Online publication date: 30-Nov-2022
          • (2016) Constructing conceptual trajectory maps to trace the development of research fields. Journal of the Association for Information Science and Technology 67:8 (2016-2031). DOI: 10.1002/asi.23522. Online publication date: 1-Aug-2016
          • (2014) Analyzing firm-specific social media and market. Decision Support Systems 67:C (30-39). DOI: 10.1016/j.dss.2014.08.001. Online publication date: 1-Nov-2014
          • (2014) Exploiting temporal characteristics of features for effectively discovering event episodes from news corpora. Journal of the Association for Information Science and Technology 65:3 (621-634). DOI: 10.1002/asi.22995. Online publication date: 1-Mar-2014
          • (2012) A Data-Driven Approach to Measure Web Site Navigability. Journal of Management Information Systems 29:2 (173-212). DOI: 10.2753/MIS0742-1222290207. Online publication date: 1-Oct-2012
          • (2010) A knowledge-based model using ontologies for personalized web information gathering. Web Intelligence and Agent Systems 8:3 (235-254). DOI: 10.5555/1839537.1839538. Online publication date: 1-Aug-2010
          • (2009) Preserving User Preferences in Automated Document-Category Management. Journal of Management Information Systems 25:4 (109-144). DOI: 10.2753/MIS0742-1222250404. Online publication date: 1-Apr-2009
          • (2009) Discovering event episodes from news corpora. Proceedings of the 11th International Conference on Electronic Commerce (72-80). DOI: 10.1145/1593254.1593265. Online publication date: 12-Aug-2009
          • (2007) Managing Word Mismatch Problems in Information Retrieval. Journal of Management Information Systems 24:3 (269-295). DOI: 10.2753/MIS0742-1222240309. Online publication date: 1-Dec-2007
