Abstract
We present WebACE, an agent for exploring and categorizing documents on the World Wide Web based on a user profile. The heart of the agent is an unsupervised categorization of a set of documents, combined with a process for generating new queries that is used to search for new related documents and for filtering the resulting documents to extract the ones most closely related to the starting set. The document categories are not given a priori. We present the overall architecture and describe two novel algorithms which provide significant improvement over Hierarchical Agglomeration Clustering and AutoClass algorithms and form the basis for the query generation and search component of the agent. We report on the results of our experiments comparing these new algorithms with more traditional clustering algorithms and we show that our algorithms are fast and scalable.
Cite this article
Boley, D., Gini, M., Gross, R. et al. Document Categorization and Query Generation on the World Wide Web Using WebACE. Artificial Intelligence Review 13, 365–391 (1999). https://doi.org/10.1023/A:1006592405320