DOI: 10.1145/1367497.1367508
research-article

FloatCascade Learning for Fast Imbalanced Web Mining

Published: 21 April 2008

Abstract

This paper addresses the problem of Imbalanced Classification (IC) in web mining, which often arises on the web because of the "Matthew Effect". Because web IC applications usually need to provide online service to users and handle large volumes of data, classification speed is an important issue. In face detection, Asymmetric Cascade speeds up imbalanced classification by building a cascade structure of simple classifiers, but it often loses classification accuracy because of the iterative feature addition in its learning procedure. In this paper, we adopt the idea of the cascade classifier for fast classification in imbalanced web mining and propose a novel asymmetric cascade learning method, called FloatCascade, to improve accuracy. To this end, FloatCascade selects fewer yet more effective features at each stage of the cascade classifier. In addition, a decision-tree scheme is adopted to enhance feature diversity and discrimination capability for FloatCascade learning. We evaluate FloatCascade on two typical IC applications in web mining: web page categorization and citation matching. Experimental results demonstrate the effectiveness and efficiency of FloatCascade compared with state-of-the-art IC methods such as Asymmetric Cascade, Asymmetric AdaBoost, and Weighted SVM.
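
The speed benefit of a cascade of simple classifiers, as described in the abstract, comes from rejecting most negative examples at an early, cheap stage. The following is a minimal sketch of cascade inference only, not the FloatCascade learning procedure. The name `cascade_predict`, the `Stage` type, and the toy stage functions and thresholds are all hypothetical and chosen for illustration; the stage classifiers actually learned in the paper are not shown on this page.

```python
from typing import Callable, List, Sequence, Tuple

# A stage is a (score function, rejection threshold) pair.
# Both are hypothetical placeholders, not the paper's learned classifiers.
Stage = Tuple[Callable[[Sequence[float]], float], float]


def cascade_predict(stages: List[Stage], x: Sequence[float]) -> Tuple[bool, int]:
    """Return (is_positive, stages_evaluated) for one example."""
    for evaluated, (score, threshold) in enumerate(stages, start=1):
        if score(x) < threshold:
            # Early exit: the example is rejected and later stages are skipped.
            return False, evaluated
    # Only examples that pass every stage are labeled positive.
    return True, len(stages)


# Toy stages with hand-picked weights (illustrative only, not learned).
stages: List[Stage] = [
    (lambda x: x[0], 0.5),                         # stage 1: one feature
    (lambda x: 0.6 * x[0] + 0.4 * x[1], 0.6),      # stage 2: two features
    (lambda x: (x[0] + x[1] + x[2]) / 3.0, 0.7),   # stage 3: three features
]

print(cascade_predict(stages, [0.1, 0.9, 0.9]))  # rejected at stage 1
print(cascade_predict(stages, [0.6, 0.1, 0.9]))  # rejected at stage 2
print(cascade_predict(stages, [0.9, 0.8, 0.9]))  # passes all three stages
```

In an imbalanced setting where negatives dominate, the average cost per example is close to the cost of the first stage, which is why using fewer features per stage, as the abstract proposes, would translate directly into faster classification.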

Cited By

  • (2010) CasJoin. Proceedings of the 19th ACM international conference on Information and knowledge management, pp. 1725-1728. DOI: 10.1145/1871437.1871714. Online publication date: 26-Oct-2010
  • (2008) General Framework for Text Classification Based on Domain Ontology. Proceedings of the 2008 Third International Workshop on Semantic Media Adaptation and Personalization, pp. 147-152. DOI: 10.1109/SMAP.2008.17. Online publication date: 15-Dec-2008

      Published In

      WWW '08: Proceedings of the 17th international conference on World Wide Web
      April 2008
      1326 pages
      ISBN: 9781605580852
      DOI: 10.1145/1367497
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. cascade learning
      2. citation matching
      3. fast imbalanced classification
      4. float searching
      5. web page categorization

      Qualifiers

      • Research-article

      Conference

      WWW '08

      Acceptance Rates

      Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
