
Hybrid Approach to Web Content Outlier Mining Without Query Vector

  • Conference paper
Data Warehousing and Knowledge Discovery (DaWaK 2005)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 3589))

Abstract

Mining outliers from large datasets is like finding needles in a haystack. Even more challenging is sifting through the dynamic, unstructured, and ever-growing web data for outliers. This paper presents HyCOQ, which is a hybrid algorithm that draws from the power of n-gram-based and word-based systems. Experimental results obtained using embedded motifs without a dictionary show significant improvement over using a domain dictionary irrespective of the type of data used (words, n-grams, or hybrid). Also, there is remarkable improvement in recall with hybrid documents compared to using raw words and n-grams without a domain dictionary.
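The abstract does not spell out how HyCOQ combines its two views of a document, so the following is only a minimal sketch of what a hybrid word-plus-n-gram document profile could look like. The function names (`word_tokens`, `char_ngrams`, `hybrid_profile`), the tokenization rule, and the default n-gram length are all illustrative assumptions and are not taken from the paper.

```python
import re
from collections import Counter


def word_tokens(text):
    """Lowercase the text and split it into alphanumeric word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def char_ngrams(word, n):
    """Return all contiguous character n-grams of a single word.

    Words shorter than n yield no n-grams.
    """
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def hybrid_profile(text, n=5):
    """Count both whole words and character n-grams in one profile.

    Keys are tagged ("w", word) or ("g", ngram) so the two feature
    types never collide. This is a hypothetical representation, not
    the HyCOQ algorithm itself.
    """
    profile = Counter()
    for word in word_tokens(text):
        profile[("w", word)] += 1
        for gram in char_ngrams(word, n):
            profile[("g", gram)] += 1
    return profile


if __name__ == "__main__":
    for feature, count in sorted(hybrid_profile("Web mining", n=3).items()):
        print(feature, count)
```

Tagging each key with its feature type keeps word counts and n-gram counts separable, so a downstream scoring step could weight them differently or compare words-only, n-grams-only, and hybrid profiles.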

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Agyemang, M., Barker, K., Alhajj, R. (2005). Hybrid Approach to Web Content Outlier Mining Without Query Vector. In: Tjoa, A.M., Trujillo, J. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2005. Lecture Notes in Computer Science, vol 3589. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11546849_28

  • DOI: https://doi.org/10.1007/11546849_28

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-28558-8

  • Online ISBN: 978-3-540-31732-6

  • eBook Packages: Computer Science, Computer Science (R0)
