Hybrid Approach to Web Content Outlier Mining Without Query Vector

Agyemang, Malik; Barker, Ken; Alhajj, Reda

doi:10.1007/11546849_28

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3589)

Included in the following conference series:

International Conference on Data Warehousing and Knowledge Discovery

Abstract

Mining outliers from large datasets is like finding needles in a haystack. Even more challenging is sifting through dynamic, unstructured, and ever-growing web data for outliers. This paper presents HyCOQ, a hybrid algorithm that draws on the strengths of n-gram-based and word-based systems. Experimental results obtained using embedded motifs without a dictionary show significant improvement over using a domain dictionary, irrespective of the type of data used (words, n-grams, or hybrid). There is also a remarkable improvement in recall with hybrid documents compared to using raw words and n-grams without a domain dictionary.
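
The abstract does not spell out how HyCOQ combines the two representations, so the following is only a minimal sketch of the general idea of a hybrid document profile that counts both whole words and character n-grams. It is not the authors' algorithm. The function names `char_ngrams` and `hybrid_profile`, the underscore padding, and the default n-gram length of 3 are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of each word in text.

    Each word is lowercased and padded with one underscore on each side
    (an illustrative choice, not taken from the paper).
    """
    grams = []
    for word in text.lower().split():
        padded = f"_{word}_"
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


def hybrid_profile(text, n=3):
    """Count whole words and character n-grams in a single profile.

    Keys are tagged ("w", word) or ("g", gram) so that a word and an
    n-gram with the same spelling are counted separately.
    """
    profile = Counter(("w", w) for w in text.lower().split())
    profile.update(("g", g) for g in char_ngrams(text, n))
    return profile


if __name__ == "__main__":
    # Hypothetical input text, used only to show the shape of the output.
    for key, count in hybrid_profile("web content outlier mining").most_common(5):
        print(key, count)
```

Tagging the keys keeps the word-based and n-gram-based features distinguishable within one profile, which makes it easy to weight or compare them separately later.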

Author information

Authors and Affiliations

Department of Computer Science, University of Calgary, Calgary, Alberta, Canada
Malik Agyemang, Ken Barker & Reda Alhajj
Department of Computer Science, Global University, Beirut, Lebanon
Reda Alhajj

Editor information

Editors and Affiliations

Institute of Software Technology and Interactive Systems, Vienna University of Technology, Favoritenstr. 9-11/188, A-1040, Wien, Austria
A Min Tjoa
Department of Software and Computing Systems, University of Alicante, Spain
Juan Trujillo

About this paper

Cite this paper

Agyemang, M., Barker, K., Alhajj, R. (2005). Hybrid Approach to Web Content Outlier Mining Without Query Vector. In: Tjoa, A.M., Trujillo, J. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2005. Lecture Notes in Computer Science, vol 3589. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11546849_28

DOI: https://doi.org/10.1007/11546849_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28558-8
Online ISBN: 978-3-540-31732-6
eBook Packages: Computer Science, Computer Science (R0)
