UNIT 4: Mining Object, Spatial, Multimedia, Text, and Web Data
10. Mining Object, Spatial, Multimedia, Text, and Web Data
Data Mining
Spatial database
Measures
numerical
Spatial: collection of spatial pointers (0-5 degree region)
Input
Output
Goals
Challenge
Spatial Merge
Example
Progressive Refinement
Spatial Classification
Constraints-based clustering
[Figure: Spatial data with obstacles (River, Bridge, Mountain) and clusters C1, C2, C3, C4; panel: Clustering without taking obstacles into consideration]
Find all of the images whose features are similar to those of a given image
Classification
Decision tree
Feature extraction
Example
Information retrieval
Keyword-Based Retrieval
Major difficulties
Similarity-Based Retrieval
Basic techniques
Similarity metrics
Cosine measure: sim(v1, v2) = (v1 . v2) / (|v1| |v2|)
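A common similarity metric between two term vectors is the cosine measure, whose denominator is the product of the vector lengths |v1| |v2|. The following is a minimal sketch, assuming documents and queries are already represented as equal-length lists of term weights; the function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made here for illustration.

```python
import math


def cosine_similarity(v1, v2):
    """Return (v1 . v2) / (|v1| |v2|) for two equal-length term vectors."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        # Assumption: an all-zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm1 * norm2)
```

Vectors that point in the same direction score 1, and vectors with no terms in common score 0, regardless of document length.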
TF-IDF Weighting
TF (Term Frequency)
TF-IDF weighting
weight(t, d) = TF(t, d) * IDF(t)
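The weighting weight(t, d) = TF(t, d) * IDF(t) can be sketched as below. The slides do not say which TF and IDF variants are used, so this sketch assumes TF is the raw term count and IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The function name `tf_idf_weights` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # assumption: TF is the raw count in the document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy collection, for illustration only.
docs = [
    ["universe", "star", "universe"],
    ["rocket", "launch"],
    ["universe", "rocket"],
]
w = tf_idf_weights(docs)
```

Under these assumptions a term that occurs in every document gets weight 0, since log(N / N) = 0, while a term that is frequent in one document but rare in the collection gets a high weight.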
Example
      universe  rocket
D1    1         0
D2    0         1
D3    1         0
D4    0         0
D5    0         0
D6    0         0
SVD
Reduce dimension
SVD
[Figure: worked SVD example. Matrix panels labeled S, US, V^T, and (US)(US)^T; the numeric entries are not recoverable from the extracted text]
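The SVD-based dimension reduction can be summarized with the standard identities below. This is a sketch of the general technique, not a reconstruction of the example's numbers; the symbols A (the term-document matrix), k (the number of dimensions kept), and the subscript-k notation are assumptions introduced here, and the slides do not make clear whether the rows of A are terms or documents.

```latex
\begin{align}
A &= U S V^{T}, \qquad V^{T} V = I \;\Longrightarrow\; A V = U S \\
A_k &= U_k S_k V_k^{T} \qquad \text{(keep only the $k$ largest singular values)} \\
\text{row similarities} &\approx (U_k S_k)(U_k S_k)^{T}
\end{align}
```

The rows of U_k S_k are the coordinates of the rows of A in the reduced k-dimensional space, so their pairwise inner products approximate the similarities between the original rows.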
Automatic Document Classification
Motivation
A classification problem
Methods
Contents information
Hyperlink information
Usage information
Challenges
Index-based
Search the Web, collect Web pages, index Web pages, and
build and store huge keyword-based indices
Locate sets of Web pages containing certain keywords
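The index-based approach can be illustrated with a small inverted index that maps each keyword to the set of pages containing it. This is a minimal sketch, assuming pages are already collected as plain text and that a query must match every keyword; `build_index`, `search`, and the sample pages are hypothetical names and data, and real search engines add tokenization, ranking, and compression that are omitted here.

```python
from collections import defaultdict


def build_index(pages):
    """pages: {url: text}. Returns {keyword: set of urls containing it}."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, keywords):
    """Return the set of pages that contain every keyword."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()


# Hypothetical pages, for illustration only.
pages = {
    "a.example": "data mining on the web",
    "b.example": "web search engines",
    "c.example": "spatial data mining",
}
index = build_index(pages)
```

For instance, `search(index, ["data", "mining"])` returns the two pages that contain both words.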
Deficiencies
Example
Methods
Problems
Retrieving pages that are not only relevant, but also of high
quality, or authoritative on the topic
Hub
HITS (Hyperlink-Induced Topic Search)
Method
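1. Start from a root set of pages for the search topic
2. Expand the root set into a base set: include all of the pages that the root-set pages link to, and all of the pages that link to a page in the root set
3. Apply weight-propagation to estimate hub and authority weights
4. Output the pages with large hub weights and large authority weights for the given search topic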
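The weight-propagation step of HITS can be sketched as the iteration below: a page's authority weight is the sum of the hub weights of the pages that link to it, and its hub weight is the sum of the authority weights of the pages it links to. This is a minimal sketch, assuming the base set is given as a dictionary of outgoing links; the function names, the fixed iteration count, and the unit-length normalization are choices made here for illustration.

```python
import math


def normalize(weights):
    """Scale a weight dict to unit Euclidean length (left as-is if all zero)."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0:
        return dict(weights)
    return {p: v / norm for p, v in weights.items()}


def hits(links, iterations=50):
    """links: {page: list of pages it links to}, restricted to the base set.
    Returns (hub, authority) weight dicts."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages pointing to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub weight: sum of authority weights of the pages it points to.
        new_hub = {p: sum(new_auth[d] for d in links.get(p, [])) for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical base set: h1 and h2 act as hubs, a1 and a2 as authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(links)
```

In this toy graph `a1` ends up with the largest authority weight because both hubs link to it, and `h1` ends up with the largest hub weight because it links to both authorities.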
Clever, Google
Applications