
UNIT 4 Mining Object Spatial Multimedia Text and Web Data

The document discusses mining various types of complex data including spatial, image, text, and web data. It provides examples and methods for mining each type of data such as spatial clustering analysis, image retrieval, text classification using keywords or similarity, and determining authoritative web pages based on hyperlink analysis. Mining complex data types requires specialized techniques compared to traditional data mining due to characteristics of the different data.

Uploaded by

gurjeetkaur1991

Chap. 10
Mining Object, Spatial, Multimedia, Text, and Web Data
Data Mining

Mining Complex Types of Data

Mining spatial data

Mining image data

Mining text data

Mining the Web

Mining Spatial Databases

Spatial database
  Space-related data: maps, VLSI layouts, ...
  Topological, distance information organized by spatial indexing structures

Spatial data warehousing
  Issue: different representations & structures
  Dimensions
    Nonspatial: 25-30 degree -> hot
    Spatial-to-nonspatial: New York -> western provinces
    Spatial-to-spatial: equi. temp region -> 0-5 degree region
  Measures
    Numerical
    Spatial: collection of spatial pointers (0-5 degree region)

Example: BC Weather Pattern Analysis

Input
  A map with about 3,000 weather probes scattered in B.C.
  Daily data for temperature, wind velocity, etc.
  Concept hierarchies for all attributes

Output
  A map that reveals patterns: merged (similar) regions

Goals
  Interactive analysis (drill-down, slice, dice, pivot, roll-up)
  Fast response time, minimizing storage space used

Challenge
  A merged region may contain hundreds of primitive regions (polygons)

Spatial Merge

Precomputing: too much storage space
On-line merge: very expensive

Spatial Association Analysis

Spatial association rule: A => B [s%, c%]
  A and B are sets of spatial or nonspatial predicates
    Topological relations: intersects, overlaps, disjoint, etc.
    Spatial orientations: left_of, west_of, under, etc.
    Distance information: close_to, within_distance, etc.
  s% is the support and c% is the confidence of the rule

Example
  is_a(x, school) ^ close_to(x, sports_center) => close_to(x, park) [7%, 85%]
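
To make the [s%, c%] annotation concrete, the following minimal sketch computes support and confidence for a rule over a toy set of objects. The objects and the flattened predicate names (such as `close_to_park`) are hypothetical, and each object is assumed to be described simply by the set of predicates it satisfies.

```python
# Toy data (hypothetical): each spatial object is the set of predicates it satisfies.
objects = [
    {"is_a_school", "close_to_sports_center", "close_to_park"},
    {"is_a_school", "close_to_sports_center"},
    {"is_a_school", "close_to_park"},
    {"close_to_park"},
    {"is_a_school", "close_to_sports_center", "close_to_park"},
]

def support_confidence(objects, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    # Objects satisfying every predicate in A.
    n_a = sum(1 for o in objects if antecedent <= o)
    # Objects satisfying every predicate in both A and B.
    n_ab = sum(1 for o in objects if (antecedent | consequent) <= o)
    support = n_ab / len(objects)
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence

s, c = support_confidence(
    objects,
    antecedent={"is_a_school", "close_to_sports_center"},
    consequent={"close_to_park"},
)
print(f"support={s:.0%}, confidence={c:.0%}")
```

Support is measured against all objects, while confidence is measured only against the objects that satisfy the antecedent.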

Progressive Refinement

First search for a rough relationship (e.g. g_close_to for close_to, touch, intersect) using a rough evaluation (e.g. MBR, the minimum bounding rectangle)
Then apply the detailed test only to those objects which have passed the rough test
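
The two-step idea can be sketched as a filter-and-refine search. This is an illustration under simplifying assumptions, not the method of any particular system: each object is modelled as a set of points, the rough test compares minimum bounding rectangles (MBRs), and the detailed test is the exact minimum point-to-point distance. The object names and coordinates are made up.

```python
import math

def mbr(points):
    """Minimum bounding rectangle as (xmin, ymin, xmax, ymax)."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)

def mbr_distance(a, b):
    """Distance between two rectangles; never larger than the exact distance."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    dx = max(bx1 - ax2, ax1 - bx2, 0)
    dy = max(by1 - ay2, ay1 - by2, 0)
    return math.hypot(dx, dy)

def exact_distance(p, q):
    """Expensive test: minimum distance over all point pairs."""
    return min(math.dist(a, b) for a in p for b in q)

def close_pairs(objects, threshold):
    names = sorted(objects)
    boxes = {n: mbr(objects[n]) for n in names}
    # Step 1: rough test on MBRs (cheap, may keep false positives).
    candidates = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
                  if mbr_distance(boxes[a], boxes[b]) <= threshold]
    # Step 2: detailed test only on the pairs that passed step 1.
    refined = [(a, b) for a, b in candidates
               if exact_distance(objects[a], objects[b]) <= threshold]
    return candidates, refined

objects = {
    "school": [(0, 0), (1, 1)],
    "park": [(2, 1), (3, 2)],
    "mall": [(1, 3), (1, 6), (4, 6)],
}
candidates, refined = close_pairs(objects, threshold=1.5)
print(candidates)
print(refined)
```

Because the MBR distance is a lower bound on the exact distance, the rough step can discard pairs safely; it only lets through extra candidates that the detailed step then removes.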

Spatial Classification

Spatial classification
  Analyze spatial objects to derive classification schemes, such as decision trees, in relevance to spatial properties

Example
  Classify regions into rich vs. poor
  Properties: containing university, containing highway, near ocean, etc.

Spatial Cluster Analysis

Constraints-based clustering
  Selection of relevant objects before clustering
  Parameters as constraints
    K-means, density-based: radius, min points
  Clustering with obstructed distance

[Figure: spatial data with obstacles (River, Bridge, Mountain) and the clusters C1-C4 obtained by clustering without taking obstacles into consideration]
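
A rough sketch of what an obstructed distance could look like is given below. The geometry is an assumption made purely for illustration: the river is the vertical line x = 0, and a single bridge at a made-up location is the only place where it can be crossed.

```python
import math

RIVER_X = 0.0        # hypothetical river: the vertical line x = 0
BRIDGE = (0.0, 5.0)  # hypothetical single crossing point on the river

def obstructed_distance(p, q):
    """Length of the shortest path from p to q that crosses the river only at the bridge."""
    same_side = (p[0] - RIVER_X) * (q[0] - RIVER_X) > 0
    if same_side:
        # No obstacle in between: plain Euclidean distance.
        return math.dist(p, q)
    # Opposite banks: the path has to detour through the bridge.
    return math.dist(p, BRIDGE) + math.dist(BRIDGE, q)

print(math.dist((-1, 0), (1, 0)))            # straight-line distance
print(obstructed_distance((-1, 0), (1, 0)))  # much longer once the river is respected
```

Two points that look close in Euclidean terms can be far apart under the obstructed distance, which is why clustering that ignores obstacles may merge points on opposite banks into one cluster.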

Mining Image Data - Retrieval

Description-based retrieval systems
  Retrieval based on image descriptions, such as keywords, captions, size, etc.
  Labor-intensive, poor quality

Content-based retrieval systems
  Retrieval based on the image content (features), such as color histogram, texture, shape, and wavelet transforms
  Sample-based queries
    Find all images whose features are similar to those of a given image
  Feature specification queries
    Specify or sketch image features like color, texture, or shape, which are translated into a feature vector
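
As an illustration of a sample-based query, the sketch below turns each image into a coarse color histogram (the feature vector) and compares histograms by histogram intersection. The choice of 4 bins per channel, the intersection measure, and the tiny pixel lists are assumptions for the example; real systems use richer features.

```python
def color_histogram(pixels, bins=4):
    """Quantize each RGB channel into `bins` levels and return a normalized histogram."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1
    total = len(pixels)  # assumes a non-empty image
    return [h / total for h in hist]

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: 1 means identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Hypothetical images given as short lists of (R, G, B) pixels.
sky = [(30, 60, 220)] * 3 + [(250, 250, 250)]
sea = [(20, 50, 200)] * 4
forest = [(20, 160, 30)] * 4

query = color_histogram(sky)
print(histogram_intersection(query, color_histogram(sea)))
print(histogram_intersection(query, color_histogram(forest)))
```

Ranking a collection by this score and returning the top matches gives a simple form of content-based retrieval.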

Mining Image Data - Retrieval

Combining searches
  Search for blue sky
    (top layout grid is blue)
  Search for airplane in blue sky
    (top layout grid is blue and keyword = airplane)

Classification of Image Data

Classification
  Decision tree
  Based on descriptive features
  Based on content features

Feature extraction
  Extract features for classification from raw image
  Various image analysis techniques are required
    Data transformation, edge detection, etc.

Example
  Classify sky images to recognize galaxies, stars, etc.
  By using properties obtained from image analysis


Mining Text Databases

Text databases (document databases)
  Large collections of documents from various sources
    News articles, research papers, books, e-mail messages, and Web pages
  Data stored is usually semi-structured
  Traditional information retrieval techniques become inadequate for the increasingly vast amounts of text data

Information retrieval
  Information is organized into documents
  Information retrieval problem
    Locating relevant documents based on user input, such as keywords or example documents

Basic Measures for IR

Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., correct responses)

  $\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$

Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved

  $\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$
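
Both measures reduce to set operations, as this small sketch shows. The document identifiers are invented for the example.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two sets of document ids."""
    hits = relevant & retrieved  # {Relevant} intersected with {Retrieved}
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"d1", "d2", "d3", "d4"}   # hypothetical ground truth
retrieved = {"d2", "d3", "d5"}        # hypothetical search result
p, r = precision_recall(relevant, retrieved)
print(f"precision={p:.2f}, recall={r:.2f}")
```

Here two of the three retrieved documents are relevant, and two of the four relevant documents were found.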

Keyword-Based Retrieval

A document is represented by a set of keywords

Queries may use expressions of keywords
  (Car and accessory), (C++ or Java)

Retrieval by keyword matching

Major difficulties
  Synonymy: same meaning but different word
    Ex> Q: software, Doc: about programming, but does not have the keyword
  Polysemy: same word but different meaning
    Ex> Q: mining, Doc: about gold mining, but has the keyword
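
The following sketch shows keyword matching with AND/OR expressions and reproduces both difficulties on a toy collection. The documents and their keyword sets are hypothetical.

```python
# Hypothetical collection: document name -> set of keywords.
docs = {
    "doc1": {"software", "java", "compiler"},
    "doc2": {"programming", "c++", "compiler"},
    "doc3": {"gold", "mining", "ore"},
    "doc4": {"data", "mining", "software"},
}

def match_all(docs, *keywords):
    """AND query: documents containing every keyword."""
    return sorted(name for name, kw in docs.items() if set(keywords) <= kw)

def match_any(docs, *keywords):
    """OR query: documents containing at least one keyword."""
    return sorted(name for name, kw in docs.items() if set(keywords) & kw)

print(match_any(docs, "c++", "java"))  # (C++ or Java)
print(match_all(docs, "software"))     # synonymy: doc2 is about programming but is missed
print(match_all(docs, "mining"))       # polysemy: doc3 (gold mining) is returned as well
```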

Similarity-Based Retrieval

A document is represented as a keyword vector

Retrieval by similarity computing

Basic techniques
  Stop list: set of words that are frequent but irrelevant
    Ex> a, the, of, for, with, ...
  Stemming: use a common word stem
    Ex> drug, drugs, drugged -> drug
  Weighting: count frequency
    Term frequency, inverse document frequency, ...

Similarity metrics
  Measure the closeness of a document to a query
  Cosine similarity: $\text{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1|\,|v_2|}$
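
A minimal sketch of similarity-based retrieval follows: text is turned into a keyword-count vector after removing stop words, and a query is compared with documents by cosine similarity. The stop list is the one from the example above; stemming is left out for brevity, and the sample texts are made up.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "the", "of", "for", "with"}

def keyword_vector(text):
    """Sparse keyword vector: term -> count, with stop words removed."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)

def cosine_similarity(v1, v2):
    """sim(v1, v2) = (v1 . v2) / (|v1| |v2|)."""
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

query = keyword_vector("mining of data")
doc_a = keyword_vector("the mining of web data")
doc_b = keyword_vector("a car with accessory")
print(cosine_similarity(query, doc_a))
print(cosine_similarity(query, doc_b))
```

Unlike strict keyword matching, the score is graded, so documents can be ranked by how close they are to the query.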

TF-IDF Weighting

TF (Term Frequency)
  TF = f(t, d): how many times term t appears in doc d
  More frequent -> more relevant to topic
  Normalization
    Document length varies, so relative frequency is preferred

IDF (Inverse Document Frequency)
  IDF = 1 + log(n / k): decreases with the number of documents in which term t appears
    n: total number of docs
    k: # docs with term t appearing (the document frequency)
  Less frequent among documents -> more discriminative

TF-IDF weighting
  weight(t, d) = TF(t, d) * IDF(t)
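
The weighting can be worked through on a tiny collection. This sketch uses relative frequency for TF and IDF = 1 + log(n / k) as defined above; the natural logarithm is an assumption, since the base is not stated, and the three documents are invented.

```python
import math
from collections import Counter

# Hypothetical, already tokenized documents.
docs = {
    "d1": "data mining finds patterns in data".split(),
    "d2": "web mining uses hyperlink data".split(),
    "d3": "gold mining needs heavy trucks".split(),
}

def tf(term, tokens):
    """Relative frequency of term in one document (normalizes for length)."""
    return Counter(tokens)[term] / len(tokens)

def idf(term, docs):
    """IDF = 1 + log(n / k); natural log assumed. Returns 0.0 for unseen terms."""
    n = len(docs)
    k = sum(1 for tokens in docs.values() if term in tokens)
    return 1 + math.log(n / k) if k else 0.0

def weight(term, doc_id, docs):
    return tf(term, docs[doc_id]) * idf(term, docs)

for term in ("mining", "data", "patterns"):
    print(term, round(weight(term, "d1", docs), 3))
```

"mining" occurs in every document, so its IDF is the minimum value of 1, while rarer terms receive a larger weight for the same frequency.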

Latent Semantic Indexing

Reduce the dimension of keyword matrix
  To resolve the synonym problem and the size problem
  Use a singular value decomposition (SVD) technique

Example

        universe  rocket  moon  car  truck
  D1       1        0      1     1     0
  D2       0        1      1     0     0
  D3       1        0      0     0     0
  D4       0        0      0     1     1
  D5       0        0      0     1     0
  D6       0        0      0     0     1

SVD

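Singular Value Decomposition
  Decompose the matrix $A_{m \times n}$
    $A_{m \times n} = U_{m \times m} S_{m \times n} (V_{n \times n})^T$

Reduce dimension
  Select the largest k singular values
    $A_{m \times n} \approx U_{m \times k} S_{k \times k} (V_{n \times k})^T$
  Projection of A into k dimensions
    $A_{m \times n} V_{n \times k} = U_{m \times k} S_{k \times k}$
  Computing similarity
    $A A^T = U S V^T (U S V^T)^T = U S V^T V S^T U^T = (US)(US)^T$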

SVD

[Figure: numeric example for the document-term matrix above, showing the factors U, S and V^T with S = diag(2.16, 1.59, 1.28, 1.00, 0.39), the reduced projection AV = US, and the document similarity matrix (US)(US)^T]
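
The projection step can be reproduced in a few lines for the six-document example above. To stay dependency-free, this sketch does not call an SVD routine; it approximates the top k right singular vectors by power iteration with deflation on A^T A, which is a substitute technique chosen here for illustration. The function names and the iteration count are assumptions.

```python
import math

# Rows are documents D1..D6; columns are universe, rocket, moon, car, truck.
A = [
    [1, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def top_right_singular_vectors(A, k, iterations=500):
    """Approximate the k largest singular values and right singular vectors of A."""
    n = len(A[0])
    # G = A^T A; its eigenvectors are the right singular vectors of A.
    G = [[sum(row[i] * row[j] for row in A) for j in range(n)] for i in range(n)]
    singular_values, vectors = [], []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(n)]  # arbitrary positive start vector
        for _ in range(iterations):
            w = [dot(row, v) for row in G]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        eigenvalue = dot(v, [dot(row, v) for row in G])
        singular_values.append(math.sqrt(eigenvalue))
        vectors.append(v)
        # Deflation: remove the component just found before the next round.
        G = [[G[i][j] - eigenvalue * v[i] * v[j] for j in range(n)] for i in range(n)]
    return singular_values, vectors

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

singular_values, V_k = top_right_singular_vectors(A, k=2)
docs_2d = [[dot(row, v) for v in V_k] for row in A]  # A V_k = U_k S_k

print([round(s, 2) for s in singular_values])
print("D2 . D3 in keyword space:", dot(A[1], A[2]))
print("cos(D2, D3) in 2-D space:", round(cosine(docs_2d[1], docs_2d[2]), 2))
```

D2 and D3 share no keyword, so their raw dot product is zero, yet they end up close together in the reduced space because "moon" co-occurs with both "rocket" and "universe". This is how the reduction addresses the synonym problem.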

Automatic Document Classification

Motivation
  Automatic classification for the tremendous number of on-line text documents (Web pages, e-mails, etc.)

A classification problem
  Training set: human experts generate a training data set
  Classification (learning): the system discovers the classification rules

Methods
  Extract keywords and weights from documents
    Documents are represented as (keyword, weight) pairs
  Classify training documents into classes
  Apply classification algorithm
    Decision tree, Bayesian, neural network, etc.
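
As one possible instance of the Bayesian option, here is a small multinomial naive Bayes sketch with add-one smoothing. It uses raw keyword counts as the weights, and the four labeled training documents stand in for the expert-generated training set; all of these are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Collect per-class document counts and keyword counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(tokens, model):
    """Return the class with the highest log-posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for t in tokens:
            # Add-one (Laplace) smoothing avoids log(0) for unseen keywords.
            score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training set.
training = [
    (["stock", "market", "price"], "finance"),
    (["bank", "stock", "loan"], "finance"),
    (["match", "goal", "team"], "sports"),
    (["team", "coach", "season"], "sports"),
]
model = train(training)
print(classify(["stock", "price", "bank"], model))
print(classify(["team", "goal"], model))
```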

Mining the World-Wide Web

WWW provides rich sources for data mining

Contents information
Hyperlink information
Usage information

Challenges

Too huge for effective data warehousing and data mining


Too complex and heterogeneous
Growing and changing very rapidly

Web Search Engines

Index-based
  Search the Web, collect Web pages, index Web pages, and build and store huge keyword-based indices
  Locate sets of Web pages containing certain keywords

Deficiencies
  A topic of any breadth may easily contain hundreds of thousands of documents
  Many documents that are highly relevant to a topic may not contain keywords defining them (synonymy, polysemy)
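
A keyword-based index of the kind described can be sketched as an inverted index that maps each word to the pages containing it. The page names and texts below are hypothetical, and crawling is replaced by a fixed dictionary.

```python
from collections import defaultdict

# Hypothetical crawled pages: name -> text.
pages = {
    "page1": "data mining on the web",
    "page2": "gold mining equipment",
    "page3": "web search engines index web pages",
}

def build_index(pages):
    """Inverted index: keyword -> set of pages containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, *keywords):
    """Pages containing all of the given keywords."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    return sorted(set.intersection(*postings)) if postings else []

index = build_index(pages)
print(search(index, "web", "mining"))
print(search(index, "mining"))
```

The second query returns the gold mining page too, which illustrates the polysemy deficiency listed above.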

Web Contents Mining: Classification

Web page/site classification
  Assign a class label to each web page from a set of predefined topic categories
  Based on a set of examples of preclassified documents

Example
  Use Yahoo!'s taxonomy and its associated documents as training and test sets
  Derive a Web document classification model
  Use the model to classify new Web documents by assigning categories from the same taxonomy

Methods
  Keyword-based classification, use of hyperlink information, statistical models, ...

Web Structure Mining

Finding authoritative Web pages
  Retrieving pages that are not only relevant, but also of high quality, or authoritative on the topic

Hyperlinks can infer the notion of authority
  A hyperlink pointing to another Web page can be considered the author's endorsement of the other page

Problems
  Not every hyperlink represents an endorsement
  One authority will seldom point to its rival authority
  Authoritative pages are seldom particularly descriptive

Hub
  Set of Web pages that provides collections of links to authorities

HITS (Hyperlink-Induced Topic Search)

Method
  1. Use an index-based search engine to form the root set
  2. Expand the root set into a base set
     Include all of the pages that the root-set pages link to, and all of the pages that link to a page in the root set
  3. Apply weight-propagation
     Determines numerical estimates of hub and authority weights
  4. Output a list of the pages
     Large hub weights, large authority weights for the given search topic
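
The weight-propagation step (step 3) can be sketched as repeated mutual updates of hub and authority scores on the base set. The tiny link graph, the fixed iteration count, and the Euclidean normalization below are assumptions for illustration; building the root and base sets is not shown.

```python
import math

def hits(links, iterations=50):
    """links: page -> list of pages it points to. Returns (hub, authority) scores."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages pointing to it.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        # Hub weight: sum of authority weights of the pages it points to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical base set: h1..h3 are link collections, a and b are content pages.
links = {"h1": ["a", "b"], "h2": ["a", "b"], "h3": ["a"], "a": [], "b": []}
hub, auth = hits(links)
print(sorted(auth, key=auth.get, reverse=True)[:2])  # pages with large authority weights
print(sorted(hub, key=hub.get, reverse=True)[:2])    # pages with large hub weights
```

Good hubs point to good authorities and good authorities are pointed to by good hubs, so the two score sets reinforce each other across iterations.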

Systems based on link analysis
  Clever (based on HITS), Google (based on a similar principle, PageRank)
  Achieve better quality search results than AltaVista, Yahoo!

Web Usage Mining

Mining Web log records
  Discover user access patterns
  Typical Web log entry: URL requested, the IP address from which the request originated, timestamp, etc.

OLAP on the Weblog database
  Find the top N users, top N accessed Web pages, most frequently accessed time periods, etc.

Data mining on Weblog records
  Find association patterns, sequential patterns, and trends of Web accessing
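
The top-N style summaries can be sketched with simple counting over parsed log entries. The log format used here (IP address, ISO timestamp, URL separated by spaces) and the sample lines are hypothetical; real server logs carry more fields.

```python
from collections import Counter

# Hypothetical log lines: "<ip> <timestamp> <url>".
LOG = """\
10.0.0.1 2004-05-01T10:15:00 /index.html
10.0.0.2 2004-05-01T10:17:30 /products.html
10.0.0.1 2004-05-01T11:02:10 /products.html
10.0.0.3 2004-05-01T11:45:00 /index.html
10.0.0.1 2004-05-01T11:50:20 /cart.html
"""

def parse(log):
    """Yield (ip, timestamp, url) for each log line."""
    for line in log.splitlines():
        ip, timestamp, url = line.split()
        yield ip, timestamp, url

entries = list(parse(LOG))
top_users = Counter(ip for ip, _, _ in entries).most_common(2)
top_pages = Counter(url for _, _, url in entries).most_common(2)
# Characters 11-12 of the ISO timestamp hold the hour of day.
busiest_hours = Counter(ts[11:13] for _, ts, _ in entries).most_common(1)

print(top_users)
print(top_pages)
print(busiest_hours)
```

Grouping the same entries by user and ordering them by timestamp would give the access sequences needed for sequential pattern mining.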

Web Usage Mining

Applications
  Target potential customers for electronic commerce
  Identify potential prime advertisement locations
  Enhance the quality and delivery of Internet information services to the end user
  Improve Web server system performance
    Web caching, Web page prefetching, and Web page swapping

