Data Mining - Extracting Knowledge From Large Datasets

The document discusses data mining and extracting knowledge from large datasets. It describes some risks like finding meaningless patterns and provides examples. Visualization techniques can help discover patterns in data. The document also defines data mining and different types of models that can be extracted from data like predictive, summarization and feature extraction models.

Uploaded by ralfaryw

4/21/24, 7:20 PM Data Mining_ Extracting Knowledge from Large Datasets.svg

What is Data Mining?
Data mining is the use of efficient techniques for the analysis of very large collections of data and the extraction of useful and possibly unexpected patterns in data.
"Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst" (Hand, Mannila, Smyth).
"Data mining is the discovery of models for data" (Rajaraman, Ullman).
We can have the following types of models:
Models that explain the data (e.g., a single function)
Models that predict the future data instances
Models that summarize the data
Models that extract the most prominent features of the data

Meaningfulness of Answers
A big data mining risk is discovering patterns that are meaningless.
Bonferroni's principle: If you look in more places for interesting patterns than your data can support, you are bound to find meaningless patterns.
The Rhine Paradox: An example of how not to conduct scientific research, where a parapsychologist concluded that telling people they have ESP causes them to lose it.

Post-processing: Visualization
The human eye is a powerful analytical tool.
Visualizing the data properly can help discover patterns.
There are multiple visualization techniques, such as scatter plots, contour plots, and histograms.

Exploratory Data Analysis: Summary Statistics
Frequency and Mode: Useful for categorical data.
Percentiles, Mean, and Median: Measures of location for continuous data.
Range and Variance: Measures of spread for continuous data.
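
The summary statistics listed above can be sketched with Python's standard library. This is only an illustration: the function names and the sample values are made up for the example, and the use of the sample variance (rather than the population variance) is an assumption, since the notes do not say which one is meant.

```python
import statistics
from collections import Counter


def summarize_categorical(values):
    """Frequency table and mode for categorical data."""
    counts = Counter(values)
    mode = counts.most_common(1)[0][0]  # most frequent category
    return counts, mode


def summarize_continuous(values):
    """Measures of location and spread for continuous data."""
    return {
        # Measures of location
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Three cut points: 25th, 50th and 75th percentiles
        # (needs at least two data points on older Python versions)
        "quartiles": statistics.quantiles(values, n=4),
        # Measures of spread
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # sample variance (assumption)
    }


# Hypothetical data, for illustration only
colors = ["red", "blue", "red", "green"]
ratings = [2, 4, 4, 4, 5, 5, 7, 9]

color_counts, color_mode = summarize_categorical(colors)
rating_summary = summarize_continuous(ratings)
print(color_mode, rating_summary)
```

Swapping `statistics.variance` for `statistics.pvariance` gives the population variance instead.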

Exploratory Data Analysis: Normal Distribution
An important distribution that characterizes many quantities.
Fully characterized by the mean and standard deviation.

Exploratory Data Analysis: Power-law Distributions
Many real-world phenomena, such as word frequencies, follow power-law distributions.
Detected by a linear relationship in the log-log space.
Examples include incoming/outgoing links of web pages, number of friends in social networks, file sizes, city sizes, income distribution, and product/movie popularity.

Why Do We Need Data Mining?
Huge amounts of complex data generated from multiple sources and interconnected in different ways, such as:
Scientific data from different disciplines (weather, astronomy, physics, biological microarrays, genomics)
Huge text collections (the web, scientific articles, news, tweets, Facebook postings)
Transaction data (retail store records, credit card records)
Behavioral data (mobile phone data, query logs, browsing behavior, ad clicks)
Networked data (the web, social networks, IM networks, email networks, biological networks)
These data types can be combined in many ways (e.g., Facebook has a network, text, images, user behavior, ad transactions).
We need to analyze this data to extract knowledge for commercial or scientific purposes.
Our solutions should scale to the size of the data.

The Data Analysis Pipeline
Mining is not the only step in the analysis process:
Preprocessing: Real data is noisy, incomplete, and inconsistent. Data cleaning is required to make sense of the data.
Techniques: Sampling, dimensionality reduction, feature selection.
This is often the most important step for the analysis.
Post-processing: Make the data actionable and useful to the user.
Statistical analysis of importance
Visualization
Pre- and post-processing are often data mining tasks as well.

Data Preprocessing

Data Quality
Examples of data quality problems:
Noise and outliers
Missing values
Duplicate data

Example Data:
Sample review text from Yelp and Foursquare.

First Cut:
Perform simple processing to "normalize" the data (remove punctuation, make into lowercase, clear white spaces, etc.).
Break into words and keep the most popular words.

Second Cut:
The most frequent words are often stop words.
Remove stop words using a pre-defined stop word list.
The remaining words are more informative for describing the restaurants.

TF-IDF:
Term Frequency (TF): The number of times a word appears in a document.
Inverse Document Frequency (IDF): A measure of the uniqueness of a word across documents.
TF-IDF = TF * IDF: Combines term frequency and inverse document frequency to identify the most important words.
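
The first cut, second cut, and TF-IDF steps above could be sketched in Python roughly as follows. Several things here are assumptions rather than part of the notes: the IDF formula log(N / df), where N is the number of documents and df is the number of documents containing the word, is one common choice (the notes only call IDF "a measure of uniqueness"); the stop word list is a tiny stand-in for a real pre-defined list; and the two review strings are invented, not actual Yelp or Foursquare text.

```python
import math
import re
from collections import Counter

# Toy stop word list (assumption); a real pre-defined list is much longer.
STOP_WORDS = {"the", "a", "and", "is", "was", "of", "to", "in", "it", "for"}


def normalize(text):
    """First cut: lowercase, drop punctuation, collapse white space."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # non-word, non-space chars -> space
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text, stop_words=STOP_WORDS):
    """Second cut: break into words and remove stop words."""
    return [w for w in normalize(text).split() if w not in stop_words]


def tf_idf(documents):
    """Return one {word: TF * IDF} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per document
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each word in this document
        scores.append(
            {w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf}
        )
    return scores


def top_terms(score, k=3):
    """The k words with the highest TF-IDF score in one document."""
    return sorted(score, key=score.get, reverse=True)[:k]


# Invented review text, for illustration only
reviews = ["The pizza was great, great pizza!", "The ramen is great."]
for score in tf_idf(reviews):
    print(top_terms(score, k=1))
```

With this IDF choice, a word that appears in every document gets a score of zero, so it never ranks as a describing term even when it is frequent.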


A Detailed Data Preprocessing Example
Suppose we want to mine the comments/reviews of people on Yelp and Foursquare.
Data Collection:
Today, there is an abundance of data online (Facebook, Twitter, Wikipedia, the web, etc.).
We can extract interesting information from this data, but first we need to collect it.
Use customized crawlers, public APIs, and additional cleaning/processing to parse out the useful parts.
Respect crawling etiquette.
Mining Task:
Collect all reviews for the top-10 most reviewed restaurants in New York on Yelp.
Find a few terms that best describe the restaurants.

Sampling
Sampling is the main technique employed for data selection.
It is often used for both the preliminary investigation of the data and the final data analysis.
The key principle for effective sampling is that using a sample will work almost as well as using the entire data set if the sample is representative.
A sample is representative if it has approximately the same property (of interest) as the original set of data.

Types of Sampling
Simple Random Sampling:
There is an equal probability of selecting any particular item.
Sampling without replacement: As each item is selected, it is removed from the population.
Sampling with replacement: Objects are not removed from the population as they are selected.
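
The two variants of simple random sampling can be sketched with Python's standard `random` module. The wrapper function names and the population below are made up for this illustration; `random.sample` and `random.choices` are the standard library calls doing the work.

```python
import random


def sample_without_replacement(population, k, rng):
    """Each selected item is removed, so no item can appear twice."""
    return rng.sample(population, k)


def sample_with_replacement(population, k, rng):
    """Items stay in the population, so the same item can be drawn again."""
    return rng.choices(population, k=k)


rng = random.Random(7)          # seeded only to make the run repeatable
population = list(range(100))   # hypothetical data set

print(sample_without_replacement(population, 10, rng))
print(sample_with_replacement(population, 10, rng))
```

Sampling without replacement requires a sample size no larger than the population, while sampling with replacement has no such limit.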

Stratified Sampling:
Split the data into several groups, then draw random samples from each group.
Ensures that all groups are represented.

Sample Size
The sample size necessary depends on the specific application and the desired level of accuracy.
Larger samples generally provide more accurate results, but also require more resources.

Reservoir Sampling
A Data Mining Challenge:
You have N integers and want to sample one integer uniformly at random.
The integers are coming in a stream: you do not know the size of the stream in advance, and there is not enough memory to store the stream.
The solution is Reservoir Sampling:
With probability 1/n, select the nth item of the stream and replace the previous choice.
This ensures that every item has probability 1/N to be selected after N items have been read.
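
The 1/N claim can be checked directly: the nth item is kept if it is selected at step n (probability 1/n) and not replaced at any later step, which gives (1/n) * (n/(n+1)) * ... * ((N-1)/N) = 1/N. A minimal Python sketch of the rule follows; the function name and the optional `rng` parameter are choices made for this example.

```python
import random


def reservoir_sample_one(stream, rng=None):
    """Pick one item uniformly at random from a stream of unknown length.

    Only the current choice is kept in memory, never the stream itself.
    Returns None if the stream is empty.
    """
    if rng is None:
        rng = random.Random()
    choice = None
    for n, item in enumerate(stream, start=1):
        # Select the nth item with probability 1/n, replacing the
        # previous choice. For n = 1 the test always succeeds because
        # rng.random() is in [0, 1).
        if rng.random() < 1.0 / n:
            choice = item
    return choice


# Usage: works on any iterable, including a generator that is read once.
print(reservoir_sample_one(x * x for x in range(1000)))
```

The same idea extends to keeping a sample of k items, but that variant is not covered in the notes above.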
