Data Mining - Extracting Knowledge From Large Datasets
The document discusses data mining and the extraction of knowledge from large datasets. It describes risks such as finding meaningless patterns (Bonferroni's principle, the Rhine Paradox), notes how visualization can help discover patterns in data, defines data mining, and outlines the types of models that can be extracted from data: models that explain, predict, summarize, or extract the most prominent features of the data. It also covers the data analysis pipeline, a preprocessing example based on restaurant reviews (leading to TF-IDF), and sampling techniques, including reservoir sampling.
Meaningfulness of Answers

A big data mining risk is discovering patterns that are meaningless.

Bonferroni's principle: if you look in more places for interesting patterns than your data can support, you are bound to find meaningless patterns.

The Rhine Paradox: an example of how not to conduct scientific research, in which a parapsychologist concluded that telling people they have ESP causes them to lose it.

What is Data Mining?

Data mining is the use of efficient techniques for the analysis of very large collections of data and the extraction of useful and possibly unexpected patterns in data.

"Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst" (Hand, Mannila, Smyth).

"Data mining is the discovery of models for data" (Rajaraman, Ullman). We can have the following types of models:

Models that explain the data (e.g., a single function)
Models that predict future data instances
Models that summarize the data
Models that extract the most prominent features of the data

Visualization: Post-processing

The human eye is a powerful analytical tool. Visualizing the data properly can help discover patterns. There are multiple visualization techniques, such as scatter plots, contour plots, and histograms.
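The plotting library and the synthetic data below are not part of the original notes; this is just a minimal sketch, assuming matplotlib and NumPy, of the histogram and scatter-plot techniques mentioned above.

    # Minimal visualization sketch: histogram and scatter plot on synthetic data.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    x = rng.normal(loc=0.0, scale=1.0, size=1000)    # synthetic continuous variable
    y = 2.0 * x + rng.normal(scale=0.5, size=1000)   # a second, correlated variable

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.hist(x, bins=30)      # histogram: shape of a single variable's distribution
    ax1.set_title("Histogram")
    ax2.scatter(x, y, s=5)    # scatter plot: relationship between two variables
    ax2.set_title("Scatter plot")
    plt.tight_layout()
    plt.show()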
Exploratory Data Analysis

Summary Statistics:

Frequency and Mode: useful for categorical data.
Percentiles, Mean, and Median: measures of location for continuous data.
Range and Variance: measures of spread for continuous data.
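As a small illustration of these summary statistics, the following NumPy sketch computes them for made-up continuous and categorical samples (the values are illustrative only).

    # Summary statistics sketch on synthetic data.
    import numpy as np
    from collections import Counter

    continuous = np.array([2.3, 3.1, 4.8, 5.0, 5.2, 7.9, 8.4])
    categorical = ["pizza", "sushi", "pizza", "burger", "pizza"]

    # Measures of location for continuous data
    print("mean:", continuous.mean())
    print("median:", np.median(continuous))
    print("25th/75th percentiles:", np.percentile(continuous, [25, 75]))

    # Measures of spread for continuous data
    print("range:", continuous.max() - continuous.min())
    print("variance:", continuous.var())

    # Frequency and mode for categorical data
    counts = Counter(categorical)
    print("frequencies:", counts)
    print("mode:", counts.most_common(1)[0][0])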
Normal Distribution:

An important distribution that characterizes many quantities. It is fully characterized by the mean and standard deviation.
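Because the normal distribution is determined entirely by its mean and standard deviation, its density can be written down from those two numbers alone; the short sketch below (with made-up parameters) does exactly that.

    # The normal density depends only on the mean and standard deviation.
    import numpy as np

    def normal_pdf(x, mu, sigma):
        """Density of N(mu, sigma^2) evaluated at x."""
        return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

    data = np.random.default_rng(1).normal(loc=10.0, scale=2.0, size=10_000)
    mu, sigma = data.mean(), data.std()   # these two numbers summarize the whole distribution
    print(mu, sigma, normal_pdf(10.0, mu, sigma))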
Power-law Distributions:

Many real-world phenomena, such as word frequencies, follow power-law distributions. They are detected by a linear relationship in log-log space. Examples include incoming/outgoing links of web pages, number of friends in social networks, file sizes, city sizes, income distribution, and product/movie popularity.
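Following the log-log criterion above, a quick and informal way to check for power-law behavior is to fit a line to the data in log-log space and look at the slope. The sketch below uses synthetic heavy-tailed data and numpy.polyfit; a rigorous analysis would use a proper statistical test rather than this heuristic.

    # Rough power-law check: a power law looks linear in log-log space.
    import numpy as np

    rng = np.random.default_rng(2)
    values = rng.pareto(a=2.0, size=5000) + 1.0   # synthetic heavy-tailed data (e.g., degrees, file sizes)
    values = np.sort(values)[::-1]                # rank-size view: sort descending
    rank = np.arange(1, len(values) + 1)

    # Fit a line in log-log space; an approximately constant slope suggests a power law.
    slope, intercept = np.polyfit(np.log(rank), np.log(values), deg=1)
    print("estimated log-log slope:", slope)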
Why Do We Need Data Mining?

Huge amounts of complex data are generated from multiple sources and interconnected in different ways, such as:

Scientific data from different disciplines (weather, astronomy, physics, biological microarrays, genomics)
Huge text collections (the web, scientific articles, news, tweets, Facebook postings)
Transaction data (retail store records, credit card records)
Behavioral data (mobile phone data, query logs, browsing behavior, ad clicks)
Networked data (the web, social networks, IM networks, email networks, biological networks)

These data types can be combined in many ways (e.g., Facebook has a network, text, images, user behavior, ad transactions). We need to analyze this data to extract knowledge for commercial or scientific purposes, and our solutions should scale to the size of the data.

The Data Analysis Pipeline

Mining is not the only step in the analysis process.

Preprocessing: real data is noisy, incomplete, and inconsistent, so data cleaning is required to make sense of it. Techniques include sampling, dimensionality reduction, and feature selection. This is often the most important step for the analysis.

Post-processing: make the data actionable and useful to the user, through visualization and statistical analysis of importance.

Pre- and post-processing are often data mining tasks as well.
Data Quality

Examples of data quality problems:

Noise and outliers
Missing values
Duplicate data
A Detailed Data Preprocessing Example

Suppose we want to mine the comments/reviews of people on Yelp and Foursquare. Today, there is an abundance of data online (Facebook, Twitter, Wikipedia, the web, etc.). We can extract interesting information from this data, but first we need to collect it.

Data Collection: use customized crawlers, public APIs, and additional cleaning/processing to parse out the useful parts. Respect crawling etiquette.

Example Data: sample review text from Yelp and Foursquare.
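The notes do not prescribe a specific crawler or API, so the sketch below is purely hypothetical: the endpoint URL and JSON layout are invented for illustration, and a real collector must follow the site's terms of service as part of the crawling etiquette mentioned above.

    # Hypothetical data-collection sketch; the endpoint and response format are
    # placeholders, not a real Yelp/Foursquare API.
    import time
    import requests

    def fetch_reviews(business_id):
        url = f"https://api.example.com/businesses/{business_id}/reviews"  # placeholder URL
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.json()   # assume a JSON list of review records

    for business_id in ["restaurant-1", "restaurant-2"]:
        reviews = fetch_reviews(business_id)
        time.sleep(1.0)          # crawling etiquette: throttle requests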
Mining Task: collect all reviews for the top-10 most reviewed restaurants in New York on Yelp, and find a few terms that best describe the restaurants.

First Cut: perform simple processing to "normalize" the data (remove punctuation, make into lowercase, clear white spaces, etc.), break the text into words, and keep the most popular words. The most frequent words are often stop words.

Second Cut: remove stop words using a pre-defined stop word list. The remaining words are more informative for describing the restaurants.
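A minimal sketch of the first and second cuts: normalize the raw review text, count the words, and then drop stop words. The tiny stop-word list and the sample reviews are placeholders, not the ones used in the original analysis.

    # First cut / second cut sketch: normalize text, count words, remove stop words.
    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "and", "is", "to", "of", "was", "it", "in", "for"}  # placeholder list

    def first_cut(reviews):
        """Normalize the text and count word frequencies."""
        counts = Counter()
        for text in reviews:
            text = text.lower()                    # make into lowercase
            text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation
            counts.update(text.split())            # break into words
        return counts

    def second_cut(counts):
        """Drop stop words; the remaining words are more informative."""
        return Counter({w: c for w, c in counts.items() if w not in STOP_WORDS})

    reviews = ["The pizza was great, and the service was fast!",
               "Great pasta; the wine list is excellent."]
    print(second_cut(first_cut(reviews)).most_common(5))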
Statistical Analysis of Importance: TF-IDF

Term Frequency (TF): the number of times a word appears in a document.
Inverse Document Frequency (IDF): a measure of the uniqueness of a word across documents.
TF-IDF = TF * IDF: combines term frequency and inverse document frequency to identify the most important words.
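The notes define TF-IDF as TF * IDF but do not fix the exact IDF formula; the sketch below uses one common choice, IDF = log(N / number of documents containing the word), on a few made-up documents.

    # TF-IDF sketch: weight words by frequency in a document and rarity across documents.
    import math
    from collections import Counter

    docs = [["great", "pizza", "great", "service"],
            ["great", "pasta", "excellent", "wine"],
            ["slow", "service", "cold", "pizza"]]

    N = len(docs)
    df = Counter()            # document frequency: number of docs containing each word
    for doc in docs:
        df.update(set(doc))

    def tf_idf(doc):
        tf = Counter(doc)     # TF: times the word appears in this document
        return {w: tf[w] * math.log(N / df[w]) for w in tf}   # TF * IDF

    for doc in docs:
        print(sorted(tf_idf(doc).items(), key=lambda kv: -kv[1])[:3])

Words that appear in every document get an IDF of log(1) = 0 under this choice, so they drop to the bottom of the ranking, which matches the stop-word intuition above.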
Sampling

Sampling is the main technique employed for data selection. It is often used for both the preliminary investigation of the data and the final data analysis. The key principle for effective sampling is that using a sample will work almost as well as using the entire data set if the sample is representative. A sample is representative if it has approximately the same property (of interest) as the original set of data.

Types of Sampling

Simple Random Sampling: there is an equal probability of selecting any particular item.
Sampling with replacement: objects are not removed from the population as they are selected.
Sampling without replacement: as each item is selected, it is removed from the population.
Stratified Sampling: split the data into several groups, then draw random samples from each group. This ensures that all groups are represented.

Sample Size

The sample size necessary depends on the specific application and the desired level of accuracy. Larger samples generally provide more accurate results, but also require more resources.
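The sampling variants above can be sketched with the standard library's random module; the population and the two strata below are illustrative stand-ins for real data.

    # Sampling sketch: with/without replacement and stratified sampling.
    import random

    population = list(range(100))
    random.seed(0)

    without_replacement = random.sample(population, k=10)   # each item picked at most once
    with_replacement = random.choices(population, k=10)     # items may repeat

    # Stratified sampling: split into groups, then sample from each group.
    groups = {"even": [x for x in population if x % 2 == 0],
              "odd":  [x for x in population if x % 2 == 1]}
    stratified = [x for members in groups.values() for x in random.sample(members, k=5)]
    print(without_replacement, with_replacement, stratified)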
A Data Mining Challenge: Reservoir Sampling

You have N integers and want to sample one integer uniformly at random. The integers are coming in a stream: you do not know the size of the stream in advance, and there is not enough memory to store the stream.

The solution is Reservoir Sampling: with probability 1/n, select the nth item of the stream and replace the previous choice. This ensures that every item has probability 1/N of being selected after N items have been read.
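A short implementation of the rule stated above (keep the nth stream item with probability 1/n), written against an ordinary Python iterable as a stand-in for the unbounded stream. Item n is kept with probability 1/n and then survives the remaining replacements with probability n/(n+1) * ... * (N-1)/N = n/N, so it ends up selected with probability (1/n) * (n/N) = 1/N.

    # Reservoir sampling: pick one item uniformly at random from a stream of unknown length.
    import random

    def reservoir_sample(stream):
        choice = None
        for n, item in enumerate(stream, start=1):
            if random.random() < 1.0 / n:   # keep the nth item with probability 1/n
                choice = item
        return choice                        # each of N items is chosen with probability 1/N

    print(reservoir_sample(iter(range(1_000_000))))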