Data Mining - Extracting Knowledge From Large Datasets
The document discusses data mining and the extraction of knowledge from large datasets. It describes risks such as finding meaningless patterns (Bonferroni's principle, the Rhine Paradox), notes how visualization can help discover patterns in data, defines data mining, and outlines the types of models that can be extracted from data: models that explain, predict, summarize, or extract the most prominent features of the data. It also covers the data analysis pipeline, a preprocessing example based on restaurant reviews (leading to TF-IDF), and sampling techniques, including reservoir sampling.
Meaningfulness of Answers

A big data mining risk is discovering patterns that are meaningless.

Bonferroni's principle: if you look in more places for interesting patterns than your data can support, you are bound to find meaningless patterns.

The Rhine Paradox: an example of how not to conduct scientific research, in which a parapsychologist concluded that telling people they have ESP causes them to lose it.

What is Data Mining?

Data mining is the use of efficient techniques for the analysis of very large collections of data and the extraction of useful and possibly unexpected patterns in data.

"Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst" (Hand, Mannila, Smyth).

"Data mining is the discovery of models for data" (Rajaraman, Ullman). We can have the following types of models:

Models that explain the data (e.g., a single function)
Models that predict future data instances
Models that summarize the data
Models that extract the most prominent features of the data

Visualization: Post-processing

The human eye is a powerful analytical tool. Visualizing the data properly can help discover patterns. There are multiple visualization techniques, such as scatter plots, contour plots, and histograms.
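The plotting library and the synthetic data below are not part of the original notes; this is just a minimal sketch, assuming matplotlib and NumPy, of the histogram and scatter-plot techniques mentioned above.

    # Minimal visualization sketch: histogram and scatter plot on synthetic data.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    x = rng.normal(loc=0.0, scale=1.0, size=1000)    # synthetic continuous variable
    y = 2.0 * x + rng.normal(scale=0.5, size=1000)   # a second, correlated variable

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.hist(x, bins=30)      # histogram: shape of a single variable's distribution
    ax1.set_title("Histogram")
    ax2.scatter(x, y, s=5)    # scatter plot: relationship between two variables
    ax2.set_title("Scatter plot")
    plt.tight_layout()
    plt.show()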
Exploratory Data Analysis

Summary Statistics:

Frequency and Mode: useful for categorical data.
Percentiles, Mean, and Median: measures of location for continuous data.
Range and Variance: measures of spread for continuous data.
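As a small illustration of these summary statistics, the following NumPy sketch computes them for made-up continuous and categorical samples (the values are illustrative only).

    # Summary statistics sketch on synthetic data.
    import numpy as np
    from collections import Counter

    continuous = np.array([2.3, 3.1, 4.8, 5.0, 5.2, 7.9, 8.4])
    categorical = ["pizza", "sushi", "pizza", "burger", "pizza"]

    # Measures of location for continuous data
    print("mean:", continuous.mean())
    print("median:", np.median(continuous))
    print("25th/75th percentiles:", np.percentile(continuous, [25, 75]))

    # Measures of spread for continuous data
    print("range:", continuous.max() - continuous.min())
    print("variance:", continuous.var())

    # Frequency and mode for categorical data
    counts = Counter(categorical)
    print("frequencies:", counts)
    print("mode:", counts.most_common(1)[0][0])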
Normal Distribution:

An important distribution that characterizes many quantities. It is fully characterized by the mean and standard deviation.
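Because the normal distribution is determined entirely by its mean and standard deviation, its density can be written down from those two numbers alone; the short sketch below (with made-up parameters) does exactly that.

    # The normal density depends only on the mean and standard deviation.
    import numpy as np

    def normal_pdf(x, mu, sigma):
        """Density of N(mu, sigma^2) evaluated at x."""
        return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

    data = np.random.default_rng(1).normal(loc=10.0, scale=2.0, size=10_000)
    mu, sigma = data.mean(), data.std()   # these two numbers summarize the whole distribution
    print(mu, sigma, normal_pdf(10.0, mu, sigma))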
Power-law Distributions:

Many real-world phenomena, such as word frequencies, follow power-law distributions. They are detected by a linear relationship in log-log space. Examples include incoming/outgoing links of web pages, number of friends in social networks, file sizes, city sizes, income distribution, and product/movie popularity.
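Following the log-log criterion above, a quick and informal way to check for power-law behavior is to fit a line to the data in log-log space and look at the slope. The sketch below uses synthetic heavy-tailed data and numpy.polyfit; a rigorous analysis would use a proper statistical test rather than this heuristic.

    # Rough power-law check: a power law looks linear in log-log space.
    import numpy as np

    rng = np.random.default_rng(2)
    values = rng.pareto(a=2.0, size=5000) + 1.0   # synthetic heavy-tailed data (e.g., degrees, file sizes)
    values = np.sort(values)[::-1]                # rank-size view: sort descending
    rank = np.arange(1, len(values) + 1)

    # Fit a line in log-log space; an approximately constant slope suggests a power law.
    slope, intercept = np.polyfit(np.log(rank), np.log(values), deg=1)
    print("estimated log-log slope:", slope)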
Why Do We Need Data Mining?

Huge amounts of complex data are generated from multiple sources and interconnected in different ways, such as:

Scientific data from different disciplines (weather, astronomy, physics, biological microarrays, genomics)
Huge text collections (the web, scientific articles, news, tweets, Facebook postings)
Transaction data (retail store records, credit card records)
Behavioral data (mobile phone data, query logs, browsing behavior, ad clicks)
Networked data (the web, social networks, IM networks, email networks, biological networks)

These data types can be combined in many ways (e.g., Facebook has a network, text, images, user behavior, ad transactions). We need to analyze this data to extract knowledge for commercial or scientific purposes, and our solutions should scale to the size of the data.

The Data Analysis Pipeline

Mining is not the only step in the analysis process.

Preprocessing: real data is noisy, incomplete, and inconsistent, so data cleaning is required to make sense of it. Techniques include sampling, dimensionality reduction, and feature selection. This is often the most important step for the analysis.

Post-processing: make the data actionable and useful to the user, through visualization and statistical analysis of importance.

Pre- and post-processing are often data mining tasks as well.
Data Quality

Examples of data quality problems:

Noise and outliers
Missing values
Duplicate data
A Detailed Data Preprocessing Example

Suppose we want to mine the comments/reviews of people on Yelp and Foursquare. Today, there is an abundance of data online (Facebook, Twitter, Wikipedia, the web, etc.). We can extract interesting information from this data, but first we need to collect it.

Data Collection: use customized crawlers, public APIs, and additional cleaning/processing to parse out the useful parts. Respect crawling etiquette.

Example Data: sample review text from Yelp and Foursquare.
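The notes do not prescribe a specific crawler or API, so the sketch below is purely hypothetical: the endpoint URL and JSON layout are invented for illustration, and a real collector must follow the site's terms of service as part of the crawling etiquette mentioned above.

    # Hypothetical data-collection sketch; the endpoint and response format are
    # placeholders, not a real Yelp/Foursquare API.
    import time
    import requests

    def fetch_reviews(business_id):
        url = f"https://api.example.com/businesses/{business_id}/reviews"  # placeholder URL
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.json()   # assume a JSON list of review records

    for business_id in ["restaurant-1", "restaurant-2"]:
        reviews = fetch_reviews(business_id)
        time.sleep(1.0)          # crawling etiquette: throttle requests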
Mining Task: collect all reviews for the top-10 most reviewed restaurants in New York on Yelp, and find a few terms that best describe the restaurants.

First Cut: perform simple processing to "normalize" the data (remove punctuation, make into lowercase, clear white spaces, etc.), break the text into words, and keep the most popular words. The most frequent words are often stop words.

Second Cut: remove stop words using a pre-defined stop word list. The remaining words are more informative for describing the restaurants.
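A minimal sketch of the first and second cuts: normalize the raw review text, count the words, and then drop stop words. The tiny stop-word list and the sample reviews are placeholders, not the ones used in the original analysis.

    # First cut / second cut sketch: normalize text, count words, remove stop words.
    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "and", "is", "to", "of", "was", "it", "in", "for"}  # placeholder list

    def first_cut(reviews):
        """Normalize the text and count word frequencies."""
        counts = Counter()
        for text in reviews:
            text = text.lower()                    # make into lowercase
            text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation
            counts.update(text.split())            # break into words
        return counts

    def second_cut(counts):
        """Drop stop words; the remaining words are more informative."""
        return Counter({w: c for w, c in counts.items() if w not in STOP_WORDS})

    reviews = ["The pizza was great, and the service was fast!",
               "Great pasta; the wine list is excellent."]
    print(second_cut(first_cut(reviews)).most_common(5))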
Statistical Analysis of Importance: TF-IDF

Term Frequency (TF): the number of times a word appears in a document.
Inverse Document Frequency (IDF): a measure of the uniqueness of a word across documents.
TF-IDF = TF * IDF: combines term frequency and inverse document frequency to identify the most important words.
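The notes define TF-IDF as TF * IDF but do not fix the exact IDF formula; the sketch below uses one common choice, IDF = log(N / number of documents containing the word), on a few made-up documents.

    # TF-IDF sketch: weight words by frequency in a document and rarity across documents.
    import math
    from collections import Counter

    docs = [["great", "pizza", "great", "service"],
            ["great", "pasta", "excellent", "wine"],
            ["slow", "service", "cold", "pizza"]]

    N = len(docs)
    df = Counter()            # document frequency: number of docs containing each word
    for doc in docs:
        df.update(set(doc))

    def tf_idf(doc):
        tf = Counter(doc)     # TF: times the word appears in this document
        return {w: tf[w] * math.log(N / df[w]) for w in tf}   # TF * IDF

    for doc in docs:
        print(sorted(tf_idf(doc).items(), key=lambda kv: -kv[1])[:3])

Words that appear in every document get an IDF of log(1) = 0 under this choice, so they drop to the bottom of the ranking, which matches the stop-word intuition above.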
Sampling

Sampling is the main technique employed for data selection. It is often used for both the preliminary investigation of the data and the final data analysis. The key principle for effective sampling is that using a sample will work almost as well as using the entire data set if the sample is representative. A sample is representative if it has approximately the same property (of interest) as the original set of data.

Types of Sampling

Simple Random Sampling: there is an equal probability of selecting any particular item.
Sampling with replacement: objects are not removed from the population as they are selected.
Sampling without replacement: as each item is selected, it is removed from the population.
Stratified Sampling: split the data into several groups, then draw random samples from each group. This ensures that all groups are represented.

Sample Size

The sample size necessary depends on the specific application and the desired level of accuracy. Larger samples generally provide more accurate results, but also require more resources.
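The sampling variants above can be sketched with the standard library's random module; the population and the two strata below are illustrative stand-ins for real data.

    # Sampling sketch: with/without replacement and stratified sampling.
    import random

    population = list(range(100))
    random.seed(0)

    without_replacement = random.sample(population, k=10)   # each item picked at most once
    with_replacement = random.choices(population, k=10)     # items may repeat

    # Stratified sampling: split into groups, then sample from each group.
    groups = {"even": [x for x in population if x % 2 == 0],
              "odd":  [x for x in population if x % 2 == 1]}
    stratified = [x for members in groups.values() for x in random.sample(members, k=5)]
    print(without_replacement, with_replacement, stratified)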
A Data Mining Challenge: Reservoir Sampling

You have N integers and want to sample one integer uniformly at random. The integers are coming in a stream: you do not know the size of the stream in advance, and there is not enough memory to store the stream.

The solution is Reservoir Sampling: with probability 1/n, select the nth item of the stream and replace the previous choice. This ensures that every item has probability 1/N of being selected after N items have been read.
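A short implementation of the rule stated above (keep the nth stream item with probability 1/n), written against an ordinary Python iterable as a stand-in for the unbounded stream. Item n is kept with probability 1/n and then survives the remaining replacements with probability n/(n+1) * ... * (N-1)/N = n/N, so it ends up selected with probability (1/n) * (n/N) = 1/N.

    # Reservoir sampling: pick one item uniformly at random from a stream of unknown length.
    import random

    def reservoir_sample(stream):
        choice = None
        for n, item in enumerate(stream, start=1):
            if random.random() < 1.0 / n:   # keep the nth item with probability 1/n
                choice = item
        return choice                        # each of N items is chosen with probability 1/N

    print(reservoir_sample(iter(range(1_000_000))))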