The document introduces data preprocessing techniques for data mining. It discusses why data preprocessing is important due to real-world data often being dirty, incomplete, noisy, inconsistent or duplicate. It then describes common data types and quality issues like missing values, noise, outliers and duplicates. The major tasks of data preprocessing are outlined as data cleaning, integration, transformation and reduction. Specific techniques for handling missing values, noise, outliers and duplicates are also summarized.
This document discusses various techniques for data preprocessing, including data cleaning, integration, transformation, and reduction. It describes why preprocessing is important for obtaining quality data and mining results. Key techniques covered include handling missing data, smoothing noisy data, data integration and normalization for transformation, and data reduction methods like binning, discretization, feature selection and dimensionality reduction.
This document discusses various machine learning techniques for classification and prediction. It covers decision tree induction, tree pruning, Bayesian classification, Bayesian belief networks, backpropagation, association rule mining, and ensemble methods like bagging and boosting. Classification involves predicting categorical labels while prediction predicts continuous values. Key steps for preparing data include cleaning, transformation, and comparing different methods based on accuracy, speed, robustness, scalability, and interpretability.
The document provides an overview of data mining concepts and techniques. It introduces data mining, describing it as the process of discovering interesting patterns or knowledge from large amounts of data. It discusses why data mining is necessary due to the explosive growth of data and how it relates to other fields like machine learning, statistics, and database technology. Additionally, it covers different types of data that can be mined, functionalities of data mining like classification and prediction, and classifications of data mining systems.
Data preprocessing is the process of preparing raw data for analysis by cleaning it, transforming it, and reducing it. The key steps in data preprocessing include data cleaning to handle missing values, outliers, and noise; data transformation techniques like normalization, discretization, and feature extraction; and data reduction methods like dimensionality reduction and sampling. Preprocessing ensures the data is consistent, accurate and suitable for building machine learning models.
Data preprocessing involves transforming raw data into an understandable and consistent format. It includes data cleaning, integration, transformation, and reduction. Data cleaning aims to fill missing values, smooth noise, and resolve inconsistencies. Data integration combines data from multiple sources. Data transformation handles tasks like normalization and aggregation to prepare the data for mining. Data reduction techniques obtain a reduced representation of data that maintains analytical results but reduces volume, such as through aggregation, dimensionality reduction, discretization, and sampling.
- Naive Bayes is a classification technique based on Bayes' theorem that uses "naive" independence assumptions. It is easy to build and can perform well even with large datasets.
- It works by calculating the posterior probability for each class given predictor values using the Bayes theorem and independence assumptions between predictors. The class with the highest posterior probability is predicted.
- It is commonly used for text classification, spam filtering, and sentiment analysis due to its fast performance and high success rates compared to other algorithms.
This document discusses unsupervised learning approaches including clustering, blind signal separation, and self-organizing maps (SOM). Clustering groups unlabeled data points together based on similarities. Blind signal separation separates mixed signals into their underlying source signals without information about the mixing process. SOM is an algorithm that maps higher-dimensional data onto lower-dimensional displays to visualize relationships in the data.
This presentation was prepared as part of the curriculum studies for CSCI-659 Topics in Artificial Intelligence Course - Machine Learning in Computational Linguistics.
It was prepared under guidance of Prof. Sandra Kubler.
PCA and LDA are dimensionality reduction techniques. PCA transforms variables into uncorrelated principal components while maximizing variance. It is unsupervised. LDA finds axes that maximize separation between classes while minimizing within-class variance. It is supervised and finds axes that separate classes well. The document provides mathematical explanations of how PCA and LDA work including calculating covariance matrices, eigenvalues, eigenvectors, and transformations.
Machine Learning With Logistic Regression, by Knoldus Inc.
Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed. Logistic regression is a classification algorithm that builds on linear regression to estimate class probabilities and minimize error.
Slides explaining the distinction between bagging and boosting in terms of the bias-variance trade-off, followed by some lesser-known aspects of supervised learning: the effect of the tree-split metric on feature importance, the effect of the threshold on classification accuracy, and how to adjust a model's classification threshold in supervised learning.
Note: the limitations of the accuracy metric (baseline accuracy) and alternative metrics, with their use cases, advantages, and limitations, were briefly discussed.
This document discusses rule-based classification. It describes how rule-based classification models use if-then rules to classify data. It covers extracting rules from decision trees and directly from training data. Key points include using sequential covering algorithms to iteratively learn rules that each cover positive examples of a class, and measuring rule quality based on both coverage and accuracy to determine the best rules.
The document discusses various techniques for data pre-processing. It begins by explaining why pre-processing is important for obtaining clean and consistent data needed for quality data mining results. It then covers topics such as data cleaning, integration, transformation, reduction, and discretization. Data cleaning involves techniques for handling missing values, outliers, and inconsistencies. Data integration combines data from multiple sources. Transformation techniques include normalization, aggregation, and generalization. Data reduction aims to reduce data volume while maintaining analytical quality. Discretization includes binning of continuous variables.
This document discusses various techniques for data preprocessing, including data cleaning, integration and transformation, reduction, and discretization. It provides details on techniques for handling missing data, noisy data, and data integration issues. It also describes methods for data transformation such as normalization, aggregation, and attribute construction. Finally, it outlines various data reduction techniques including cube aggregation, attribute selection, dimensionality reduction, and numerosity reduction.
This document provides an overview of clustering techniques and similarity measures. It introduces clustering as an unsupervised classification technique where data is grouped without predefined classes. Various similarity and dissimilarity measures are discussed for calculating proximity between data points defined by single or multiple attributes. Measures like symmetric binary coefficient and Jaccard coefficient are explained for computing similarity between objects described by symmetric and asymmetric binary attributes respectively. Examples are also provided to demonstrate calculating these similarity measures.
Data is often incomplete, noisy, and inconsistent which can negatively impact mining results. Effective data cleaning is needed to fill in missing values, identify and remove outliers, and resolve inconsistencies. Other important tasks include data integration, transformation, reduction, and discretization to prepare the data for mining and obtain reduced representation that produces similar analytical results. Proper data preparation is essential for high quality knowledge discovery.
You will learn the basic concepts of machine learning classification and will be introduced to some different algorithms that can be used. The treatment is very high-level and does not get into the nitty-gritty details.
Scaling transforms data values to fall within a specific range, such as 0 to 1, without changing the data distribution. Normalization changes the data distribution to be normal. Common normalization techniques include standardization, which transforms data to have mean 0 and standard deviation 1, and Box-Cox transformation, which finds the best lambda value to make data more normal. Normalization is useful for algorithms that assume normal data distributions and can improve model performance and interpretation.
Decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision.
This document provides an overview of data preprocessing techniques. It discusses data quality issues like missing values, noise, and inconsistencies that require cleaning. Major tasks in preprocessing include data cleaning, integration, reduction, and transformation. Data cleaning techniques are described for handling incomplete, noisy, and inconsistent data. Methods for data integration, reduction through dimensionality reduction and sampling, and transformation through normalization and discretization are also summarized.
Data Preprocessing can be defined as a process of converting raw data into a format that is understandable and usable for further analysis. It is an important step in the Data Preparation stage. It ensures that the outcome of the analysis is accurate, complete, and consistent.
Data preprocessing involves several key steps:
1) Data cleaning to fill in missing values, identify and remove outliers, and resolve inconsistencies
2) Data integration to combine multiple data sources and resolve conflicts and redundancies
3) Data reduction techniques like discretization, dimensionality reduction, and aggregation to obtain a reduced representation of the data for mining and analysis.
This document discusses data preprocessing techniques for data mining. It covers data cleaning, integration, reduction, transformation, and discretization. Data cleaning involves handling missing, noisy, and inconsistent data through techniques like filling in missing values, smoothing noisy data, and resolving inconsistencies. Data integration combines data from multiple sources. Data reduction reduces data size through dimensionality reduction, numerosity reduction, and compression. Dimensionality reduction techniques include wavelet transforms and principal component analysis.
Jiawei Han, Micheline Kamber and Jian Pei
Data Mining: Concepts and Techniques, 3rd ed.
The Morgan Kaufmann Series in Data Management Systems
Morgan Kaufmann Publishers, July 2011. ISBN 978-0123814791
This document summarizes key concepts from Chapter 3 of the textbook "Data Mining: Concepts and Techniques". It discusses data preprocessing, which includes data cleaning, integration, reduction, and transformation. Data cleaning deals with handling missing, noisy, and inconsistent data. Data integration combines data from multiple sources. Data reduction reduces data volume for analysis through techniques like dimensionality reduction. Data transformation normalizes and discretizes values.
The document discusses different types of data sets and various concepts related to data preprocessing. It describes common data types like records, relational data, and transaction data. It also defines key concepts in data preprocessing like data objects, attributes, handling missing/noisy data, data integration, reduction, transformation and discretization. The goal of these techniques is to clean, integrate and prepare raw data for analysis.
This document discusses data preprocessing and data warehouses. It explains that real-world data is often dirty, incomplete, noisy, and inconsistent. Data preprocessing aims to clean and transform raw data into a format suitable for data mining. The key tasks of data preprocessing include data cleaning, integration, transformation, reduction, and discretization. Data cleaning involves techniques like handling missing data, identifying outliers, and resolving inconsistencies. Data integration combines data from multiple sources. The document also defines characteristics of a data warehouse such as being subject-oriented, integrated, time-variant, and nonvolatile.
The document discusses various techniques for data preprocessing, including data cleaning, integration, reduction, and transformation. It describes why preprocessing is important for improving data quality, accuracy, and consistency. Several forms of data preprocessing are covered in detail, such as handling missing or noisy data, data integration, dimensionality reduction techniques like principal component analysis, and different strategies for data reduction.
Data Mining: Concepts and Techniques (3rd ed.), Chapter 3: Preprocessing, by Salah Amean
The chapter contains:
Data Preprocessing: An Overview,
Data Quality,
Major Tasks in Data Preprocessing,
Data Cleaning,
Data Integration,
Data Reduction,
Data Transformation and Data Discretization,
Summary.
Data preprocessing involves cleaning, transforming, and reducing raw data to prepare it for modeling and analysis. The document discusses several key aspects of data preprocessing including:
- Why data preprocessing is important to improve data quality and ensure accurate analysis results.
- Common data issues like missing values, noise, inconsistencies that require cleaning. Techniques for cleaning include filling in missing data, identifying and handling outliers, and resolving inconsistencies.
- Methods for reducing data like binning, regression, clustering, sampling to obtain a smaller yet representative version of the data.
- The major tasks in preprocessing like data cleaning, integration, transformation, reduction and discretization which are aimed at handling real-world data issues.
This document provides an overview of data preprocessing techniques for data mining. It discusses data quality issues like accuracy, completeness, and consistency that require data cleaning. The major tasks of data preprocessing are described as data cleaning, integration, reduction, and transformation. Specific techniques for handling missing data, noisy data, and reducing redundancy during data integration are also summarized.
The document discusses data preprocessing tasks that are commonly performed on real-world databases before data mining or analysis. These tasks include data cleaning to handle incomplete, noisy, or inconsistent data through techniques like filling in missing values, identifying outliers, and resolving inconsistencies. Data integration is used to combine data from multiple sources by resolving attribute name differences and eliminating redundancies. Data transformation techniques like normalization, attribute construction, aggregation, and generalization are also discussed to convert data into appropriate forms for mining algorithms or users. The goal of these preprocessing steps is to improve the quality and consistency of data for subsequent analysis and knowledge discovery.
This document summarizes key concepts from Chapter 3 of the textbook "Data Mining: Concepts and Techniques". It discusses the importance of data preprocessing, which includes tasks like data cleaning, integration, reduction, and transformation. Specific techniques are described for handling missing/noisy data, data integration when combining multiple sources, and reducing dimensionality through feature selection or dimension reduction methods like PCA. The goal of preprocessing is to prepare raw data into a format suitable for data mining algorithms.
This document provides an overview of data preprocessing techniques discussed in Chapter 3 of the textbook "Data Mining: Concepts and Techniques". It covers topics such as data quality, data cleaning, data integration, data reduction, and data transformation. Data reduction techniques like dimensionality reduction aim to obtain a reduced representation of data that uses less space but produces similar analytical results. Dimensionality reduction methods include wavelet transforms, principal component analysis, and feature selection. Wavelet transforms decompose a signal into different frequency subbands and allow clusters to become more distinguishable at different resolution levels.
1. Introduction to Data Mining, Ch. 2: Data Preprocessing. Heon Gyu Lee ( [email_address] ), http://dblab.chungbuk.ac.kr/~hglee, DB/Bioinfo. Lab., http://dblab.chungbuk.ac.kr, Chungbuk National University
2. Why Data Preprocessing? Data in the real world is dirty. Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data, e.g., occupation=“ ”. Noisy: containing errors or outliers, e.g., Salary=“-10”. Inconsistent: containing discrepancies in codes or names, e.g., Age=“42” but Birthday=“03/07/1997”; a rating that was “1, 2, 3” is now “A, B, C”; discrepancies between duplicate records.
3. What is Data? A collection of data objects and their attributes. An attribute is a property or characteristic of an object. Examples: eye color of a person, temperature, etc. An attribute is also known as a variable, field, characteristic, or feature. A collection of attributes describes an object. An object is also known as a record, point, case, sample, entity, or instance.
4. Types of Attributes There are different types of attributes. Nominal: ID numbers, eye color, zip codes. Ordinal: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}. Interval: calendar dates, temperatures in Celsius or Fahrenheit. Ratio: temperature in Kelvin, length, time, counts.
5. Discrete and Continuous Attributes Discrete Attribute Has only a finite or countably infinite set of values Examples: zip codes, counts, or the set of words in a collection of documents Often represented as integer variables. Note: binary attributes are a special case of discrete attributes Continuous Attribute Has real numbers as attribute values Examples: temperature, height, or weight. Practically, real values can only be measured and represented using a finite number of digits. Continuous attributes are typically represented as floating-point variables.
6. Data Quality What kinds of data quality problems? How can we detect problems with the data? What can we do about these problems? Examples of data quality problems: Noise and outliers missing values duplicate data
7. Noise Noise refers to modification of original values. Examples: distortion of a person’s voice when talking on a poor phone, and “snow” on a television screen. (figure: two sine waves; the same two sine waves + noise)
8. Outliers Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set
9. Missing Values Reasons for missing values Information is not collected (e.g., people decline to give their age and weight) Attributes may not be applicable to all cases (e.g., annual income is not applicable to children) Handling missing values Eliminate Data Objects Estimate Missing Values Ignore the Missing Value During Analysis Replace with all possible values (weighted by their probabilities)
10. Duplicate Data Data sets may include data objects that are duplicates, or almost duplicates, of one another. This is a major issue when merging data from heterogeneous sources. Example: the same person with multiple email addresses. Data cleaning: the process of dealing with duplicate-data issues.
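Exact duplicates are straightforward to drop once records share a reliable key; a minimal pandas sketch in which the names, the addresses, and the choice of email as the key are all hypothetical:

```python
import pandas as pd

# Hypothetical merge of two customer lists: one person appears twice under
# slightly different names but with the same email address
people = pd.DataFrame({
    "name":  ["Bill Clinton", "William Clinton", "Ada Lovelace"],
    "email": ["bill@example.com", "bill@example.com", "ada@example.com"],
})

# Treat the email address as the identifying key and keep the first record
deduped = people.drop_duplicates(subset="email", keep="first")
print(deduped)
```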
11. Major Tasks in Data Preprocessing Data cleaning Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies Data integration Integration of multiple databases, data cubes, or files Data transformation Normalization and aggregation Data reduction Obtains reduced representation in volume but produces the same or similar analytical results Data discretization Part of data reduction but with particular importance, especially for numerical data
13. Data Cleaning: Importance. “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball). “Data cleaning is the number one problem in data warehousing” (DCI survey). Data cleaning tasks: fill in missing values; identify outliers and smooth out noisy data; correct inconsistent data; resolve redundancy caused by data integration.
14. Data Cleaning: How to Handle Missing Data? Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably. Fill in the missing value manually. Fill it in automatically with: a global constant, e.g., “unknown” (a new class?!); the attribute mean; the attribute mean for all samples belonging to the same class (smarter); or the most probable value (inference-based, such as a Bayesian formula or regression).
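A minimal pandas sketch of the automatic fill strategies above; the DataFrame, the class column cls, and the income values are hypothetical:

```python
import pandas as pd

# Hypothetical table: one income value is missing
df = pd.DataFrame({
    "cls":    ["A", "A", "B", "B"],
    "income": [30000.0, None, 52000.0, 61000.0],
})

# Global constant: mark the hole explicitly (a numeric stand-in for "unknown")
df["income_const"] = df["income"].fillna(-1)

# The attribute mean over all samples
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Smarter: the attribute mean for samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("cls")["income"].transform("mean"))

print(df)
```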
15. Data Cleaning: How to Handle Noisy Data? Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, bin boundaries, etc. Regression: smooth by fitting the data to regression functions. Clustering: detect and remove outliers. Combined computer and human inspection: detect suspicious values and check by a human (e.g., deal with possible outliers).
16. Data Cleaning : Binning Methods Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into equal-frequency (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 * Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34
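The same example in plain Python, as a sketch; the bin size of 4 follows from partitioning the 12 sorted prices into three equal-frequency bins:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

# Partition into three equal-frequency (equi-depth) bins of 4 values each
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

# Smoothing by bin means: every value is replaced by its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```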
19. Data Integration Data integration: combines data from multiple sources into a coherent store. Schema integration: e.g., A.cust-id ≡ B.cust-#; integrate metadata from different sources. Entity identification problem: identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton. Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources differ. Possible reasons: different representations, different scales.
20. Data Integration: Handling Redundancy in Data Integration. Redundant data occur often when integrating multiple databases. Object identification: the same attribute or object may have different names in different databases. Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue. Redundant attributes may be detectable by correlation analysis. Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
21. Data Integration: Correlation Analysis (Numerical Data). Correlation coefficient (also called Pearson’s product-moment coefficient):

$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{n\,\sigma_A \sigma_B}$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum a_i b_i$ is the sum of the AB cross-product. If $r_{A,B} > 0$, A and B are positively correlated (A’s values increase as B’s do); the higher the value, the stronger the correlation. $r_{A,B} = 0$: independent; $r_{A,B} < 0$: negatively correlated.
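The formula translates directly into code; a self-contained sketch with made-up vectors, using population standard deviations to match the n in the denominator:

```python
import math

def pearson_r(a, b):
    # Pearson product-moment coefficient, following the formula above
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((x - mean_b) ** 2 for x in b) / n)
    cross = sum(x * y for x, y in zip(a, b))   # sum of the AB cross-product
    return (cross - n * mean_a * mean_b) / (n * sd_a * sd_b)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0: positively correlated
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0: negatively correlated
```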
22. Data Integration: Correlation Analysis (Categorical Data). The $\chi^2$ (chi-square) test:

$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$

The larger the $\chi^2$ value, the more likely the variables are related. The cells that contribute the most to the $\chi^2$ value are those whose actual count is very different from the expected count. Correlation does not imply causality: the number of hospitals and the number of car thefts in a city are correlated because both are causally linked to a third variable, population.
23. Chi-Square Calculation: An Example. Numbers in parentheses are expected counts, calculated from the data distribution in the two categories:

                            Play chess   Not play chess   Sum (row)
  Like science fiction      250 (90)      200 (360)         450
  Not like science fiction   50 (210)    1000 (840)        1050
  Sum (col.)                300          1200              1500

$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} \approx 507.94$

This shows that like_science_fiction and play_chess are correlated in the group.
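The same calculation in a few lines of Python, deriving the expected counts from the row and column totals:

```python
# Observed counts from the table above: rows = science fiction, cols = chess
observed = [[250, 200],     # like science fiction
            [50, 1000]]     # not like science fiction

row_sums = [sum(row) for row in observed]          # [450, 1050]
col_sums = [sum(col) for col in zip(*observed)]    # [300, 1200]
total = sum(row_sums)                              # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_sums[i] * col_sums[j] / total
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 2))   # 507.94: the attributes are strongly correlated
```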
24. Data Transformation Smoothing: remove noise from data Aggregation: summarization, data cube construction Generalization: concept hierarchy climbing Normalization: scaled to fall within a small, specified range min-max normalization z-score normalization normalization by decimal scaling Attribute/feature construction New attributes constructed from the given ones
25. Data Transformation: Normalization. Min-max normalization to $[\text{new\_min}_A, \text{new\_max}_A]$:

$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$

Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,000 is mapped to $(73{,}000 - 12{,}000)/(98{,}000 - 12{,}000) \approx 0.709$.

Z-score normalization ($\mu$: mean, $\sigma$: standard deviation):

$v' = \frac{v - \mu}{\sigma}$

Ex. Let $\mu = 54{,}000$, $\sigma = 16{,}000$. Then $73,000 is mapped to $(73{,}000 - 54{,}000)/16{,}000 \approx 1.19$.

Normalization by decimal scaling:

$v' = \frac{v}{10^j}$, where j is the smallest integer such that $\max(|v'|) < 1$.
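A direct transcription of the three formulas as a sketch; the numbers reuse the income example above, and the decimal-scaling inputs (-986 and 917) are made-up illustration values:

```python
def min_max(v, mn, mx, new_mn=0.0, new_mx=1.0):
    return (v - mn) / (mx - mn) * (new_mx - new_mn) + new_mn

def z_score(v, mu, sigma):
    return (v - mu) / sigma

def decimal_scaling(values):
    j = 0                                   # smallest j with max(|v'|) < 1
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(min_max(73_000, 12_000, 98_000))   # 0.709...
print(z_score(73_000, 54_000, 16_000))   # 1.1875
print(decimal_scaling([-986, 917]))      # [-0.986, 0.917], here j = 3
```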
26. Data Reduction Strategies Why data reduction? A database/data warehouse may store terabytes of data, and complex data analysis/mining may take a very long time to run on the complete data set. Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results. Data reduction strategies: aggregation, sampling, dimensionality reduction, feature subset selection, feature creation, discretization and binarization, attribute transformation.
27. Data Reduction: Aggregation. Combining two or more attributes (or objects) into a single attribute (or object). Purpose: data reduction (reduce the number of attributes or objects); change of scale (cities aggregated into regions, states, countries, etc.); more “stable” data (aggregated data tends to have less variability).
28. Data Reduction: Aggregation. (figure: variation of precipitation in Australia; standard deviation of average monthly precipitation vs. standard deviation of average yearly precipitation)
29. Data Reduction: Sampling. Sampling is the main technique employed for data selection. It is often used for both the preliminary investigation of the data and the final data analysis. Statisticians sample because obtaining the entire set of data of interest is too expensive or time-consuming. Sampling is used in data mining because processing the entire set of data of interest is too expensive or time-consuming.
30. Data Reduction: Types of Sampling. Simple random sampling: there is an equal probability of selecting any particular item. Sampling without replacement: as each item is selected, it is removed from the population. Sampling with replacement: objects are not removed from the population as they are selected for the sample; the same object can be picked more than once.
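In Python's standard library the two schemes map onto random.sample (without replacement) and random.choices (with replacement); the population of 100 integers and the seed are arbitrary:

```python
import random

population = list(range(100))
random.seed(42)   # fixed seed so the example is reproducible

# Simple random sampling WITHOUT replacement: each item picked at most once
without_repl = random.sample(population, 10)

# Sampling WITH replacement: the same object can be picked more than once
with_repl = random.choices(population, k=10)

print(without_repl)
print(with_repl)
```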
31. Data Reduction: Dimensionality Reduction. Purpose: avoid the curse of dimensionality; reduce the amount of time and memory required by data mining algorithms; allow data to be more easily visualized; may help to eliminate irrelevant features or reduce noise. Techniques: Principal Component Analysis, Singular Value Decomposition, and other supervised and non-linear techniques.
32. Dimensionality Reduction: PCA. The goal is to find a projection that captures the largest amount of variation in the data. (figure: 2-D points in the x1-x2 plane with the leading direction e)
33. Dimensionality Reduction: PCA. Find the eigenvectors of the covariance matrix; the eigenvectors define the new space. (figure: the same points with the new axis e)
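A minimal NumPy sketch of exactly this recipe: center the data, form the covariance matrix, take its eigenvectors, and project onto the leading one. The synthetic 2-D data set is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 points stretched along one direction, so most variance lies on one axis
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])
X -= X.mean(axis=0)                        # center the data

cov = np.cov(X, rowvar=False)              # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order

order = np.argsort(eigvals)[::-1]          # largest variance first
components = eigvecs[:, order]             # the new axes (principal components)
projected = X @ components[:, :1]          # keep only the first component

print(eigvals[order])                      # variance captured by each axis
print(projected.shape)                     # (200, 1): 2-D reduced to 1-D
```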
34. Data Reduction: Feature Subset Selection. Another way to reduce the dimensionality of data. Redundant features duplicate much or all of the information contained in one or more other attributes. Example: the purchase price of a product and the amount of sales tax paid. Irrelevant features contain no information that is useful for the data mining task at hand. Example: a student's ID is often irrelevant to the task of predicting the student's GPA.
35. Data Reduction: Feature Subset Selection Techniques: Brute-force approach: try all possible feature subsets as input to the data mining algorithm. Filter approaches: features are selected before the data mining algorithm is run. Wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes.
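A small sketch of the filter idea: score each feature against the target before any mining algorithm runs and keep only the well-scoring ones. The synthetic features and the 0.5 cutoff are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
target = rng.normal(size=n)
X = np.column_stack([
    target + rng.normal(scale=0.1, size=n),     # informative feature
    rng.normal(size=n),                         # irrelevant feature
    -2 * target + rng.normal(scale=0.2, size=n),
])

# Filter: rank features by absolute correlation with the target
scores = [abs(np.corrcoef(X[:, j], target)[0, 1]) for j in range(X.shape[1])]
keep = [j for j, s in enumerate(scores) if s > 0.5]

print(scores)   # the irrelevant column scores near 0
print(keep)     # retained feature indices, here [0, 2]
```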
36. Data Reduction: Feature Creation. Create new attributes that can capture the important information in a data set much more efficiently than the original attributes. Three general methodologies: feature extraction (domain-specific); mapping data to a new space; feature construction (combining features).
37. Data Reduction: Mapping Data to a New Space. Fourier transform; wavelet transform. (figure: two sine waves; two sine waves + noise; the frequency-domain view)
38. Data Reduction: Discretization Using Class Labels. Entropy-based approach. (figure: discretization into 3 categories for both x and y; into 5 categories for both x and y)
39. Data Reduction: Discretization Without Using Class Labels. (figure: original data; equal interval width; equal frequency; K-means)
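Equal-width and equal-frequency splits in NumPy, reusing the price list from the binning slide with k = 3 (a k-means variant would cluster the values instead; omitted for brevity):

```python
import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
k = 3

# Equal interval width: split the range [min, max] into k same-width bins
edges = np.linspace(data.min(), data.max(), k + 1)
equal_width = np.digitize(data, edges[1:-1])

# Equal frequency: cut at quantiles so each bin holds roughly the same count
q_edges = np.quantile(data, [1 / 3, 2 / 3])
equal_freq = np.digitize(data, q_edges)

print(equal_width)   # bin index (0..2) for each value
print(equal_freq)    # three bins of four values each
```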
40. Data Reduction: Attribute Transformation. A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values. Simple functions: $x^k$, $\log(x)$, $e^x$, $|x|$. Standardization and normalization.
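For instance, a log transform compresses a wide, skewed range into evenly spaced values; the sample values are arbitrary:

```python
import math

x = [1.0, 10.0, 100.0, 1000.0]
log_x = [math.log(v) for v in x]        # each step of x10 becomes +ln(10)

print([round(v, 2) for v in log_x])     # [0.0, 2.3, 4.61, 6.91]
```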