ML Assignment-1
Bhavya Gupta
1700121C202
3.1. Data quality can be assessed in terms of several issues, including accuracy,
completeness, and consistency. For each of the above three issues, discuss how the
assessment of data quality can depend on the intended use of the data, giving examples.
Propose two other dimensions of data quality.
• For accuracy, first consider a recommendation system for online clothing purchases. When it
comes to birth date, the system may only care about the year in which the user was born, so that
it can provide suitable choices. However, a Facebook app that makes birthday calendars for
friends must acquire the exact day on which a user was born to produce a credible calendar.
• For completeness, a product manager may not care much if customers’ address information is
missing while a marketing analyst considers address information essential for analysis.
• For consistency, consider a database manager who is merging two big movie information
databases into one. When he decides whether two entries refer to the same movie, he may
check the entry’s title and release date. Here in either database, the release date must be
consistent with the title or there will be annoying problems. But when a user is searching for a
movie’s information just for entertainment using either database, whether the release date is
consistent with the title is not so important. A user usually cares more about the movie’s
content.
Two other dimensions that can be used to assess the quality of data can be taken from the
following:
• Timeliness: Data must be available within a time frame that allows it to be useful for decision
making.
• Believability: Data values must be within the range of possible results in order to be useful for
decision making.
• Value added: Data must provide additional value in terms of information that offsets the cost
of collecting and accessing it.
• Interpretability: Data must not be so complex that the effort to understand the information it
provides exceeds the benefit of its analysis.
• Accessibility: Data must be accessible so that the effort to collect it does not exceed the
benefit from its use.
3.2. In real-world data, tuples with missing values for some attributes are a common
occurrence. Describe various methods for handling this problem.
The various methods for handling the problem of missing values in data tuples include:
(a) Ignoring the tuple: This is usually done when the class label is missing (assuming the mining
task involves classification or description). This method is not very effective unless the tuple
contains several attributes with missing values. It is especially poor when the percentage of
missing values per attribute varies considerably.
(b) Manually filling in the missing value: In general, this approach is time-consuming and may
not be a reasonable task for large data sets with many missing values, especially when the
value to be filled in is not easily determined.
(c) Using a global constant to fill in the missing value: Replace all missing attribute values
by the same constant, such as a label like “Unknown,” or −∞. If missing values are replaced by,
say, “Unknown,” then the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common — that of “Unknown.” Hence, although this
method is simple, it is not recommended.
(d) Using a measure of central tendency for the attribute, such as the mean (for symmetric
numeric data), the median (for asymmetric numeric data), or the mode (for nominal data):
For example, suppose that the average income of AllElectronics customers is $28,000 and that
the data are symmetric. Use this value to replace any missing values for income.
(e) Using the attribute mean for numeric (quantitative) values or attribute mode for
nominal values, for all samples belonging to the same class as the given tuple: For
example, if classifying customers according to credit risk, replace the missing value with the
average income value for customers in the same credit risk category as that of the given tuple. If
the data are numeric and skewed, use the median value.
(f) Using the most probable value to fill in the missing value: This may be determined
with regression, inference-based tools using Bayesian formalism, or decision tree induction. For
example, using the other customer attributes in your data set, you may construct a decision tree
to predict the missing values for income.
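Strategies (d) and (e) above can be sketched in Python; the records, credit-risk labels, and income figures below are hypothetical stand-ins, not data from the text:

```python
import statistics

# Hypothetical customer records; None marks a missing income value.
records = [
    {"risk": "low",  "income": 30000},
    {"risk": "low",  "income": None},
    {"risk": "high", "income": 18000},
    {"risk": "high", "income": 22000},
]

# (d) Fill with the overall mean of the observed values.
observed = [r["income"] for r in records if r["income"] is not None]
overall_mean = statistics.mean(observed)

# (e) Fill with the mean of the same class (credit-risk category).
def class_mean(risk):
    vals = [r["income"] for r in records
            if r["risk"] == risk and r["income"] is not None]
    return statistics.mean(vals)

imputed = [r["income"] if r["income"] is not None else class_mean(r["risk"])
           for r in records]
```

For skewed numeric data, `statistics.median` would replace `statistics.mean` in the same way.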
3.3. Exercise 2.2 gave the following data (in increasing order) for the attribute age: 13, 15,
16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3.
Illustrate your steps. Comment on the effect of this technique for the given data.
(b) How might you determine outliers in the data?
(c) What other methods are there for data smoothing?
(a) The following steps are required to smooth the above data using smoothing by bin means
with a bin depth of 3:
• Step 1: Sort the data.
• Step 2: Partition the data into equi-depth bins of depth 3.
Bin 1: 13, 15, 16 Bin 2: 16, 19, 20 Bin 3: 20, 21, 22
Bin 4: 22, 25, 25 Bin 5: 25, 25, 30 Bin 6: 33, 33, 35
Bin 7: 35, 35, 35 Bin 8: 36, 40, 45 Bin 9: 46, 52, 70
• Step 3: Calculate the arithmetic mean of each bin.
• Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin.
Bin 1: 14.67, 14.67, 14.67 Bin 2: 18.33, 18.33, 18.33 Bin 3: 21, 21, 21
Bin 4: 24, 24, 24 Bin 5: 26.67, 26.67, 26.67 Bin 6: 33.67, 33.67, 33.67
Bin 7: 35, 35, 35 Bin 8: 40.33, 40.33, 40.33 Bin 9: 56, 56, 56
This method smooths a sorted data value by consulting its “neighborhood”. It performs local
smoothing.
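The binning steps above can be reproduced with a minimal Python sketch (an illustration, not part of the original exercise):

```python
# Sorted age data from Exercise 3.3.
age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
       30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

depth = 3
# Step 2: partition the sorted data into equi-depth bins of depth 3.
bins = [age[i:i + depth] for i in range(0, len(age), depth)]
# Steps 3-4: replace each value in a bin by that bin's arithmetic mean.
smoothed = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]
```

The first bin (13, 15, 16) becomes 14.67, 14.67, 14.67, and the last bin (46, 52, 70) becomes 56, 56, 56, matching the result above.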
(b) Outliers in the data may be detected by clustering, where similar values are organized into
groups, or ‘clusters’. Values that fall outside of the set of clusters may be considered outliers.
Alternatively, a combination of computer and human inspection can be used, where a
predetermined data distribution is implemented to allow the computer to identify possible
outliers. These possible outliers can then be verified by human inspection with much less effort
than would be required to verify the entire initial data set.
(c) Other methods that can be used for data smoothing include alternate forms of binning such
as smoothing by bin medians or smoothing by bin boundaries. Alternatively, equal-width bins can
be used to implement any of the forms of binning, where the interval range of values in each bin
is constant. Methods other than binning include using regression techniques to smooth the data
by fitting it to a function, such as through linear or multiple regression. Also, classification
techniques can be used to implement concept hierarchies that smooth the data by rolling up
lower-level concepts to higher-level concepts.
3.5. What are the value ranges of the following normalization methods?
(a) min-max normalization
(b) z-score normalization
(c) z-score normalization using the mean absolute deviation instead of standard
deviation
(d) normalization by decimal scaling
(a) min-max normalization can define any value range and linearly maps the original data onto it;
the transformed values therefore lie within the chosen range [new_minA, new_maxA].
(b) z-score normalization normalizes the values of an attribute A based on its mean Ā and
standard deviation σA. The value range is [(minA − Ā)/σA, (maxA − Ā)/σA].
(c) z-score normalization using the mean absolute deviation is a variation of z-score
normalization that replaces the standard deviation with the mean absolute deviation of A,
denoted sA. The value range is [(minA − Ā)/sA, (maxA − Ā)/sA].
(d) normalization by decimal scaling normalizes by moving the decimal point of the values of
attribute A. The value range is [minA/10^j, maxA/10^j], where j is the smallest integer such that
max(|v′|) < 1, so all normalized values lie in (−1, 1).
3.6. Use the methods below to normalize the following group of data:
200, 300, 400, 600, 1000
(a) min-max normalization by setting min = 0 and max = 1
(b) z-score normalization
(c) z-score normalization using the mean absolute deviation instead of standard
deviation
(d) normalization by decimal scaling
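The four normalizations applied to this data can be sketched as follows; using the population standard deviation in (b) and j = 4 in (d) (so that max(|v′|) < 1) are assumptions based on the usual definitions:

```python
import statistics

data = [200, 300, 400, 600, 1000]

# (a) min-max normalization onto [0, 1].
mn, mx = min(data), max(data)
minmax = [(v - mn) / (mx - mn) for v in data]

# (b) z-score normalization (population standard deviation).
mean = statistics.mean(data)
sd = statistics.pstdev(data)
zscore = [(v - mean) / sd for v in data]

# (c) z-score normalization with the mean absolute deviation.
mad = sum(abs(v - mean) for v in data) / len(data)
zmad = [(v - mean) / mad for v in data]

# (d) decimal scaling: j = 4, since 1000 / 10**3 = 1 is not < 1.
decimal = [v / 10**4 for v in data]
```

For example, the min-max values come out as 0.0, 0.125, 0.25, 0.5, 1.0, and the decimal-scaled values as 0.02, 0.03, 0.04, 0.06, 0.1.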
3.7. Using the data for age given in Exercise 3.3, answer the following:
(a) Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
(b) Use z-score normalization to transform the value 35 for age, where the standard
deviation of age is 12.94 years.
(c) Use normalization by decimal scaling to transform the value 35 for age.
(d) Comment on which method you would prefer to use for the given data, giving reasons
as to why.
(a) Using the corresponding equation with minA = 13, maxA = 70, new minA = 0, new maxA =
1.0, then v = 35 is transformed to v′ = 0.39.
(b) Using the corresponding equation where Ā = 809/27 ≈ 29.96 and σA = 12.94,
v = 35 is transformed to v′ = 0.39.
(c) Using the corresponding equation where j = 2, v = 35 is transformed to v′ = 0.35.
(d) Given the data, one may prefer decimal scaling for normalization as such a transformation
would maintain the data distribution and be intuitive to interpret, while still allowing mining on
specific age groups. Min-max normalization has the undesired effect of not permitting any
future values to fall outside the current minimum and maximum values without encountering an
“out of bounds error”. As it is probable that such values may be present in future data, this
method is less appropriate. Also, z-score normalization transforms values into measures that
represent their distance from the mean, in terms of standard deviations. It is probable that this
type of transformation would not increase the information value of the attribute in terms of
intuitiveness to users or in usefulness of mining results.
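The three transformations of v = 35 above can be checked with a short sketch:

```python
# Sorted age data from Exercise 3.3.
age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
       30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
v = 35

# (a) min-max normalization onto [0.0, 1.0]: 22/57.
minmax = (v - min(age)) / (max(age) - min(age))

# (b) z-score with the given sigma = 12.94; mean = 809/27.
mean = sum(age) / len(age)
z = (v - mean) / 12.94

# (c) decimal scaling with j = 2, since max(age) = 70 < 100.
dec = v / 10**2
```

All three agree with the values stated above: 0.39, 0.39, and 0.35, respectively.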
3.8. Using the data for age and body fat given in Exercise 2.4, answer the following:
(a) Normalize the two attributes based on z-score normalization.
(b) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are
these two attributes positively or negatively correlated? Compute their covariance.
(a)
(b) The correlation coefficient is 0.82. The variables are positively correlated.
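Since the age/%fat pairs from Exercise 2.4 are not reproduced here, the sketch below only illustrates the computation on a few hypothetical stand-in values; `pearson` is an assumed helper name:

```python
import math

def pearson(x, y):
    """Pearson's product-moment correlation and population covariance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy), cov

# Hypothetical stand-in pairs (the real data comes from Exercise 2.4).
r, cov = pearson([23, 27, 39, 49], [9.5, 7.8, 31.4, 27.2])
```

A positive r together with a positive covariance indicates positive correlation, as in the stated answer of 0.82.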
3.9. Suppose a group of 12 sales price records has been sorted as follows:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215.
Partition them into three bins by each of the following methods.
(a) equal-frequency (equi-depth) partitioning
(b) equal-width partitioning
(c) clustering
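The (a) and (b) partitionings can be sketched as follows; the bin-assignment rule used for equal-width binning is one reasonable convention, not the only one:

```python
prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

# (a) equal-frequency (equi-depth): three bins of four values each.
depth = len(prices) // 3
equi_depth = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# (b) equal-width: interval width = (215 - 5) / 3 = 70.
width = (max(prices) - min(prices)) / 3
equi_width = [[], [], []]
for v in prices:
    idx = min(int((v - min(prices)) // width), 2)  # clamp 215 into the last bin
    equi_width[idx].append(v)
```

Equal-frequency gives (5, 10, 11, 13), (15, 35, 50, 55), (72, 92, 204, 215); equal-width puts everything up to 72 in the first bin, 92 alone in the second, and 204, 215 in the third, showing how skewed data distorts equal-width bins.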
3.10. Use a flowchart to summarize the following procedures for attribute subset
selection:
(a) stepwise forward selection
(b) stepwise backward elimination
(c) a combination of forward selection and backward elimination
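In lieu of a flowchart, stepwise forward selection (a) can be sketched as a greedy loop; `score` is a hypothetical evaluation function (e.g. validation accuracy on the reduced attribute set) supplied by the caller:

```python
def forward_selection(attributes, score, k):
    """Greedily add the best remaining attribute until k are chosen
    or no addition improves the score."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        # Stop if adding the best remaining attribute does not help.
        if score(selected + [best]) <= score(selected):
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```

Backward elimination (b) is the mirror image, starting from the full set and greedily dropping the least useful attribute; the combined method (c) interleaves one addition with one removal per step.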