ML Assignment-1
Bhavya Gupta
1700121C202
3.1. Data quality can be assessed in terms of several issues, including accuracy,
completeness, and consistency. For each of the above three issues, discuss how the
assessment of data quality can depend on the intended use of the data, giving examples.
Propose two other dimensions of data quality.
• For accuracy, first consider a recommendation system for online clothing purchases. When it
comes to birth date, the system may only care about the year in which the user was born, so that
it can provide suitable choices. However, a Facebook app that makes birthday calendars for
friends must acquire the exact day on which a user was born to produce a credible calendar.
• For completeness, a product manager may not care much if customers’ address information is
missing while a marketing analyst considers address information essential for analysis.
• For consistency, consider a database manager who is merging two big movie information
databases into one. When he decides whether two entries refer to the same movie, he may
check the entry’s title and release date. Here in either database, the release date must be
consistent with the title or there will be annoying problems. But when a user is searching for a
movie’s information just for entertainment using either database, whether the release date is
consistent with the title is not so important. A user usually cares more about the movie’s
content.
Two other dimensions that can be used to assess the quality of data can be taken from the
following:
• Timeliness: Data must be available within a time frame that allows it to be useful for decision
making.
• Believability: Data values must be within the range of possible results in order to be useful for
decision making.
• Value added: Data must provide additional value in terms of information that offsets the cost
of collecting and accessing it.
• Interpretability: Data must not be so complex that the effort to understand the information it
provides exceeds the benefit of its analysis.
• Accessibility: Data must be accessible so that the effort to collect it does not exceed the
benefit from its use.
3.2. In real-world data, tuples with missing values for some attributes are a common
occurrence. Describe various methods for handling this problem.
The various methods for handling the problem of missing values in data tuples include:
(a) Ignoring the tuple: This is usually done when the class label is missing (assuming the mining
task involves classification or description). This method is not very effective unless the tuple
contains several attributes with missing values. It is especially poor when the percentage of
missing values per attribute varies considerably.
(b) Manually filling in the missing value: In general, this approach is time-consuming and may
not be a reasonable task for large data sets with many missing values, especially when the
value to be filled in is not easily determined.
(c) Using a global constant to fill in the missing value: Replace all missing attribute values
by the same constant, such as a label like “Unknown,” or −∞. If missing values are replaced by,
say, “Unknown,” then the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common — that of “Unknown.” Hence, although this
method is simple, it is not recommended.
(d) Using a measure of central tendency for the attribute, such as the mean (for symmetric
numeric data), the median (for asymmetric numeric data), or the mode (for nominal data):
For example, suppose that the average income of AllElectronics customers is $28,000 and that
the data are symmetric. Use this value to replace any missing values for income.
(e) Using the attribute mean for numeric (quantitative) values or attribute mode for
nominal values, for all samples belonging to the same class as the given tuple: For
example, if classifying customers according to credit risk, replace the missing value with the
average income value for customers in the same credit risk category as that of the given tuple. If
the data are numeric and skewed, use the median value.
(f) Using the most probable value to fill in the missing value: This may be determined
with regression, inference-based tools using Bayesian formalism, or decision tree induction. For
example, using the other customer attributes in your data set, you may construct a decision tree
to predict the missing values for income.
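Strategies (d) and (e) above can be sketched in Python; the records, credit-risk labels, and income figures below are hypothetical stand-ins, not data from the text:

```python
import statistics

# Hypothetical customer records; None marks a missing income value.
records = [
    {"risk": "low",  "income": 30000},
    {"risk": "low",  "income": None},
    {"risk": "high", "income": 18000},
    {"risk": "high", "income": 22000},
]

# (d) Fill with the overall mean of the observed values.
observed = [r["income"] for r in records if r["income"] is not None]
overall_mean = statistics.mean(observed)

# (e) Fill with the mean of the same class (credit-risk category).
def class_mean(risk):
    vals = [r["income"] for r in records
            if r["risk"] == risk and r["income"] is not None]
    return statistics.mean(vals)

imputed = [r["income"] if r["income"] is not None else class_mean(r["risk"])
           for r in records]
```

For skewed numeric data, `statistics.median` would replace `statistics.mean` in the same way.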
3.3. Exercise 2.2 gave the following data (in increasing order) for the attribute age: 13, 15,
16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3.
Illustrate your steps. Comment on the effect of this technique for the given data.
(b) How might you determine outliers in the data?
(c) What other methods are there for data smoothing?
(a) The following steps are required to smooth the above data using smoothing by bin means
with a bin depth of 3:
• Step 1: Sort the data.
• Step 2: Partition the data into equi-depth bins of depth 3.
Bin 1: 13, 15, 16 Bin 2: 16, 19, 20 Bin 3: 20, 21, 22
Bin 4: 22, 25, 25 Bin 5: 25, 25, 30 Bin 6: 33, 33, 35
Bin 7: 35, 35, 35 Bin 8: 36, 40, 45 Bin 9: 46, 52, 70
• Step 3: Calculate the arithmetic mean of each bin.
• Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin.
Bin 1: 14.67, 14.67, 14.67 Bin 2: 18.33, 18.33, 18.33 Bin 3: 21, 21, 21
Bin 4: 24, 24, 24 Bin 5: 26.67, 26.67, 26.67 Bin 6: 33.67, 33.67, 33.67
Bin 7: 35, 35, 35 Bin 8: 40.33, 40.33, 40.33 Bin 9: 56, 56, 56
This method smooths a sorted data value by consulting its “neighborhood”. It performs local
smoothing.
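The binning steps above can be reproduced with a minimal Python sketch (an illustration, not part of the original exercise):

```python
# Sorted age data from Exercise 3.3.
age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
       30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

depth = 3
# Step 2: partition the sorted data into equi-depth bins of depth 3.
bins = [age[i:i + depth] for i in range(0, len(age), depth)]
# Steps 3-4: replace each value in a bin by that bin's arithmetic mean.
smoothed = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]
```

The first bin (13, 15, 16) becomes 14.67, 14.67, 14.67, and the last bin (46, 52, 70) becomes 56, 56, 56, matching the result above.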
(b) Outliers in the data may be detected by clustering, where similar values are organized into
groups, or ‘clusters’. Values that fall outside of the set of clusters may be considered outliers.
Alternatively, a combination of computer and human inspection can be used, where a
predetermined data distribution is implemented to allow the computer to identify possible
outliers. These possible outliers can then be verified by human inspection with much less effort
than would be required to verify the entire initial data set.
(c) Other methods that can be used for data smoothing include alternate forms of binning such
as smoothing by bin medians or smoothing by bin boundaries. Alternatively, equal-width bins can
be used to implement any of the forms of binning, where the interval range of values in each bin
is constant. Methods other than binning include using regression techniques to smooth the data
by fitting it to a function, such as through linear or multiple regression. Also, classification
techniques can be used to implement concept hierarchies that smooth the data by rolling up
lower-level concepts to higher-level concepts.
3.5. What are the value ranges of the following normalization methods?
(a) min-max normalization
(b) z-score normalization
(c) z-score normalization using the mean absolute deviation instead of standard
deviation
(d) normalization by decimal scaling
(a) min-max normalization can define any value range and linearly maps the original data onto it;
the transformed values therefore lie within the chosen range [new_minA, new_maxA].
(b) z-score normalization normalizes the values of an attribute A based on its mean Ā and
standard deviation σA. The value range is [(minA − Ā)/σA, (maxA − Ā)/σA].
(c) z-score normalization using the mean absolute deviation is a variation of z-score
normalization that replaces the standard deviation with the mean absolute deviation of A,
denoted sA. The value range is [(minA − Ā)/sA, (maxA − Ā)/sA].
(d) normalization by decimal scaling normalizes by moving the decimal point of the values of
attribute A. The value range is [minA/10^j, maxA/10^j], where j is the smallest integer such that
max(|v′|) < 1, so all normalized values lie in (−1, 1).
3.6. Use the methods below to normalize the following group of data:
200, 300, 400, 600, 1000
(a) min-max normalization by setting min = 0 and max = 1
(b) z-score normalization
(c) z-score normalization using the mean absolute deviation instead of standard
deviation
(d) normalization by decimal scaling
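The four normalizations applied to this data can be sketched as follows; using the population standard deviation in (b) and j = 4 in (d) (so that max(|v′|) < 1) are assumptions based on the usual definitions:

```python
import statistics

data = [200, 300, 400, 600, 1000]

# (a) min-max normalization onto [0, 1].
mn, mx = min(data), max(data)
minmax = [(v - mn) / (mx - mn) for v in data]

# (b) z-score normalization (population standard deviation).
mean = statistics.mean(data)
sd = statistics.pstdev(data)
zscore = [(v - mean) / sd for v in data]

# (c) z-score normalization with the mean absolute deviation.
mad = sum(abs(v - mean) for v in data) / len(data)
zmad = [(v - mean) / mad for v in data]

# (d) decimal scaling: j = 4, since 1000 / 10**3 = 1 is not < 1.
decimal = [v / 10**4 for v in data]
```

For example, the min-max values come out as 0.0, 0.125, 0.25, 0.5, 1.0, and the decimal-scaled values as 0.02, 0.03, 0.04, 0.06, 0.1.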
3.7. Using the data for age given in Exercise 3.3, answer the following:
(a) Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
(b) Use z-score normalization to transform the value 35 for age, where the standard
deviation of age is 12.94 years.
(c) Use normalization by decimal scaling to transform the value 35 for age.
(d) Comment on which method you would prefer to use for the given data, giving reasons
as to why.
(a) Using the corresponding equation with minA = 13, maxA = 70, new minA = 0, new maxA =
1.0, then v = 35 is transformed to v′ = 0.39.
(b) Using the corresponding equation where Ā = 809/27 ≈ 29.96 and σA = 12.94,
v = 35 is transformed to v′ = 0.39.
(c) Using the corresponding equation where j = 2, v = 35 is transformed to v′ = 0.35.
(d) Given the data, one may prefer decimal scaling for normalization as such a transformation
would maintain the data distribution and be intuitive to interpret, while still allowing mining on
specific age groups. Min-max normalization has the undesired effect of not permitting any
future values to fall outside the current minimum and maximum values without encountering an
“out of bounds error”. As it is probable that such values may be present in future data, this
method is less appropriate. Also, z-score normalization transforms values into measures that
represent their distance from the mean, in terms of standard deviations. It is probable that this
type of transformation would not increase the information value of the attribute in terms of
intuitiveness to users or in usefulness of mining results.
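The three transformations of v = 35 above can be checked with a short sketch:

```python
# Sorted age data from Exercise 3.3.
age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
       30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
v = 35

# (a) min-max normalization onto [0.0, 1.0]: 22/57.
minmax = (v - min(age)) / (max(age) - min(age))

# (b) z-score with the given sigma = 12.94; mean = 809/27.
mean = sum(age) / len(age)
z = (v - mean) / 12.94

# (c) decimal scaling with j = 2, since max(age) = 70 < 100.
dec = v / 10**2
```

All three agree with the values stated above: 0.39, 0.39, and 0.35, respectively.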
3.8. Using the data for age and body fat given in Exercise 2.4, answer the following:
(a) Normalize the two attributes based on z-score normalization.
(b) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are
these two attributes positively or negatively correlated? Compute their covariance.
(a)
(b) The correlation coefficient is 0.82. The variables are positively correlated.
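Since the age/%fat pairs from Exercise 2.4 are not reproduced here, the sketch below only illustrates the computation on a few hypothetical stand-in values; `pearson` is an assumed helper name:

```python
import math

def pearson(x, y):
    """Pearson's product-moment correlation and population covariance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy), cov

# Hypothetical stand-in pairs (the real data comes from Exercise 2.4).
r, cov = pearson([23, 27, 39, 49], [9.5, 7.8, 31.4, 27.2])
```

A positive r together with a positive covariance indicates positive correlation, as in the stated answer of 0.82.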
3.9. Suppose a group of 12 sales price records has been sorted as follows:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215.
Partition them into three bins by each of the following methods.
(a) equal-frequency (equi-depth) partitioning
(b) equal-width partitioning
(c) clustering
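The (a) and (b) partitionings can be sketched as follows; the bin-assignment rule used for equal-width binning is one reasonable convention, not the only one:

```python
prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

# (a) equal-frequency (equi-depth): three bins of four values each.
depth = len(prices) // 3
equi_depth = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# (b) equal-width: interval width = (215 - 5) / 3 = 70.
width = (max(prices) - min(prices)) / 3
equi_width = [[], [], []]
for v in prices:
    idx = min(int((v - min(prices)) // width), 2)  # clamp 215 into the last bin
    equi_width[idx].append(v)
```

Equal-frequency gives (5, 10, 11, 13), (15, 35, 50, 55), (72, 92, 204, 215); equal-width puts everything up to 72 in the first bin, 92 alone in the second, and 204, 215 in the third, showing how skewed data distorts equal-width bins.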
3.10. Use a flowchart to summarize the following procedures for attribute subset
selection:
(a) stepwise forward selection
(b) stepwise backward elimination
(c) a combination of forward selection and backward elimination
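In lieu of a flowchart, stepwise forward selection (a) can be sketched as a greedy loop; `score` is a hypothetical evaluation function (e.g. validation accuracy on the reduced attribute set) supplied by the caller:

```python
def forward_selection(attributes, score, k):
    """Greedily add the best remaining attribute until k are chosen
    or no addition improves the score."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        # Stop if adding the best remaining attribute does not help.
        if score(selected + [best]) <= score(selected):
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```

Backward elimination (b) is the mirror image, starting from the full set and greedily dropping the least useful attribute; the combined method (c) interleaves one addition with one removal per step.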