
Unit 1 Assignment


U15IT601 Data Mining Assignment - 1

Unit -1

1. Illustrate the starnet query model for an electronics data warehouse with four dimensions
(location, customer, item, and time), with suitable footprints for each dimension.
2. How many cuboids are generated for ‘n’ dimensions with ‘l’ levels?
3. Suppose that a data warehouse for Big University consists of the following four dimensions:
student, course, semester, and instructor, and two measures, count and avg_grade. When at the
lowest conceptual level (e.g., for a given student, course, semester, and instructor
combination), the avg_grade measure stores the actual course grade of the student. At higher
conceptual levels, avg_grade stores the average grade for the given combination.
(a) Draw a snowflake schema diagram for the data warehouse.
(b) Starting with the base cuboid [student; course; semester; instructor], what specific OLAP
operations (e.g., roll-up from semester to year) should one perform in order to list the
average grade of CS courses for each Big University student?
(c) If each dimension has five levels (including all), such as "student < major < status <
university < all", how many cuboids will this cube contain (including the base and apex
cuboids)?
4. Suppose that a data warehouse consists of the three dimensions time, doctor, and
patient, and the two measures count and charge, where charge is the fee that a doctor
charges a patient for a visit.
(a) Enumerate three classes of schemas that are popularly used for modeling data
warehouses.
(b) Draw a schema diagram for the above data warehouse using one of the schema
classes listed in (a)
(c) Starting with the base cuboid {day, doctor, patient}, what specific OLAP
operations should be performed in order to list the total fee collected by each
doctor in 2015?
5. Suppose that the following table is derived by attribute-oriented induction (AOI):

Class        Birth_Place   Count
Programmer   Canada        215
             Others        120
DBA          Canada        50
             Others        80

(a) Transform the table into a crosstab showing the associated t-weights and d-weights.
(b) Map the class Programmer into a bidirectional quantitative descriptive rule.
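
For questions 2 and 3(c) above, a small sketch may help check the arithmetic. It assumes the usual counting rule in which a dimension with L_i hierarchy levels (not counting the virtual top level all) can appear in a cuboid at any of L_i + 1 levels, so the total number of cuboids is the product of (L_i + 1) over all n dimensions. The function name cuboid_count is made up for this illustration.

```python
from math import prod


def cuboid_count(levels_per_dimension):
    """Total cuboids in a cube, given each dimension's number of levels
    EXCLUDING the virtual top level 'all' (an assumption of this sketch)."""
    return prod(levels + 1 for levels in levels_per_dimension)


# Question 3(c): four dimensions with five levels each including 'all',
# i.e. four levels each once 'all' is excluded.
big_university = cuboid_count([4, 4, 4, 4])
print(big_university)  # 625

# With no hierarchies (one level per dimension) the rule reduces to 2**n.
print(cuboid_count([1, 1, 1]))  # 8
```

Under this rule, question 3(c) works out to 5^4 = 625 cuboids.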

Unit -2

7 Marks

1. Data quality can be assessed using various issues such as accuracy, completeness and
consistency. For each of the specified issues, discuss how data quality assessment can
depend on the intended use of the data with suitable examples. Suggest other dimensions
of data quality.
2. Analyze the kinds of patterns that can be mined among different types of data.
3. The contingency table below records the preferred reading and gender of 1500
people. Apply the chi-square test to determine whether gender and preferred
reading are correlated.

Unit -3

2 Marks

1. Suppose that the data for analysis includes the attribute age. The age values for the
data tuples are (in increasing order)
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25.
i. What is the mean of the data for binning?
ii. What is the min-max binning method for the above data?
2. Why is tree pruning useful in decision tree induction? What is a drawback of using a
separate set of tuples to evaluate pruning?
3. Why are decision tree classifiers so popular?
4. Distinguish between lazy learners and eager learners.
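
For the age data in question 1 of this list, the following is a minimal sketch of one common reading of "mean for binning": partition the sorted values into equal-depth bins and smooth each bin by its mean. The bin depth of 4 and the helper names equal_depth_bins and smooth_by_bin_means are assumptions for illustration; the question itself does not fix a depth.

```python
from statistics import mean

# Age values from the question, already sorted in increasing order.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25]


def equal_depth_bins(sorted_values, depth):
    """Split sorted values into consecutive bins of `depth` items each."""
    return [sorted_values[i:i + depth] for i in range(0, len(sorted_values), depth)]


def smooth_by_bin_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[mean(b)] * len(b) for b in bins]


bins = equal_depth_bins(ages, depth=4)  # depth=4 is an assumed choice
smoothed = smooth_by_bin_means(bins)
overall_mean = mean(ages)

print(overall_mean)               # 19.5
print([b[0] for b in smoothed])   # [15, 20, 23.5]
```

A different depth, or equal-width bins, would give different bin means; only the overall mean of 19.5 is independent of that choice.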
7 Marks

1. Predict the salary of a person with 10 years of experience by applying linear regression using
least square method on given salary data.

2. Apply naïve bayes classification to classify Red Domestic SUV, by using given Car
theft training data set.
3. Compare the advantages and disadvantages of eager classification versus lazy
classification.
4. What problem can association rules lead to? Show with an example how it can be
tackled by correlation analysis.
5. Why is naïve Bayesian classification called “naïve”? Briefly outline the major ideas
of naïve Bayesian classification.
6. Use these methods to normalize the following group of data:
200, 300, 400, 600,1000
(a) min-max normalization by setting min = 0 and max = 1
(b) z-score normalization
7. A database has five transactions. Let min_sup = 60% and min_conf = 80%.

Find all frequent itemsets using Apriori and FP-growth, respectively.


8. Construct the decision tree using ID3 algorithm for given data.
9. The transactional data for AllElectronics is shown in the above table. The data contain the
frequent itemset X = {I1, I2, I5}. What are the association rules that can be generated from X?
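
For question 6 above, the two normalizations can be sketched as follows. The helper names are made up for this illustration, and the z-score uses the population standard deviation (dividing by n); using the sample standard deviation instead is an equally defensible reading and would change the z-scores.

```python
from statistics import mean, pstdev

data = [200, 300, 400, 600, 1000]


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so min(values) -> new_min, max(values) -> new_max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score_normalize(values):
    """Center on the mean and divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population std dev: an assumption of this sketch
    return [(v - mu) / sigma for v in values]


min_max = min_max_normalize(data)
z_scores = z_score_normalize(data)

print(min_max)                           # [0.0, 0.125, 0.25, 0.5, 1.0]
print([round(z, 3) for z in z_scores])   # [-1.061, -0.707, -0.354, 0.354, 1.768]
```

Here the mean is 500 and the population standard deviation is about 282.84.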

Unit -4

2 Marks

1. Differentiate between the agglomerative hierarchical clustering method and the divisive
hierarchical clustering method.
2. Clustering has been popularly recognized as an important data mining task with
broad applications. Give one application example that takes clustering as a major
data mining function.

7 Marks
1. Given the following measurements for the variable age:
18, 22, 25, 42, 28, 43, 33, 35, 56, 28,
standardize the variable by the following:
(a) Compute the mean absolute deviation of age.
(b) Compute the z-score for the first four measurements.
2. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8)
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using p = 3.
3. Suppose that the data mining task is to cluster the following eight points into three
clusters: A1(3,9), A2(2,5), A3(7,5), B1(8,9), B2(6,5), B3(6,5), C1(1,2), C2(5,10). The
distance function is Euclidean distance. The initial centers of the clusters are A1, B1, and
C1. Use the k-means algorithm to show only the three cluster centers after the first
iteration of the algorithm.
4. The data mining task is to cluster the following eight points into three clusters:
A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9).
The distance function is Euclidean distance. The initial centers of the clusters are A1, B1,
and C1. Use the k-means algorithm to show only the three cluster centers after the first
iteration of the algorithm.
5. Both k-means and k-medoids algorithms can perform effective clustering. Illustrate
the strengths and weaknesses of k-means in comparison with the k-medoids algorithm.
Also, illustrate the strengths and weaknesses of these schemes in comparison with a
hierarchical clustering scheme.
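
The distance computations in question 2 and the first k-means iteration in question 4 can be checked with a short sketch. The function names are made up for this illustration, and the k-means step is the plain assign-then-average update with ties going to the first center; it assumes every cluster receives at least one point, which holds here because the initial centers are themselves data points.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def kmeans_one_iteration(points, centers):
    """Assign each point to its nearest center (Euclidean distance),
    then return the mean of each cluster as the updated center."""
    clusters = [[] for _ in centers]
    for point in points:
        distances = [minkowski(point, c, 2) for c in centers]
        clusters[distances.index(min(distances))].append(point)
    # Assumes no cluster is empty (true when centers are data points).
    return [
        tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
        for cluster in clusters
    ]


# Question 2: the two tuples.
x, y = (22, 1, 42, 10), (20, 0, 36, 8)
print(round(minkowski(x, y, 2), 3))  # 6.708  (Euclidean)
print(minkowski(x, y, 1))            # 11.0   (Manhattan)
print(round(minkowski(x, y, 3), 3))  # 6.153  (Minkowski, p = 3)

# Question 4: eight points, initial centers A1, B1, C1.
points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
new_centers = kmeans_one_iteration(points, [(2, 10), (5, 8), (1, 2)])
print(new_centers)  # [(2.0, 10.0), (6.0, 6.0), (1.5, 3.5)]
```

In this run the first cluster keeps only A1, the second collects A3, B1, B2, B3, and C2, and the third collects A2 and C1.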

Unit -5

2 Marks

1. “Is data mining a threat to privacy and data security?” Comment on the statement.
2. List out the differences between row scalability and column scalability.
3. What are the differences between visual data mining and data visualization?

7 Marks

1. Assume that your local bank has a data mining system. The bank has been studying your debit
card usage patterns. Noticing that you make many transactions at home renovation stores, the
bank decides to contact you, offering information regarding their special loans for home
improvements.
a) Discuss how this may conflict with your right to privacy.
b) Describe another situation where you feel that data mining can infringe on your
privacy.
2. What are the major challenges faced in bringing data mining research to market? Illustrate
one data mining research issue that, in your view, may have a strong impact on the market
and on society. Discuss how to approach such a research issue.
3. Propose a few implementation methods for audio data mining. Can we integrate audio and
visual data mining to bring fun and power to data mining? Is it possible to develop some
video data mining methods? State some scenarios and your solutions to make such integrated
audiovisual mining effective.
4. Assume that your local bank has a data mining system. The bank has been studying your debit
card usage patterns. Noticing that you make many transactions at home renovation stores, the
bank decides to contact you, offering information regarding their special loans for home
improvements.
a) Describe a privacy-preserving data mining method that may allow the bank to
perform customer pattern analysis without infringing on customers' right to privacy.
b) What are some examples where data mining could be used to help society? Can you
think of ways it could be used that may be detrimental to society?
