
HW 1


DATA MINING

HOMEWORK 1

Submitted By: Tauseef Shah


Registration No.: 022A8017729014

Submitted on Date: 9 November 2022

Due Date: Nov. 9, 2022
Submission requirements:

1. Suppose that a data warehouse consists of four dimensions, date, spectator, location,
and game, and two measures, count and charge, where charge is the fare that a
spectator pays when watching a game on a given date. Spectators may be students,
adults, or seniors, with each category having its own charge rate.
(a) Draw a star schema diagram for the data warehouse.

Star Schema

Fact Table: Date_ID, Spectator_ID, Location_ID, Game_ID, Charge, Count

Date dimension: Date_ID, Day, Day of the Week, Month, Quarter, Year
Spectator dimension: Spectator_ID, Spectator_name, patient_id, Phone, Address, Status,
Charge_rate
Game dimension: Game_ID, Game_Name, Description, producer
Location dimension: Location_ID, Phone #, Street, City, Province, Country

(b) Starting with the base cuboid [date, spectator, location, game], what specific
OLAP operations should one perform in order to list the total charge paid by
student spectators in Los Angeles?

PAGE 1 11/18/22

The specific OLAP operations to be performed are:

- Roll-up on date from date ID to all (the query does not restrict the date).
- Roll-up on game from game ID to all.
- Roll-up on location from location ID to city.
- Roll-up on spectator from spectator ID to status.
- Dice with status = "students" and city = "Los Angeles".
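
The selection and aggregation behind these operations can be sketched in MATLAB on a tiny, made-up fact table. The rows, variable names, and values below are hypothetical and are not data from the assignment; a further restriction (for example on year) would simply be one more logical condition in the mask.

```matlab
% Hypothetical fact rows, already joined with the spectator and location
% dimensions. All values are invented for illustration only.
status = {'student'; 'adult'; 'student'; 'senior'; 'student'};
city   = {'Los Angeles'; 'Los Angeles'; 'Chicago'; 'Los Angeles'; 'Los Angeles'};
charge = [10; 25; 10; 15; 12];

% Dice: keep only the rows with status = student and city = Los Angeles.
keep = strcmp(status, 'student') & strcmp(city, 'Los Angeles');

% Summing over the surviving rows aggregates away every dimension that is
% not part of the selection (the effect of rolling those dimensions up).
total_charge = sum(charge(keep));
fprintf('Total charge: %d\n', total_charge);
```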

(c) Bitmap indexing is a very useful optimization technique. Please present the pros
and cons of using bitmap indexing in this given data warehouse.

Bitmap indexing is advantageous for low-cardinality domains. For example, in this
cube, if the location dimension is bitmap indexed, comparison, join, and aggregation
operations over location reduce to bit arithmetic, which substantially reduces
processing time. Furthermore, long location-name strings can each be represented by
a single bit, which significantly reduces space and I/O. For dimensions with
high cardinality, such as date in this example, the bit vector used to represent the
bitmap index can be very long. For example, 10 years of data could produce 3650
distinct date values, meaning that every tuple in the fact table would require 3650
bits, or approximately 456 bytes, to hold the bitmap index. Bit-vector
compression techniques can overcome this difficulty to a certain degree.
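
The size estimate is plain arithmetic, and the bit-arithmetic claim can be illustrated with logical vectors. The sketch below assumes one distinct date value per day over 10 years (leap days ignored, as in the text) and uses a made-up five-row location column; none of these values come from the assignment.

```matlab
% Bitmap size for the date dimension: one bit per distinct date value.
n_dates = 10 * 365;                 % 3650 distinct dates, leap days ignored
bytes_per_tuple = n_dates / 8;      % bits converted to bytes

% Bitmap index on a low-cardinality column (hypothetical values).
city = {'Los Angeles'; 'Chicago'; 'Los Angeles'; 'Boston'; 'Chicago'};
bm_la      = strcmp(city, 'Los Angeles');   % one bit vector per distinct value
bm_chicago = strcmp(city, 'Chicago');

% A selection such as "Los Angeles OR Chicago" becomes a bitwise OR,
% and counting the matching rows is a sum over the resulting bits.
match   = bm_la | bm_chicago;
n_match = sum(match);
fprintf('%.2f bytes per tuple, %d matching rows\n', bytes_per_tuple, n_match);
```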

2. Suppose a hospital tested the age and body fat data for 18 randomly selected adults with
the following result:
age  23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7

I used MATLAB to solve this question. The MATLAB code is given at the end of this assignment.

(a) Calculate the mean, median, and standard deviation of age and %fat.

Age: mean = 46.4444, median = 51, standard deviation = 13.2186

%fat: mean = 28.783, median = 30.700, standard deviation = 9.254

(Both standard deviations are sample standard deviations, dividing by n - 1.)

(b) Draw the boxplots for age and %fat.

(c) Draw a scatter plot based on these two variables.

(d) Normalize age based on min-max normalization.

Min-max normalization to [0, 1]: v' = (v - 23) / (61 - 23)

0.0000
0.0000
0.1053
0.1053
0.4211
0.4737
0.6316
0.6842
0.7105
0.7632
0.8158
0.8158
0.8684
0.8947
0.9211
0.9211
0.9737
1.0000
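
Min-max normalization and z-score normalization are easy to confuse in MATLAB, because `normalize(age)` with no method argument returns z-scores, while `normalize(age, 'range')` rescales to [0, 1]. The sketch below writes both formulas out by hand so the difference is visible; the variable names are illustrative.

```matlab
age = [23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61];

% Min-max normalization to [0, 1]: v' = (v - min) / (max - min).
age_minmax = (age - min(age)) / (max(age) - min(age));

% Z-score normalization for comparison: v' = (v - mean) / std,
% using the sample standard deviation (n - 1 in the denominator).
age_z = (age - mean(age)) / std(age);

% The smallest age maps to 0 and the largest to 1 under min-max,
% whereas z-scores are centered on 0 and can be negative.
disp([age(:) age_minmax(:) age_z(:)]);
```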

(e) Calculate the correlation coefficient (Pearson’s product moment coefficient).


Are these two variables positively or negatively correlated?

Age values
∑age = 836
Mean (M_age) = 46.444
∑(age - M_age)^2 = SS_age = 2970.444

%fat values
∑fat = 518.1
Mean (M_fat) = 28.783
∑(fat - M_fat)^2 = SS_fat = 1455.945

Age and %fat combined
N = 18
∑(age - M_age)(fat - M_fat) = 1700.333

Correlation calculation
r = ∑((age - M_age)(fat - M_fat)) / √((SS_age)(SS_fat))

r = 1700.333 / √((2970.444)(1455.945)) = 0.8176

Meta Numeric (cross-check)


correlation coefficient = 0.8176

The correlation coefficient between age and %fat is 0.8176. This is a strong
positive correlation: higher ages tend to go with higher %fat values (and vice
versa).
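
The hand calculation can be reproduced in a few lines of MATLAB. This is a minimal sketch with illustrative variable names; it computes the sums of squares and the sum of cross-products directly, then compares the result with the built-in `corrcoef`.

```matlab
age = [23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61];
fat = [9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 ...
       28.8 33.4 30.2 34.1 32.9 41.2 35.7];

% Deviations from the respective means.
d_age = age - mean(age);
d_fat = fat - mean(fat);

% Sums of squares and sum of cross-products.
ss_age = sum(d_age .^ 2);
ss_fat = sum(d_fat .^ 2);
sp     = sum(d_age .* d_fat);

% Pearson's product-moment coefficient.
r = sp / sqrt(ss_age * ss_fat);

% Cross-check against the built-in function (off-diagonal entry).
R = corrcoef(age, fat);
fprintf('r = %.4f, corrcoef = %.4f\n', r, R(1, 2));
```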

(f) Smooth the fat data by bin means, using a bin depth of 6.
Bin1: [19.1167 19.1167 19.1167 19.1167 19.1167 19.1167]
Bin2: [30.3167 30.3167 30.3167 30.3167 30.3167 30.3167]
Bin3: [36.9167 36.9167 36.9167 36.9167 36.9167 36.9167]

(g) Smooth the fat data by bin boundaries, using a bin depth of 6.
Bin1: [7.8000 7.8000 27.2000 27.2000 27.2000 27.2000]
Bin2: [27.4000 27.4000 32.9000 32.9000 32.9000 32.9000]
Bin3: [33.4000 33.4000 33.4000 33.4000 42.5000 42.5000]
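
Smoothing by bin boundaries can be sketched as follows: sort the values, split them into equal-depth bins, and replace each value by whichever bin boundary (minimum or maximum) is closer. The tie-breaking rule (ties go to the lower boundary) is an assumption of this sketch; no ties occur in this data set.

```matlab
fat = [9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 ...
       28.8 33.4 30.2 34.1 32.9 41.2 35.7];
depth = 6;

% Sort, then arrange as one row per equal-depth bin (3 bins of 6 values).
sorted_fat = sort(fat);
bins = reshape(sorted_fat, depth, []).';

% Lower and upper boundary of each bin, repeated across the row.
lo = repmat(bins(:, 1),   1, depth);
hi = repmat(bins(:, end), 1, depth);

% Replace each value by the closer boundary
% (assumption: ties go to the lower boundary).
use_hi = (hi - bins) < (bins - lo);
smoothed = lo;
smoothed(use_hi) = hi(use_hi);
disp(smoothed);
```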

3. Assume that a great number of emails are stored in a database. Now you would like
to build a data warehouse for these emails to facilitate data analysis or data mining
later on. Please give a design for the email data warehouse by using a schema
diagram. In addition, please provide your design for the fact table(s) and dimension
tables.

For our design we use a 3-tier architecture for the data warehouse and a star
schema for the fact table and dimension tables. The conceptual design, the schema,
and the tables are given below.

Star Schema

Fact Table: ID, Email_ID, Conversation_ID, Date

Contact Table: Email_ID, Conversation_ID, Address, Status, Contact Details
Conversation Table: Email_ID, Email_Subject, Email_status, Email_direction
Email Table: Email_ID, Email_Created, Email_edited, Email_deleted

MATLAB code for Question 2, parts (a), (b), (c), (d), and (f)

% Input data: age in years and body fat in percent for 18 adults
age = [23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61];
fat = [9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 ...
       28.8 33.4 30.2 34.1 32.9 41.2 35.7];

% (d) Min-max normalization of age to [0, 1]
% (normalize(age) without a method argument returns z-scores instead)
n_age = normalize(age, 'range');

% (a) Mean, median, and sample standard deviation
mean_age = mean(age, 2);
median_age = median(age, 2);
std_age = std(age);

mean_fat = mean(fat, 2);
median_fat = median(fat, 2);
std_fat = std(fat);

% (b) Boxplots for age and %fat
figure(1)
boxchart(age)
ylabel('Age (years)')
title('Age');

figure(2)
boxchart(fat)
ylabel('Fat (%)')
title('Fats');

% (c) Scatter plot of %fat against age
figure(3)
scatter(age, fat)
xlabel('age')
ylabel('fats')
title('Scatter Plot');

% Sort the data and divide it into bins with a depth of 6
sort_fat = sort(fat);
b1 = sort_fat(1:6);
b2 = sort_fat(7:12);
b3 = sort_fat(13:18);

% (f) Smooth the fat data by bin means, using a bin depth of 6
mn1(1:6) = mean(b1);
mn2(1:6) = mean(b2);
mn3(1:6) = mean(b3);
