
DATA ENGINEERING LABORATORY

ACHARYA NAGARJUNA UNIVERSITY


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
CHALAPATHI INSTITUTE OF ENGINEERING AND TECHNOLOGY

IV/IV B. Tech II SEM, 2021-22

MASTER LAB MANUAL


For

DATA ENGINEERING LAB

Prepared by
Sk. John Sydulu, Assistant Professor
T. Lavanya, Assistant Professor

CHALAPATHI INSTITUTE OF ENGINEERING AND TECHNOLOGY


CHALAPATHI NAGAR, LAM, GUNTUR-522034.

COLLEGE VISION:

To emerge as an Institute of Excellence for Engineering and Technology and provide world-class education and research opportunities to the students, catering to the needs of society.

COLLEGE MISSION:

Establishing a state-of-the-art Engineering Institute with continuously improving infrastructure and producing students with innovative skills and a global outlook.

DEPARTMENT VISION:

To produce professionally competent, research-oriented and socially sensitive engineers and technocrats in the emerging technologies.

DEPARTMENT MISSION:

DM1: Provide state-of-the-art laboratories to meet the needs of continuous change.
DM2: Provide a research environment to address societal issues.
DM3: Facilitate collaborations/MOUs in emerging technologies.


PEO’s:

PEO-1: Graduates of the computer science program will aim for a successful professional career and actively engage in applying new ideas/technologies as the field evolves.
PEO-2: Graduates can analyze real-life problems and design computing solutions by applying computer engineering theory and practice.
PEO-3: Graduates shall pursue higher studies or research through quality education.

PO’s:

1. ENGINEERING KNOWLEDGE: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.

2. PROBLEM ANALYSIS: Identify, formulate, research literature, and analyze complex engineering problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.

3. DESIGN/DEVELOPMENT OF SOLUTIONS: Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for public health and safety, and cultural, societal, and environmental considerations.

4. CONDUCT INVESTIGATIONS OF COMPLEX PROBLEMS: Use research-based knowledge and research methods including design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions.

5. MODERN TOOL USAGE: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modelling to complex engineering
activities with an understanding of the limitations.

6. THE ENGINEER AND SOCIETY: Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional engineering practice.

7. ENVIRONMENT AND SUSTAINABILITY: Understand the impact of the professional
engineering solutions in societal and environmental contexts, and demonstrate the knowledge
of, and need for sustainable development.

8. ETHICS: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.

9. INDIVIDUAL AND TEAM WORK: Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.

10. COMMUNICATION: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.

11. PROJECT MANAGEMENT AND FINANCE: Demonstrate knowledge and understanding of the engineering and management principles and apply these to one's own work, as a member and leader in a team, to manage projects in multidisciplinary environments.

12. LIFE-LONG LEARNING: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological change.

PSO’s :

PSO1: Professional Skills: The ability to understand, analyze and develop computer programs
in the areas related to algorithms, system software, multimedia, web design, big data analytics,
and networking for efficient design of computer-based systems of varying complexity.

PSO2: Problem-Solving Skills: The ability to apply standard practices and strategies in software
project development using open-ended programming environments to deliver a quality product
for business success.

PSO3: Successful Career and Entrepreneurship: The ability to employ modern computer
languages, environments, and platforms in creating innovative career paths to be an
entrepreneur, and a zest for higher studies.


DATA ENGINEERING LAB OBJECTIVES:

1. Practical exposure to the implementation of well-known data mining tasks.

2. Exposure to real life data sets for analysis and prediction.

3. Learning performance evaluation of data mining algorithms in a supervised and an unsupervised setting.

4. Handling a small data mining project for a given practical domain.

DATA ENGINEERING LAB OUTCOMES:

1. Understand the data mining process and important issues around data cleaning, pre-processing and integration.

2. Apply the principal algorithms and techniques used in data mining, such as clustering, association mining, classification and prediction.

3. Demonstrate understanding of the functionality of the various web mining and web
search components and appreciate the strengths and limitations of various web mining and
web search models.

4. Able to use the tools and techniques employed in data mining for different application
domains.

5. Describe different types of research and understand alternative research paradigms.


Mapping of CO's and PO's

POs:   1  2  3  4  5  6  7  8  9  10  11  12
CO 1:  3  2  3
CO 2:  2  2  2  2
CO 3:  3  2  3
CO 4:  2  2  1
CO 5:  1  2  3

3 - HIGH   2 - MEDIUM   1 - LOW


SYLLABUS AS PER UNIVERSITY

Expt. No List of Experiments

1 Rollup And Cube Operations On The Following Tables

2 Cube Slicing- Come With 2-D View Data

3 Drill-down or Roll-down going from summary to more detailed data

4 Rollup - summarize data along a dimension hierarchy

5 Dicing – project 2-D view of data

6 Creating Star Schema and Snowflake Schema

7 Creating Fact Table

Additional Experiments

1 Write a Program to implement Apriori algorithm using WEKA

2 Write a program to implement FP Growth using WEKA

3 Write a program to implement DECISION TREE using WEKA


Expt. No  Experiment Name                                                    CO's attained  PO's attained

1   Rollup and Cube operations on the following tables                       CO1, CO2   PO3, PO4
2   Cube slicing - come with 2-D view data                                   CO1, CO2   PO3, PO4
3   Drill-down or Roll-down going from summary to more detailed data         CO1, CO2   PO3, PO4
4   Rollup - summarize data along a dimension hierarchy                      CO1, CO2   PO3, PO4
5   Dicing - project 2-D view of data                                        CO1, CO2   PO3, PO4
6   Creating Star Schema and Snowflake Schema                                CO4        PO3, PO4
7   Creating Fact Table                                                      CO3, CO4   PO3, PO4

Additional Experiments

1   Write a program to implement Apriori algorithm using WEKA                CO3, CO5   PO4, PO5
2   Write a program to implement FP Growth using WEKA                        CO3, CO5   PO4, PO5
3   Write a program to implement Decision Tree using WEKA                    CO3, CO5   PO4, PO5

INDEX

S.NO  CONTENT                                                              Page No

I     VISION AND MISSION                                                   2
II    PEO's, PO's & PSO's                                                  3
III   COURSE OBJECTIVES AND OUTCOMES                                       5
IV    MAPPING OF CO'S AND PO'S                                             5
V     SYLLABUS                                                             6
1     ROLLUP AND CUBE OPERATIONS ON THE FOLLOWING TABLES                   9
2     CUBE SLICING - COME WITH 2-D VIEW DATA                               11
3     DRILL-DOWN OR ROLL-DOWN GOING FROM SUMMARY TO MORE DETAILED DATA     13
4     ROLLUP - SUMMARIZE DATA ALONG A DIMENSION HIERARCHY                  15
5     DICING - PROJECT 2-D VIEW OF DATA                                    17
6     CREATING STAR SCHEMA AND SNOWFLAKE SCHEMA                            23
7     CREATING FACT TABLE                                                  29
      ADDITIONAL PROGRAMS
1     WRITE A PROGRAM TO IMPLEMENT APRIORI ALGORITHM USING WEKA            30
2     WRITE A PROGRAM TO IMPLEMENT FP GROWTH USING WEKA                    32
3     WRITE A PROGRAM TO IMPLEMENT DECISION TREE USING WEKA                33
VI    VIVA QUESTIONS                                                       35


EXPERIMENT 1:
AIM: Implement ROLLUP and CUBE operations.

ROLLUP AND CUBE OPERATIONS ON THE FOLLOWING TABLES

ALGORITHM:
STEP 1: CREATE A TABLE AND STORE THE DATA IN THE TABLE.
STEP 2: WRITE A CONTROL FILE WITH THE EXTENSION .CTL.
STEP 3: CREATE AND WRITE FILENAME.CSV WITH THE DATA.
STEP 4: AT THE DOS COMMAND PROMPT, EXECUTE THE SQL*LOADER COMMAND.
STEP 5: AT THE SQL COMMAND PROMPT, EXECUTE THE QUERY.

COLUMN NAME DATA TYPE SIZE


PET_TYPE VARCHAR2 15
STORE VARCHAR2 15
NO NUMBER 15
SQL>CREATE TABLE PETS(PET_TYPE VARCHAR2(8),STORE VARCHAR2(8),NO
NUMBER(4));
PET_TYPE STORE NO
CAT MIAMI 18
DOG MIAMI 12

DOG TAMPA 14
TURTLE TAMPA 4

DOG NAPLES 5
TURTLE NAPLES 1
LOAD DATA
INFILE 'Z:\DELAB\EXPT2\PETS.CSV'
INTO TABLE PETS
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"'
(PET_TYPE,STORE,NO)

CAT,MIAMI,18
DOG,MIAMI,16
DOG, yDFGY,17

Z:\>CD DELAB
Z:\DELAB>CD EXPT2
Z:\DELAB\EXPT2>SQLLDR CSUSER16/CSUSER16@//10.60.70.80/ORCL CONTROL=LOADER.CTL

SQL>SELECT * FROM PETS;

SQL>SELECT PET_TYPE,STORE,SUM(NO) FROM PETS GROUP BY CUBE(PET_TYPE,STORE);
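The experiment title also calls for ROLLUP. A matching ROLLUP query on the same table is sketched below (not part of the prescribed procedure; unlike CUBE it produces subtotals only along the listed hierarchy and the grand total, omitting the store-only subtotals):

-- ROLLUP: subtotals per PET_TYPE and a grand total, but no STORE-only rows
SQL>SELECT PET_TYPE,STORE,SUM(NO) FROM PETS GROUP BY ROLLUP(PET_TYPE,STORE);

Comparing its output with the CUBE output above shows exactly which subtotal rows CUBE adds.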


EXPERIMENT 2: CUBE SLICING - COME WITH 2-D VIEW DATA

AIM: Implement cube slicing.

Slice
A slice is a subset of a multidimensional array corresponding to a single value for one or more members of the dimensions not in the subset. The goal is to constrain the space requirements of the dynamic data cube to a fraction of the full data cube size by deleting unnecessary data.

ALGORITHM:

STEP 1: SELECT ITEM_TYPE, PURCHASE_DATE, SOLD_QTY.
STEP 2: SELECT BRANCH_CITY, PURCHASE_DATE, SOLD_QTY.
STEP 3: EXECUTE THE QUERY IN SQL COMMAND GIVEN BELOW.
2-dimensional cuboid:
a)
SELECT item.item_type ,purchases.date1,sum(items_sold.qty) FROM
item,purchases,items_sold
WHERE item.item_id=items_sold.item_id AND
purchases.trans_id=items_sold.trans_id GROUP BY
item.item_type,purchases.date1;
Output:
ITEM_TYPE DATE1 SUM(ITEMS_SOLD.QTY)

COMPUTER 05-MAR-07 3

TV 05-MAR-07 1

HAP 03-JAN-07 8

COMPUTER 03-JAN-07 7

TV 08-APR-09 3

TV 20-JUN-06 6

COMPUTER 05-MAR-09 3

HAP 01-FEB-09 9

TV 01-FEB-09 3

HAP 20-JUN-08 1

HAP 01-FEB-06 5

TV 20-JUN-09 3

COMPUTER 20-JUN-06 5

HAP 01-FEB-08 8

HAP 05-MAR-09 6

COMPUTER 03-JAN-09 2

COMPUTER 20-JUN-08 2

TV 20-JUN-08 4
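A slice in the strict sense holds one dimension at a single member. As an illustrative variation of the query above (the member 'TV' is chosen arbitrarily from the sample data; this is a sketch, not part of the prescribed procedure):

-- Slicing: fix item_type at a single value to obtain a view
-- over the remaining date dimension
SQL>SELECT purchases.date1,sum(items_sold.qty) FROM
item,purchases,items_sold
WHERE item.item_id=items_sold.item_id AND
purchases.trans_id=items_sold.trans_id AND
item.item_type='TV'
GROUP BY purchases.date1;

The result is the 'TV' slice of the cube: one total per purchase date.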

Experiment 3: Drill-down or Roll-down going from summary to more detailed data

AIM: Implement drill-down on cube data.

ALGORITHM:
STEP 1: CREATE THE TABLES ITEM, BRANCH, PURCHASES, ITEMS_SOLD AND WORKS_AT.
STEP 2: JOIN ON ITEM.ITEM_ID = ITEMS_SOLD.ITEM_ID AND BRANCH.BRANCH_ID = WORKS_AT.BRANCH_ID.
STEP 3: JOIN ON PURCHASES.TRANS_ID = ITEMS_SOLD.TRANS_ID.
STEP 4: EXECUTE THE QUERY IN SQL COMMAND GIVEN BELOW.

Solution for expt3:

3-dimensional cuboid:

SELECT
item.item_type, branch.branch_city, purchases.date1, sum(items_sold.qty)
FROM item, branch, purchases, items_sold, works_at
WHERE item.item_id=items_sold.item_id AND
branch.branch_id=works_at.branch_id AND
purchases.trans_id=items_sold.trans_id
GROUP BY item.item_type,branch.branch_city,purchases.date1;

Output:

ITEM_TYPE BRANCH_CITY DATE1 SUM(ITEMS_SOLD.QTY)

COMPUTER CHENNAI 03-JAN-06 4

TV CHENNAI 05-MAR-07 2

COMPUTER BNG 05-MAR-07 6

TV HYD 05-MAR-08 8

TV BNG 05-MAR-08 4

HAP CHENNAI 05-MAR-08 14

TV GUNTUR 08-APR-09 6

COMPUTER HYD 08-APR-09 8

HAP GUNTUR 08-APR-09 16

TV HYD 20-JUN-06 24

TV BNG 20-JUN-06 12

TV GUNTUR 20-JUN-06 12

HAP GUNTUR 20-JUN-06 4

TV CHENNAI 03-JAN-08 8

TV BNG 01-FEB-08 8

COMPUTER CHENNAI 01-FEB-08 10

240 rows returned
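Drill-down means moving from a coarser aggregation to a finer one. For contrast with the 3-dimensional cuboid above, the corresponding summary level can be sketched as follows (it aggregates away branch_city and date1; drilling down then re-introduces those dimensions):

-- Summary level: one total per item_type only
SQL>SELECT item.item_type,sum(items_sold.qty) FROM item,items_sold
WHERE item.item_id=items_sold.item_id GROUP BY item.item_type;

Each item_type total in this summary is split across branch_city and date1 by the 3-dimensional query above.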

Experiment 4: Rollup - summarize data along a dimension hierarchy
AIM: Implement rollup operations.

ALGORITHM:
STEP 1: SELECT ITEM_TYPE AND SOLD QTY FROM THE ITEM TABLES CREATED.
STEP 2: SELECT PURCHASE_DATE AND SOLD QTY FROM PURCHASES.
STEP 3: SELECT CUST_ID FROM PURCHASES.
STEP 4: SELECT THE SUM OF QTY FROM ITEMS_SOLD.
STEP 5: EXECUTE THE QUERY IN SQL COMMAND GIVEN BELOW.

Solution for expt4:

1-dimensional
i) SELECT item.item_type ,sum(items_sold.qty) FROM item,items_sold
WHERE item.item_id=items_sold.item_id GROUP BY item.item_type;

Output:
ITEM_TYPE SUM(ITEMS_SOLD.QTY)

COMPUTER 65

TV 70

HAP 105

3 rows returned

ii) SELECT purchases.date1,sum(items_sold.qty) FROM purchases, items_sold
WHERE purchases.trans_id=items_sold.trans_id GROUP BY purchases.date1;

Output:
DATE1 SUM(ITEMS_SOLD.QTY)

03-JAN-06 6

05-MAR-09 18

08-APR-07 11

20-JUN-06 13

20-JUN-07 6

20-JUN-08 7

01-FEB-06 11

03-JAN-09 6
03-JAN-07 20

01-FEB-07 12

08-APR-09 13

08-APR-06 14

01-FEB-09 14

05-MAR-08 13

05-MAR-07 10

08-APR-08 11

20-JUN-09 6

03-JAN-08 19

01-FEB-08 17

05-MAR-06 13

20 rows returned

iii) SELECT purchases.cust_id,sum(items_sold.qty) FROM purchases,items_sold
WHERE purchases.trans_id=items_sold.trans_id GROUP BY purchases.cust_id;

Output:

CUST_ID SUM(ITEMS_SOLD.QTY)

C4 49

C5 32

C2 54

C1 51

C3 54

5 rows returned

ApexCuboid

SELECT sum(items_sold.qty) FROM items_sold;

Output:
SUM(ITEMS_SOLD.QTY)

240
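The 1-dimensional cuboids and the apex cuboid can also be produced in a single statement with ROLLUP. A sketch on the item_type hierarchy (the row with a NULL item_type is the grand total):

-- ROLLUP: per-item_type subtotals plus the apex grand total in one query
SQL>SELECT item.item_type,sum(items_sold.qty) FROM item,items_sold
WHERE item.item_id=items_sold.item_id
GROUP BY ROLLUP(item.item_type);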


Experiment 5: Dicing - project 2-D view of data

AIM: Implement dicing operations.

ALGORITHM:
STEP 1: SELECT ITEM_TYPE, PURCHASE_DATE, SOLD_QTY.
STEP 2: SELECT BRANCH_CITY, PURCHASE_DATE, SOLD_QTY.
STEP 3: EXECUTE THE QUERY IN SQL COMMAND GIVEN BELOW.
2-dimensional cuboid:

a)
SELECT item.item_type ,purchases.date1,sum(items_sold.qty) FROM
item,purchases,items_sold
WHERE item.item_id=items_sold.item_id AND
purchases.trans_id=items_sold.trans_id GROUP BY
item.item_type,purchases.date1;

Output:
ITEM_TYPE DATE1 SUM(ITEMS_SOLD.QTY)

COMPUTER 05-MAR-07 3

TV 05-MAR-07 1

HAP 03-JAN-07 8

COMPUTER 03-JAN-07 7

TV 08-APR-09 3

TV 20-JUN-06 6

COMPUTER 05-MAR-09 3

HAP 01-FEB-09 9

TV 01-FEB-09 3

HAP 20-JUN-08 1

HAP 01-FEB-06 5

TV 20-JUN-09 3

COMPUTER 20-JUN-06 5

HAP 01-FEB-08 8

HAP 05-MAR-09 6

COMPUTER 03-JAN-09 2

COMPUTER 20-JUN-08 2

TV 20-JUN-08 4

TV 01-FEB-07 2

TV 05-MAR-08 2

HAP 03-JAN-08 8

TV 03-JAN-08 4

COMPUTER 08-APR-06 2

60 rows returned
b)
SELECT branch.branch_city,purchases.date1,sum(items_sold.qty) FROM
branch,purchases,items_sold,works_at
WHERE branch.branch_id=works_at.branch_id AND
purchases.trans_id=items_sold.trans_id GROUP BY
branch.branch_city,purchases.date1;

Output:

BRANCH_CITY DATE1 SUM(ITEMS_SOLD.QTY)

GUNTUR 05-MAR-08 26

CHENNAI 08-APR-07 22

CHENNAI 08-APR-06 28

CHENNAI 08-APR-08 22

CHENNAI 03-JAN-06 12

HYD 01-FEB-07 48

BNG 08-APR-06 28

BNG 20-JUN-09 12

GUNTUR 20-JUN-08 14

GUNTUR 03-JAN-09 12

GUNTUR 20-JUN-07 12

GUNTUR 08-APR-06 28

GUNTUR 20-JUN-06 26

GUNTUR 03-JAN-07 40

GUNTUR 05-MAR-07 20

CHENNAI 08-APR-09 26

CHENNAI 05-MAR-08 26

CHENNAI 03-JAN-07 40

CHENNAI 05-MAR-07 20

CHENNAI 01-FEB-06 22

HYD 20-JUN-08 28

HYD 05-MAR-06 52

HYD 03-JAN-09 24

HYD 05-MAR-08 52

HYD 08-APR-08 44

HYD 03-JAN-06 24

BNG 20-JUN-08 14

BNG 08-APR-07 22

80 rows returned
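A dice in the strict sense selects a sub-cube by constraining two or more dimensions to subsets of their values. As an illustrative variation of the queries above (the item types and date range are chosen arbitrarily from the sample data; the date literals rely on the session date format, as in the insert statements below):

-- Dicing: restrict item_type to a subset and date1 to a range
SQL>SELECT item.item_type,purchases.date1,sum(items_sold.qty) FROM
item,purchases,items_sold
WHERE item.item_id=items_sold.item_id AND
purchases.trans_id=items_sold.trans_id AND
item.item_type IN ('TV','COMPUTER') AND
purchases.date1 BETWEEN '01-JAN-07' AND '31-DEC-08'
GROUP BY item.item_type,purchases.date1;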

Run the following schema for experiments 3 to 5:

create table customer(cust_id varchar2(20), cust_name varchar2(20), cust_city varchar2(20), cust_state varchar2(20), cust_country varchar2(20), cust_age number(3), cust_income number(9,3), primary key(cust_id));

insert into customer values('C1','MANIDEEP','GUNTUR','AP','INDIA',23,35000);
insert into customer values('C2','MADHU','ONGOLE','AP','INDIA',23,40000);
insert into customer values('C3','ARUNBABU','GUNTUR','AP','INDIA',23,26000);
insert into customer values('C4','RAKESH','BENGALORE','KARNATAKA','INDIA',24,25000);
insert into customer values('C5','SHIRAJ','CHENNAI','TN','INDIA',25,38000);

create table item(item_id varchar2(20), item_name varchar2(20), item_brand varchar2(20), item_type varchar2(20), primary key (item_id));

insert into item values('I1','HDTV','SAMSUNG','TV');
insert into item values('I2','LAPTOP','DELL','COMPUTER');
insert into item values('I3','MICROWAVE OVEN','LG','HAP');

create table employee(emp_id varchar2(20), emp_name varchar2(20), emp_category varchar2(30), primary key(emp_id));

insert into employee values('E1','JOHN','HOMEENTERTAIN');
insert into employee values('E2','SMITH','ELECTRONICS');
insert into employee values('E3','MILLER','ELECTRONICS');
insert into employee values('E4','SCOTT','HOUSEELECTRONICS');
insert into employee values('E5','KEVIN','AUTOMOBILE');
insert into employee values('E6','WARNE','HOMEENTERTAIN');
insert into employee values('E7','WATSON','ELECTRONICS');
insert into employee values('E8','HAYES','ELECTRONICS');
insert into employee values('E9','RODES','HOUSEELECTRONICS');
insert into employee values('E10','PETER','AUTOMOBILE');

create table branch(branch_id varchar2(20), branch_name varchar2(20), branch_city varchar2(20), branch_state varchar2(20), branch_country varchar2(20), primary key(branch_id));

insert into branch values('B1','CITYSQ','GUNTUR','AP','INDIA');
insert into branch values('B2','POTHIES','CHENNAI','TN','INDIA');
insert into branch values('B3','CMR','HYD','AP','INDIA');
insert into branch values('B4','MCM','BNG','KTK','INDIA');
insert into branch values('B5','GLAND','HYD','AP','INDIA');

create table purchases(trans_id varchar2(20), cust_id varchar2(20), emp_id varchar2(20), date1 date, primary key(trans_id), foreign key (cust_id) references customer(cust_id), foreign key (emp_id) references employee(emp_id));

insert into purchases values('T100','C1','E1','03-JAN-06');


insert into purchases values('T101','C2','E2','01-FEB-06');
insert into purchases values('T102','C3','E3','05-MAR-07');
insert into purchases values('T103','C4','E4','08-APR-08');
insert into purchases values('T104','C5','E5','20-JUN-09');
insert into purchases values('T105','C1','E6','03-JAN-07');
insert into purchases values('T106','C2','E7','01-FEB-07');
insert into purchases values('T107','C3','E8','05-MAR-08');
insert into purchases values('T108','C4','E9','08-APR-09');
insert into purchases values('T109','C5','E10','20-JUN-06');
insert into purchases values('T110','C1','E6','03-JAN-08');
insert into purchases values('T111','C2','E7','01-FEB-08');
insert into purchases values('T112','C3','E8','05-MAR-09');
insert into purchases values('T113','C4','E9','08-APR-06');
insert into purchases values('T114','C5','E10','20-JUN-07');
insert into purchases values('T115','C1','E6','03-JAN-09');
insert into purchases values('T116','C2','E7','01-FEB-09');
insert into purchases values('T117','C3','E8','05-MAR-06');
insert into purchases values('T118','C4','E9','08-APR-07');
insert into purchases values('T119','C5','E10','20-JUN-08');

create table items_sold(trans_id varchar2(20), item_id varchar2(20), qty number(10), foreign key (trans_id) references purchases(trans_id), foreign key (item_id) references item(item_id));

insert into items_sold values('T100','I1',1);


insert into items_sold values('T100','I2',2);
insert into items_sold values('T100','I3',3);
insert into items_sold values('T101','I1',2);
insert into items_sold values('T101','I2',4);
insert into items_sold values('T101','I3',5);
insert into items_sold values('T102','I1',1);
insert into items_sold values('T102','I2',3);
insert into items_sold values('T102','I3',6);
insert into items_sold values('T103','I1',2);
insert into items_sold values('T103','I2',4);
insert into items_sold values('T103','I3',5);
insert into items_sold values('T104','I1',3);
insert into items_sold values('T104','I2',2);
insert into items_sold values('T104','I3',1);
insert into items_sold values('T105','I1',5);
insert into items_sold values('T105','I2',7);
insert into items_sold values('T105','I3',8);
insert into items_sold values('T106','I1',2);
insert into items_sold values('T106','I2',3);
insert into items_sold values('T106','I3',7);
insert into items_sold values('T107','I1',2);
insert into items_sold values('T107','I2',4);
insert into items_sold values('T107','I3',7);
insert into items_sold values('T108','I1',3);
insert into items_sold values('T108','I2',2);
insert into items_sold values('T108','I3',8);
insert into items_sold values('T109','I1',6);
insert into items_sold values('T109','I2',5);
insert into items_sold values('T109','I3',2);
insert into items_sold values('T110','I1',4);
insert into items_sold values('T110','I2',7);
insert into items_sold values('T110','I3',8);
insert into items_sold values('T111','I1',4);
insert into items_sold values('T111','I2',5);
insert into items_sold values('T111','I3',8);
insert into items_sold values('T112','I1',9);
insert into items_sold values('T112','I2',3);
insert into items_sold values('T112','I3',6);
insert into items_sold values('T113','I1',7);
insert into items_sold values('T113','I2',2);
insert into items_sold values('T113','I3',5);
insert into items_sold values('T114','I1',3);
insert into items_sold values('T114','I2',2);
insert into items_sold values('T114','I3',1);
insert into items_sold values('T115','I1',3);
insert into items_sold values('T115','I2',2);
insert into items_sold values('T115','I3',1);
insert into items_sold values('T116','I1',3);
insert into items_sold values('T116','I2',2);
insert into items_sold values('T116','I3',9);
insert into items_sold values('T117','I1',3);

insert into items_sold values('T117','I2',2);
insert into items_sold values('T117','I3',8);
insert into items_sold values('T118','I1',3);
insert into items_sold values('T118','I2',2);
insert into items_sold values('T118','I3',6);
insert into items_sold values('T119','I1',4);
insert into items_sold values('T119','I2',2);
insert into items_sold values('T119','I3',1);

create table works_at(empl_id varchar2(20), branch_id varchar2(20), foreign key (empl_id) references employee(emp_id), foreign key (branch_id) references branch(branch_id));

insert into works_at values('E1','B1');


insert into works_at values('E2','B2');
insert into works_at values('E3','B3');
insert into works_at values('E4','B4');
insert into works_at values('E5','B5');
insert into works_at values('E6','B1');
insert into works_at values('E7','B2');
insert into works_at values('E8','B3');
insert into works_at values('E9','B4');
insert into works_at values('E10','B5');

Experiment 6: Creating Star Schema and Snowflake Schema.
AIM: Implementation of star and snowflake schemas.
Theory:
Schema Modeling Techniques:
Schemas in Data Warehouses
1. Third Normal Form
2. Star Schemas
3. Optimizing Star Queries

A schema is a collection of database objects, including tables, views, indexes, and synonyms.

Star Schemas

The star schema is the simplest data warehouse schema. It is called a star schema
because the entity-relationship diagram of this schema resembles a star, with points radiating
from a central table. The center of the star consists of a large fact table and the points of the
star are the dimension tables.

A star schema is characterized by one or more very large fact tables that contain the
primary information in the data warehouse, and a number of much smaller dimension tables
(or lookup tables), each of which contains information about the entries for a particular
attribute in the fact table.

A star query is a join between a fact table and a number of dimension tables. Each
dimension table is joined to the fact table using a primary key to foreign key join, but the
dimension tables are not joined to each other. The cost-based optimizer recognizes star
queries and generates efficient execution plans for them.

A typical fact table contains keys and measures. For example, in the sh sample
schema, the fact table, sales, contain the measures quantity_sold, amount, and cost, and
the keys cust_id, time_id, prod_id, channel_id, and promo_id. The dimension tables are
customers, times, products, channels, and promotions. The product dimension table, for
example, contains information about each product number that appears in the fact table.

A star join is a primary key to foreign key join of the dimension tables to a fact table.
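The structure described above can be sketched with a minimal example (table and column names here are illustrative, not the actual sh sample schema):

create table dim_product(prod_id number primary key, prod_name varchar2(30));
create table dim_time(time_id number primary key, day1 date);
create table fact_sales(
  prod_id number references dim_product(prod_id),
  time_id number references dim_time(time_id),
  quantity_sold number,
  amount number(9,2));

-- A star query: the fact table is joined to each dimension,
-- but the dimensions are never joined to each other
select p.prod_name, t.day1, sum(f.amount)
from fact_sales f, dim_product p, dim_time t
where f.prod_id = p.prod_id and f.time_id = t.time_id
group by p.prod_name, t.day1;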

The main advantages of star schemas are that they:


1. Provide a direct and intuitive mapping between the business entities being analyzed
by end users and the schema design.
2. Provide highly optimized performance for typical star queries.
3. Are widely supported by a large number of business intelligence tools, which may anticipate or even require that the data-warehouse schema contain dimension tables.

Star schemas are used for both simple data marts and very large data warehouses.

Figure: a graphical representation of a star schema.
Snowflake Schemas

The snowflake schema is a more complex data warehouse model than a star schema, and is a type of star schema. It is called a snowflake schema because the diagram of the schema resembles a snowflake.

Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been grouped into multiple tables instead of one large table.

For example, a product dimension table in a star schema might be normalized into a products table, a product_category table, and a product_manufacturer table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance.

Figure: Snowflake Schema

Note: Oracle Corporation recommends you choose a star schema over a snowflake schema unless you have a clear reason not to.
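The product example above can be sketched as follows (illustrative names; only the category level of the normalization is shown):

-- Snowflaking the product dimension: the category attribute
-- moves into its own table
create table product_category(category_id number primary key, category_name varchar2(30));
create table products(
  prod_id number primary key,
  prod_name varchar2(30),
  category_id number references product_category(category_id));

-- Each normalized level now costs one extra foreign key join in queries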

Experiment 7: Creating a Fact Table

AIM: Implementation of Fact Table

Theory:

Fact Tables

A fact table typically has two types of columns: those that contain numeric facts
(often called measurements), and those that are foreign keys to dimension tables.
A fact table contains either detail-level facts or facts that have been aggregated.
Fact tables that contain aggregated facts are often called summary tables. A fact table
usually contains facts with the same level of aggregation.
Though most facts are additive, they can also be semi-additive or non-additive.
Additive facts can be aggregated by simple arithmetical addition. A common example of this
is sales. Non-additive facts cannot be added at all. An example of this is averages. Semi-
additive facts can be aggregated along some of the dimensions and not along others. An
example of this is inventory levels, where you cannot tell what a level means simply by
looking at it.
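The additive versus non-additive distinction can be shown with a small plain-Python sketch (the sales figures below are hypothetical):

```python
# Hypothetical detail-level fact rows: (region, product, sales_amount)
facts = [
    ("East", "Rice", 100), ("East", "Wheat", 300),
    ("West", "Rice", 200), ("West", "Wheat", 400), ("West", "Maize", 600),
]

# Sales is additive: rolling up along any dimension is plain addition,
# and totals of subtotals agree with the total of the detail rows.
by_region = {}
for region, _, amount in facts:
    by_region[region] = by_region.get(region, 0) + amount
assert sum(by_region.values()) == sum(a for _, _, a in facts)  # 1600 both ways

# An average is non-additive: averaging per-region averages does not,
# in general, equal the overall average of the detail rows.
east_avg = by_region["East"] / 2      # 400 / 2 = 200
west_avg = by_region["West"] / 3      # 1200 / 3 = 400
avg_of_avgs = (east_avg + west_avg) / 2                     # 300.0
overall_avg = sum(a for _, _, a in facts) / len(facts)      # 320.0
print(avg_of_avgs, overall_avg)  # 300.0 320.0
```

This is why a fact table usually stores the additive detail (sums and counts) rather than pre-computed averages, which cannot be safely rolled up further.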

Creating a New Fact Table

You must define a fact table for each star schema. From a modeling standpoint, the
primary key of the fact table is usually a composite key that is made up of all of its foreign
keys.

The figure shows a common example of a sales fact table and its dimension tables: customers, products, promotions, times, and channels.


Experiment-1: Write a Program to implement Apriori algorithm using WEKA

Procedure:

Step 1: Click the WEKA icon; the WEKA GUI Chooser appears. Choose and load a dataset into WEKA.

Step 2: Choose the Associate tab.

Step 3: Click Choose, select Apriori, then click Start.

Step 4: The output can be viewed in the Associator output frame.
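The steps above run WEKA's built-in Apriori associator. As a rough sketch of what the algorithm itself computes (a minimal pure-Python illustration, not WEKA's implementation), candidate itemsets are grown level by level and pruned using the Apriori property:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all itemsets whose support count is >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]   # level-1 candidates
    frequent = {}
    k = 1
    while current:
        # count support of each candidate in one pass over the data
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # candidate generation: join frequent k-itemsets into (k+1)-itemsets,
        # keeping only those whose every k-subset is frequent (Apriori property)
        k += 1
        keys = list(level)
        current, seen = [], set()
        for a, b in combinations(keys, 2):
            cand = a | b
            if len(cand) == k and cand not in seen and \
               all(frozenset(s) in level for s in combinations(cand, k - 1)):
                seen.add(cand)
                current.append(cand)
    return frequent

demo = [["milk", "bread"], ["milk", "bread", "butter"], ["bread", "butter"], ["milk"]]
result = apriori(demo, min_support=2)
print(result[frozenset({"milk", "bread"})])  # 2
```

The `demo` transactions are hypothetical; WEKA additionally derives association rules with confidence from these frequent itemsets.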

Data set:


Visual output:

Experiment-2: Write a program to implement FP-Growth using WEKA

Procedure:

Step 1: Click the WEKA icon; the WEKA GUI Chooser appears. Choose and load a dataset into WEKA.

Step 2: Select the Associate tab.

Step 3: Click Choose and select FPGrowth under associations.

Step 4: Click Start to produce the output.
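The steps above run WEKA's FPGrowth associator, which mines frequent itemsets using a compressed FP-tree. The following is a minimal pattern-growth sketch that recurses on projected (conditional) databases instead of building an explicit tree; it illustrates the divide-and-conquer idea, not WEKA's implementation:

```python
def frequent_itemsets(transactions, min_support):
    """Pattern-growth mining: recurse on conditional (projected) databases."""
    result = {}

    def grow(db, prefix):
        counts = {}
        for t in db:
            for item in t:
                counts[item] = counts.get(item, 0) + 1
        for item in sorted(counts):
            if counts[item] >= min_support:
                itemset = prefix | {item}
                result[itemset] = counts[item]
                # conditional database: for each transaction containing `item`,
                # keep only the items that sort after it (avoids double counting)
                cond = [t[t.index(item) + 1:] for t in db if item in t]
                grow(cond, itemset)

    # sort each transaction so "items after" is well defined
    grow([sorted(set(t)) for t in transactions], frozenset())
    return result

demo = [["milk", "bread"], ["milk", "bread", "butter"], ["bread", "butter"], ["milk"]]
print(frequent_itemsets(demo, 2)[frozenset({"bread", "butter"})])  # 2
```

Unlike Apriori, this approach never generates candidates that do not occur in the data; the FP-tree in WEKA adds compression on top of the same recursion.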

Data selection:

OUTPUT:

Experiment-3: Write a program to implement DECISION TREE using WEKA

Procedure:

Step 1: Click the WEKA icon; the WEKA GUI Chooser appears. Choose and load a dataset into WEKA.

Step 2: Choose the Classify tab.

Step 3: Select "Use training set" in the test options.

Step 4: Click Choose in the Classifier panel.

Step 5: The Choose dialog displays many classifier groups; select trees among them.

Step 6: Select J48 from the trees group.

Step 7: Click Start.

Step 8: Right-click the entry in the Result list.

Step 9: The Result list contains an entry such as "11:37:42 - trees.J48"; right-click it and select Visualize tree.
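J48 is WEKA's implementation of the C4.5 decision tree learner, which picks split attributes by how much they reduce class entropy. A minimal sketch of that computation (the weather rows are hypothetical, and this is an illustration, not WEKA's code):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from splitting `rows` on attribute index `attr`."""
    total = entropy(labels)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return total - remainder

# Tiny hypothetical weather data: (outlook, windy) -> play?
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "no"), ("rainy", "yes")]
labels = ["yes", "yes", "no", "no"]
print(information_gain(rows, labels, 0))  # outlook separates the classes: 1.0
print(information_gain(rows, labels, 1))  # windy tells us nothing: 0.0
```

The tree builder would split on outlook here (the highest-gain attribute) and then recurse on each branch until the leaves are pure.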

Data view:


Visual Output:

VIVA QUESTIONS

1. What is data warehouse?


A data warehouse is an electronic store of an organization's historical data for the purpose of reporting, analysis, and data mining or knowledge discovery.

2. What are the benefits of a data warehouse?

A data warehouse helps to integrate data and store it historically so that we can analyze different aspects of the business, including performance analysis, trends, prediction, etc., over a given time frame, and use the results of our analysis to improve the efficiency of business processes.

3. What is the difference between OLTP and OLAP?


OLTP is the transaction system that collects business data, whereas OLAP is the reporting and analysis system built on that data. OLTP systems are optimized for INSERT and UPDATE operations and are therefore highly normalized. On the other hand, OLAP systems are deliberately denormalized for fast data retrieval through SELECT operations.

4. What is data mart?


Data marts are generally designed for a single subject area. An organization may have data pertaining to different departments like Finance, HR, Marketing, etc. stored in the data warehouse, and each department may have a separate data mart. These data marts can be built on top of the data warehouse.

5. What is dimension?
A dimension is something that qualifies a quantity (measure).
For example, consider this: if I just say "20kg", it does not mean anything. But if I say, "20kg of Rice (product) is sold to Ramesh (customer) on 5th April (date)", then that makes meaningful sense. The product, customer, and date are dimensions that qualify the measure, 20kg. Dimensions are mutually independent. Technically speaking, a dimension is a data element that categorizes each item in a data set into non-overlapping regions.
6. What is Fact?
A fact is something that is quantifiable (Or measurable). Facts are typically (but not
always) numerical values that can be aggregated.

7. Briefly state different between data ware house & data mart?
Dataware house is made up of many datamarts. DWH contain many subject areas. but
data mart focuses on one subject area generally. e.g. If there will be DHW of bank then
there can be one data mart for accounts, one for Loans etc. This is high level definitions.
Metadata is data about data. e.g. if in data mart we are receving any file. then metadata

will contain information like how many columns there are, whether the file is fixed-width or delimited, the ordering of fields, the data types of the fields, etc.

8. What is the difference between dependent data warehouse and independent data
warehouse?
A dependent data mart draws its data from a central data warehouse, while an independent data mart draws its data directly from operational systems or external files. There is also a third, hybrid type of data mart, which sources data both from operational systems or external files and from the central data warehouse.
9. What are the storage models of OLAP?
ROLAP, MOLAP and HOLAP

10. What are CUBES?


A data cube stores data in a summarized form, which helps in faster analysis of data. The data is stored in such a way that it allows easy reporting.
E.g., using a data cube, a user may want to analyze the weekly and monthly performance of an employee. Here, month and week could be considered dimensions of the cube.

11. What is MODEL in Data mining world?


Models in Data mining help the different algorithms in decision making or pattern
matching. The second stage of data mining involves considering various models and
choosing the best one based on their predictive performance.

12. Explain how to mine an OLAP cube.


A data mining extension (DMX) can be used to slice the data of the source cube in the order discovered by data mining. When a cube is mined, the case table acts as a dimension.

13. Explain how to use DMX-the data mining query language.


Data Mining Extensions (DMX) is based on the syntax of SQL. It is based on relational concepts and is mainly used to create and manage data mining models. DMX comprises two types of statements: data definition and data manipulation. Data definition statements are used to define or create new models and structures.

14. Define Rollup and cube.


Custom rollup operators provide a simple way of controlling the process of rolling up a member to its parent's values. The rollup uses the contents of the column as a custom rollup operator for each member, which is used to evaluate the value of the member's parents.
If a cube has multiple custom rollup formulas and custom rollup members, then the formulas are resolved in the order in which the dimensions were added to the cube.

15. Differentiate between Data Mining and Data warehousing.


Data warehousing is merely extracting data from different sources, cleaning the data, and storing it in the warehouse, whereas data mining aims to examine or explore the data using queries. These queries can be fired on the data warehouse. Exploring the data in data mining helps in reporting, planning strategies, finding meaningful patterns, etc.

E.g., a company's data warehouse stores all the relevant information about projects and employees. Using data mining, one can use this data to generate different reports, like profits generated, etc.

16. What is Discrete and Continuous data in Data mining world?


Discrete data can be considered defined or finite data, e.g., mobile numbers, gender. Continuous data can be considered data which changes continuously and in an ordered fashion, e.g., age.

17. What is a Decision Tree Algorithm?


A decision tree is a tree in which every node is either a leaf node or a decision node. The tree takes an object as input and outputs some decision. All paths from the root node to a leaf node are reached by using AND, OR, or both. The tree is constructed using the regularities of the data. The decision tree is not affected by Automatic Data Preparation.

18. What is Naïve Bayes Algorithm?


The Naïve Bayes algorithm is used to generate mining models. These models help to identify relationships between the input columns and the predictable columns. The algorithm can be used in the initial stage of exploration. It calculates the probability of every state of each input column given each possible state of the predictable column. After the model is built, the results can be used for exploration and for making predictions.
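These class-conditional probability estimates can be sketched in a few lines of plain Python (hypothetical weather data, no smoothing, purely illustrative):

```python
from collections import defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(class) and P(attribute value | class) from raw counts."""
    class_counts = defaultdict(int)
    value_counts = defaultdict(int)   # (class, attr_index, value) -> count
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        for i, v in enumerate(row):
            value_counts[(label, i, v)] += 1
    return class_counts, value_counts

def predict(row, class_counts, value_counts):
    """Pick the class maximizing P(class) * product of P(value | class)."""
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for label, n in class_counts.items():
        p = n / total
        for i, v in enumerate(row):
            p *= value_counts[(label, i, v)] / n   # no smoothing, for brevity
        if p > best_p:
            best, best_p = label, p
    return best

rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
labels = ["no", "yes", "yes", "no"]
model = train_naive_bayes(rows, labels)
print(predict(("sunny", "mild"), *model))  # yes
```

The "naïve" assumption is the product in `predict`: attribute values are treated as independent given the class.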

19. Explain clustering algorithm.


A clustering algorithm is used to group sets of data with similar characteristics, called clusters. These clusters help in making faster decisions and in exploring data. The algorithm first identifies relationships in a dataset, following which it generates a series of clusters based on those relationships. The process of creating clusters is iterative: the algorithm redefines the groupings to create clusters that better represent the data.
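k-means is a common example of this iterative refinement: assign each point to its nearest centroid, then recompute each centroid as the mean of its cluster. A minimal sketch on hypothetical 2-D points:

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on 2-D points: assign to nearest centroid, then re-average."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                        + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # recompute each centroid as the mean of its cluster (keep old if empty)
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

With two well-separated groups like these, the loop settles on one centroid near each group after a few iterations.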

20. Explain Association algorithm in Data mining?


The association algorithm is used for recommendation engines based on market-basket analysis. Such an engine suggests products to customers based on what they bought earlier. The model is built on a dataset containing identifiers, both for individual cases and for the items that the cases contain. A group of items in a case is called an itemset. The algorithm traverses the data set to find items that appear together in a case. The MINIMUM_SUPPORT parameter restricts how frequently associated items must appear together before they form an itemset.

21. What are the goals of data mining?


Prediction, identification, classification and optimization

22. Is data mining an independent subject?
No, it is an interdisciplinary subject. It includes database technology, visualization, machine learning, pattern recognition, algorithms, etc.
23. What are the different types of databases?
Relational databases, data warehouses, and transactional databases.

24. What are the data mining functionalities?

Mining frequent patterns, association rules, classification and prediction, clustering, evolution analysis, and outlier analysis.

25. What are issues in data mining?


Issues in mining methodology, performance issues, user interaction issues, issues with diverse data types and sources, etc.
26. List some applications of data mining.
Agriculture, biological data analysis, call record analysis, decision support systems, business intelligence systems, etc.

27. What do you mean by interesting pattern?


A pattern is said to be interesting if it is (1) easily understood by humans, (2) valid, (3) potentially useful, and (4) novel.

28. Why do we pre-process the data?


To ensure data quality: accuracy, completeness, consistency, timeliness, believability, and interpretability.

29. What are the steps involved in data pre-processing?


Data cleaning, data integration, data reduction, data transformation.
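The cleaning and transformation steps can be illustrated with a tiny sketch on hypothetical records with typical quality problems (a missing value and an inconsistent spelling):

```python
# Hypothetical raw records: one missing age, one inconsistent city string.
raw = [
    {"name": "Asha", "age": 34, "city": "Guntur"},
    {"name": "Ravi", "age": None, "city": "guntur "},
    {"name": "Mani", "age": 28, "city": "Guntur"},
]

# Data cleaning: fill the missing age with the mean of the known values,
# and normalize the inconsistent city strings.
known = [r["age"] for r in raw if r["age"] is not None]
mean_age = sum(known) / len(known)
for r in raw:
    if r["age"] is None:
        r["age"] = mean_age
    r["city"] = r["city"].strip().title()

# Data transformation: min-max scale age into [0, 1] for mining algorithms
# that are sensitive to attribute ranges.
ages = [r["age"] for r in raw]
lo, hi = min(ages), max(ages)
for r in raw:
    r["age_scaled"] = (r["age"] - lo) / (hi - lo)

print(raw[1]["age"], raw[1]["city"])  # 31.0 Guntur
```

Integration (merging sources) and reduction (sampling, attribute selection) follow the same spirit: make the data consistent and compact before mining.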

30. What is distributed data warehouse?


A distributed data warehouse shares data across multiple data repositories for the purpose of OLAP operations.
