Data Engineering Lab

DATA ENGINEERING LABORATORY
ACHARYA NAGARJUNA UNIVERSITY

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
CHALAPATHI INSTITUTE OF ENGINEERING AND TECHNOLOGY
IV/IV B. Tech II SEM, 2021-22
MASTER LAB MANUAL

For
DATA ENGINEERING LAB
Prepared by
Sk. John Sydulu, Assistant Professor
T. Lavanya, Assistant Professor
CHALAPATHI INSTITUTE OF ENGINEERING AND TECHNOLOGY

CHALAPATHI NAGAR, LAM, GUNTUR-522034.

COLLEGE VISION:
To emerge as an Institute of Excellence for Engineering and Technology and provide
world-class education and research opportunities to the students, catering to the needs of society.
COLLEGE MISSION:
Establishing a state-of-the-art Engineering Institute with continuously improving infrastructure
and producing students with innovative skills and a global outlook.
DEPARTMENT VISION:
To produce professionally competent, research oriented and socially sensitive engineers and
technocrats in the emerging technologies.
DEPARTMENT MISSION:
DM1: State of art laboratories to meet the needs of the continuous change.
DM2: Provide a research environment to meet the societal issues.
DM3: Facilitating collaborations/MoUs towards emerging technologies.

PEO’s:
PEO-1: Graduates of the computer science program will aim for a successful professional career
and actively engage in applying new ideas/technologies as the field evolves.
PEO-2: Graduates can analyze real life problems and design computing solutions by applying
computer engineering theory and practices followed.
PEO-3: Graduates shall pursue higher studies or do research through quality education.
PO’s:
1. ENGINEERING KNOWLEDGE: Apply the knowledge of mathematics, science,
engineering fundamentals, and an engineering specialization to the solution of complex
engineering problems.
2. PROBLEM ANALYSIS: Identify, formulate, research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.
3. DESIGN/DEVELOPMENT OF SOLUTIONS: Design solutions for complex engineering
problems and design system components or processes that meet the specified needs with
appropriate consideration for the public health and safety, and the cultural, societal, and
environmental considerations.
4. CONDUCT INVESTIGATIONS OF COMPLEX PROBLEMS: Use research-based
knowledge and research methods including design of experiments, analysis and interpretation of
data, and synthesis of the information to provide valid conclusions.
5. MODERN TOOL USAGE: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modelling to complex engineering
activities with an understanding of the limitations.
6. THE ENGINEER AND SOCIETY: Apply reasoning informed by the contextual
knowledge to assess societal, health, safety, legal and cultural issues and the consequent
responsibilities relevant to the professional engineering practice.
7. ENVIRONMENT AND SUSTAINABILITY: Understand the impact of the professional
engineering solutions in societal and environmental contexts, and demonstrate the knowledge
of, and need for sustainable development.
8. ETHICS: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
9. INDIVIDUAL AND TEAM WORK: Function effectively as an individual, and as a
member or leader in diverse teams, and in multidisciplinary settings.
10. COMMUNICATION: Communicate effectively on complex engineering activities with
the engineering community and with society at large, such as, being able to comprehend and
write effective reports and design documentation, make effective presentations, give and receive
clear instructions.
11. PROJECT MANAGEMENT AND FINANCE: Demonstrate knowledge and
understanding of the engineering and management principles and apply these to one’s own
work, as a member and leader in a team, to manage projects and in multidisciplinary
environments.
12. LIFE-LONG LEARNING: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological change.
PSO’s :
PSO1: Professional Skills: The ability to understand, analyze and develop computer programs
in the areas related to algorithms, system software, multimedia, web design, big data analytics,
and networking for efficient design of computer-based systems of varying complexity.
PSO2: Problem-Solving Skills: The ability to apply standard practices and strategies in software
project development using open-ended programming environments to deliver a quality product
for business success.
PSO3: Successful Career and Entrepreneurship: The ability to employ modern computer
languages, environments, and platforms in creating innovative career paths to be an
entrepreneur, and a zest for higher studies.

DATA ENGINEERING LAB OBJECTIVES:
1. Practical exposure to the implementation of well-known data mining tasks.
2. Exposure to real-life data sets for analysis and prediction.
3. Learning performance evaluation of data mining algorithms in a supervised and an
unsupervised setting.
4. Handling a small data mining project for a given practical domain.
DATA ENGINEERING LAB OUTCOMES:
1. The data mining process and important issues around data cleaning, pre-processing and
integration.
2. The principal algorithms and techniques used in data mining, such as clustering,
association mining, classification and prediction.
3. Demonstrate understanding of the functionality of the various web mining and web
search components and appreciate the strengths and limitations of various web mining and
web search models.
4. Able to use the tools and techniques employed in data mining for different application
domains.
5. Describe different types of research and understand alternative research paradigms.

Mapping of CO’s and PO’s
 1
POS 1 2 3 4 5 6 7 8 9 10 11 2
COS
CO 1 3 2 3
CO 2 2 2 2 2
CO 3 3 2 3
2 2 1
CO 4
1 2 3
CO 5
H-HIGH M-MEDIUM L-LOW

SYLLABUS AS PER UNIVERSITY
Expt. No List of Experiments
1 Rollup And Cube Operations On The Following Tables
2 Cube Slicing- Come With 2-D View Data
3 Drill-down or Roll-down going from summary to more detailed data
4 Rollup - summarize data along a dimension hierarchy
5 Dicing – project 2-D view of data
6 Creating Star Schema and Snowflake Schema
7 Creating Fact Table
Additional Experiments
1 Write a Program to implement Apriori algorithm using WEKA
2 Write a program to implement FP Growth using WEKA
3 Write a program to implement DECISION TREE using WEKA
Expt. No   Experiment Name                                                   CO’s attained   PO’s attained
1          Rollup And Cube Operations On The Following Tables                CO1,CO2         PO3,PO4
2          Cube Slicing- Come With 2-D View Data                             CO1,CO2         PO3,PO4
3          Drill-down or Roll-down going from summary to more detailed data  CO1,CO2         PO3,PO4
4          Rollup - summarize data along a dimension hierarchy               CO1,CO2         PO3,PO4
5          Dicing – project 2-D view of data                                 CO1,CO2         PO3,PO4
6          Creating Star Schema and Snowflake Schema                         CO4             PO3,PO4
7          Creating Fact Table                                               CO3,CO4         PO3,PO4
Additional Experiments
1          Write a Program to implement Apriori algorithm using WEKA         CO3,CO5         PO4,PO5
2          Write a program to implement FP Growth using WEKA                 CO3,CO5         PO4,PO5
3          Write a program to implement DECISION TREE using WEKA             CO3,CO5         PO4,PO5
INDEX
S.NO   CONTENT
I      VISION AND MISSION
II     PEO’s, PO’s & PSO’s
III    COURSE OBJECTIVES AND OUTCOMES
IV     MAPPING OF CO’S AND PO’S
V      SYLLABUS
1      ROLLUP AND CUBE OPERATIONS ON THE FOLLOWING TABLES
2      CUBE SLICING- COME WITH 2-D VIEW DATA
3      DRILL-DOWN OR ROLL-DOWN GOING FROM SUMMARY TO MORE DETAILED DATA
4      ROLLUP - SUMMARIZE DATA ALONG A DIMENSION HIERARCHY
5      DICING – PROJECT 2-D VIEW OF DATA
6      CREATING STAR SCHEMA AND SNOWFLAKE SCHEMA
7      CREATING FACT TABLE
ADDITIONAL PROGRAMS
1      WRITE A PROGRAM TO IMPLEMENT APRIORI ALGORITHM USING WEKA
2      WRITE A PROGRAM TO IMPLEMENT FP GROWTH USING WEKA
3      WRITE A PROGRAM TO IMPLEMENT DECISION TREE USING WEKA
VI     VIVA QUESTIONS

EXPERIMENT 1: ROLLUP AND CUBE OPERATIONS ON THE FOLLOWING TABLES
AIM: Implement Cube operations
ALGORITHM:
STEP 1: CREATE A TABLE AND STORE THE DATA IN THE TABLE
STEP 2: WRITE A CONTROL FILE WITH THE EXTENSION .CTL
STEP 3: CREATE THE FILE FILENAME.CSV AND WRITE THE DATA INTO IT
STEP 4: AT THE DOS COMMAND PROMPT, EXECUTE THE COMMAND
STEP 5: AT THE SQL COMMAND PROMPT, EXECUTE THE COMMAND
COLUMN NAME   DATA TYPE   SIZE
PET_TYPE      VARCHAR2    15
STORE         VARCHAR2    15
NO            NUMBER      15
SQL>CREATE TABLE PETS(PET_TYPE VARCHAR2(8),STORE VARCHAR2(8),NO NUMBER(4));
PET_TYPE STORE NO
CAT MIAMI 18
DOG MIAMI 12
DOG TAMPA 14
TURTLE TAMPA 4
DOG NAPLES 5
TURTLE NAPLES 1
Control file (LOADER.CTL):
LOAD DATA
INFILE 'Z:\DELAB\EXPT2\PETS.CSV'
INTO TABLE PETS
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"'
(PET_TYPE,STORE,NO)
Data file (PETS.CSV):
CAT,MIAMI,18
DOG,MIAMI,16
DOG, yDFGY,17
Commands at the DOS prompt:
Z:\>CD DELAB
Z:\DELAB>CD EXPT2
Z:\DELAB\EXPT2>SQLLDR CSUSER16/CSUSER16@//10.60.70.80/ORCL CONTROL=LOADER.CTL
SQL>SELECT * FROM PETS;

SQL>SELECT PET_TYPE,STORE,SUM(NO) FROM PETS GROUP BY CUBE(PET_TYPE,STORE);
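The grouping sets behind ROLLUP and CUBE can be sketched outside Oracle as follows. This is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in database. SQLite has no GROUP BY CUBE or ROLLUP, so each grouping set is written as its own SELECT and the results are combined with UNION ALL. The column NO is renamed num here (an assumption made only to avoid an SQL keyword), and the rows are the six PETS rows from the table above.

```python
import sqlite3

# The six PETS rows listed in the table above; column NO is called num here.
PETS = [
    ("CAT", "MIAMI", 18),
    ("DOG", "MIAMI", 12),
    ("DOG", "TAMPA", 14),
    ("TURTLE", "TAMPA", 4),
    ("DOG", "NAPLES", 5),
    ("TURTLE", "NAPLES", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (pet_type TEXT, store TEXT, num INTEGER)")
conn.executemany("INSERT INTO pets VALUES (?, ?, ?)", PETS)

# One SELECT per grouping set; NULL marks a dimension that was aggregated away.
BY_BOTH = "SELECT pet_type, store, SUM(num) FROM pets GROUP BY pet_type, store"
BY_PET = "SELECT pet_type, NULL, SUM(num) FROM pets GROUP BY pet_type"
BY_STORE = "SELECT NULL, store, SUM(num) FROM pets GROUP BY store"
GRAND = "SELECT NULL, NULL, SUM(num) FROM pets"


def run(*selects):
    """Run the given SELECTs combined with UNION ALL and return all rows."""
    return conn.execute(" UNION ALL ".join(selects)).fetchall()


# ROLLUP(pet_type, store) covers (pet_type, store), (pet_type) and ().
rollup_rows = run(BY_BOTH, BY_PET, GRAND)
# CUBE(pet_type, store) covers every combination, so it adds the (store) subtotals.
cube_rows = run(BY_BOTH, BY_PET, BY_STORE, GRAND)

for row in cube_rows:
    print(row)
```

The only difference between the two results is the set of per-store subtotals, which CUBE produces and ROLLUP does not.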

EXPERIMENT 2: CUBE SLICING- COME WITH 2-D VIEW DATA
AIM: Implement Cube operation

Slice
A slice is a subset of a multidimensional array corresponding to a single value for one or
more members of the dimensions not in the subset.
Slicing also limits the space needed for a dynamic data cube, compared with the full data
cube, by leaving out data that is not required.
ALGORITHM:
STEP 1: SELECT ITEM_TYPE, PURCHASE_DATE, SOLD_QTY
STEP 2: SELECT BRANCH_CITY, PURCHASE_DATE, SOLD_QTY
STEP 3: EXECUTE THE QUERY GIVEN BELOW AT THE SQL COMMAND PROMPT
2-dimensional cuboid:
a)
SELECT item.item_type ,purchases.date1,sum(items_sold.qty) FROM
item,purchases,items_sold
WHERE item.item_id=items_sold.item_id AND
purchases.trans_id=items_sold.trans_id GROUP BY
item.item_type,purchases.date1;
Output:
ITEM_TYPE DATE1 SUM(ITEMS_SOLD.QTY)
COMPUTER 05-MAR-07 3
TV 05-MAR-07 1
HAP 03-JAN-07 8
COMPUTER 03-JAN-07 7
TV 08-APR-09 3
TV 20-JUN-06 6
HAP 01-FEB-09 9
TV 01-FEB-09 3
HAP 20-JUN-08 1
HAP 01-FEB-06 5
TV 20-JUN-09 3
COMPUTER 20-JUN-06 5
HAP 01-FEB-08 8
HAP 05-MAR-09 6
TV 20-JUN-08 4
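The difference between the 2-D cuboid and a slice of it can be sketched as follows. This is a minimal sketch, assuming Python's built-in sqlite3 module, cut-down versions of the item, purchases and items_sold tables, and a handful of hypothetical rows (not the data behind the output above). The slice fixes the item_type dimension to the single value 'TV'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_id TEXT PRIMARY KEY, item_type TEXT);
CREATE TABLE purchases (trans_id TEXT PRIMARY KEY, date1 TEXT);
CREATE TABLE items_sold (trans_id TEXT, item_id TEXT, qty INTEGER);
-- Hypothetical sample rows, used only for illustration.
INSERT INTO item VALUES ('I1','TV'), ('I2','COMPUTER'), ('I3','HAP');
INSERT INTO purchases VALUES
  ('T100','2006-01-03'), ('T101','2006-02-01'), ('T102','2007-03-05');
INSERT INTO items_sold VALUES
  ('T100','I1',1), ('T100','I2',2), ('T101','I1',3),
  ('T101','I3',4), ('T102','I1',5), ('T102','I2',6);
""")

# The 2-D cuboid query of this experiment, with a slot for an extra condition.
BASE = """
SELECT item.item_type, purchases.date1, SUM(items_sold.qty)
FROM item, purchases, items_sold
WHERE item.item_id = items_sold.item_id
  AND purchases.trans_id = items_sold.trans_id
  {extra}
GROUP BY item.item_type, purchases.date1
ORDER BY item.item_type, purchases.date1
"""

# Full (item_type, date1) cuboid.
cuboid = conn.execute(BASE.format(extra="")).fetchall()
# Slice: hold item_type at one value and keep the remaining dimension.
tv_slice = conn.execute(
    BASE.format(extra="AND item.item_type = ?"), ("TV",)
).fetchall()

print(cuboid)
print(tv_slice)
```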

Experiment 3: Drill-down or Roll-down going from summary to more detailed data
ALGORITHM:
STEP 1. Create the tables ITEM, BRANCH, PURCHASES, WORKS_AT and ITEMS_SOLD
STEP 2. Mention the join conditions ITEM.ITEM_ID = ITEMS_SOLD.ITEM_ID and
BRANCH.BRANCH_ID = WORKS_AT.BRANCH_ID
STEP 3. Mention the join condition PURCHASES.TRANS_ID = ITEMS_SOLD.TRANS_ID
STEP 4. EXECUTE THE QUERY GIVEN BELOW AT THE SQL COMMAND PROMPT
Solution for expt3:
SELECT item.item_type, branch.branch_city, purchases.date1, sum(items_sold.qty)
FROM item, branch, purchases, items_sold, works_at
WHERE item.item_id=items_sold.item_id AND
branch.branch_id=works_at.branch_id AND
purchases.trans_id=items_sold.trans_id
GROUP BY item.item_type,branch.branch_city,purchases.date1;
Output:
ITEM_TYPE BRANCH_CITY DATE1 SUM(ITEMS_SOLD.QTY)
COMPUTER CHENNAI 03-JAN-06 4
TV CHENNAI 05-MAR-07 2
COMPUTER BNG 05-MAR-07 6
TV HYD 05-MAR-08 8
TV BNG 05-MAR-08 4
HAP CHENNAI 05-MAR-08 14
TV GUNTUR 08-APR-09 6
COMPUTER HYD 08-APR-09 8
HAP GUNTUR 08-APR-09 16
TV HYD 20-JUN-06 24
TV BNG 20-JUN-06 12
TV GUNTUR 20-JUN-06 12
HAP GUNTUR 20-JUN-06 4
TV CHENNAI 03-JAN-08 8
TV BNG 01-FEB-08 8
COMPUTER CHENNAI 01-FEB-08 10
240 rows returned
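Drill-down can be sketched as running the same aggregate with one more dimension added at each step. This is a minimal sketch, assuming Python's built-in sqlite3 module, cut-down tables and hypothetical rows. One further assumption goes beyond the query above: works_at is also joined to purchases through the employee id, so that each sale is counted for one branch only. The helper name cuboid is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_id TEXT PRIMARY KEY, item_type TEXT);
CREATE TABLE branch (branch_id TEXT PRIMARY KEY, branch_city TEXT);
CREATE TABLE works_at (empl_id TEXT, branch_id TEXT);
CREATE TABLE purchases (trans_id TEXT PRIMARY KEY, emp_id TEXT, date1 TEXT);
CREATE TABLE items_sold (trans_id TEXT, item_id TEXT, qty INTEGER);
-- Hypothetical sample rows, used only for illustration.
INSERT INTO item VALUES ('I1','TV'), ('I2','COMPUTER');
INSERT INTO branch VALUES ('B1','GUNTUR'), ('B2','CHENNAI');
INSERT INTO works_at VALUES ('E1','B1'), ('E2','B2');
INSERT INTO purchases VALUES
  ('T100','E1','2006-01-03'), ('T101','E2','2006-01-03'), ('T102','E1','2007-03-05');
INSERT INTO items_sold VALUES
  ('T100','I1',1), ('T100','I2',2), ('T101','I1',3), ('T102','I1',4);
""")

JOINS = """
FROM item, branch, purchases, items_sold, works_at
WHERE item.item_id = items_sold.item_id
  AND purchases.trans_id = items_sold.trans_id
  AND works_at.empl_id = purchases.emp_id   -- assumption: ties a sale to its branch
  AND branch.branch_id = works_at.branch_id
"""


def cuboid(*dims):
    """Total quantity sold, grouped by the given dimension columns."""
    cols = ", ".join(dims)
    sql = f"SELECT {cols}, SUM(items_sold.qty) {JOINS} GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()


# Each call drills one level further down: summary, then city, then date.
summary = cuboid("item.item_type")
by_city = cuboid("item.item_type", "branch.branch_city")
detail = cuboid("item.item_type", "branch.branch_city", "purchases.date1")

for level in (summary, by_city, detail):
    print(level)
```

Every level adds up to the same grand total, which is a quick check that the drill-down only redistributes the measure and does not inflate it.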

Experiment 4: Rollup - summarize data along a dimension hierarchy
ALGORITHM:
STEP 1: SELECT ITEM_TYPE, SOLD_QTY FROM ITEM AND ITEMS_SOLD
STEP 2: SELECT PURCHASE_DATE, SOLD_QTY FROM PURCHASES AND ITEMS_SOLD
STEP 3: SELECT CUST_ID, SOLD_QTY FROM PURCHASES AND ITEMS_SOLD
STEP 4: SELECT THE SUM OF QTY FROM ITEMS_SOLD
STEP 5: EXECUTE THE QUERIES GIVEN BELOW AT THE SQL COMMAND PROMPT
Solution for expt4:
1-dimensional
i) SELECT item.item_type ,sum(items_sold.qty) FROM item,items_sold
WHERE item.item_id=items_sold.item_id GROUP BY item.item_type;
Output:
ITEM_TYPE SUM(ITEMS_SOLD.QTY)
COMPUTER 65
TV 70
HAP 105
3 rows returned
ii) SELECT purchases.date1,sum(items_sold.qty) FROM purchases, items_sold
WHERE purchases.trans_id=items_sold.trans_id GROUP BY purchases.date1;
Output:
DATE1 SUM(ITEMS_SOLD.QTY)
03-JAN-06 6
05-MAR-09 18
08-APR-07 11
20-JUN-06 13
20-JUN-07 6
20-JUN-08 7
01-FEB-06 11
03-JAN-09 6
03-JAN-07 20
01-FEB-07 12
08-APR-09 13
08-APR-06 14
01-FEB-09 14
05-MAR-08 13
05-MAR-07 10
08-APR-08 11
20-JUN-09 6
03-JAN-08 19
01-FEB-08 17
05-MAR-06 13
20 rows returned
iii) SELECT purchases.cust_id,sum(items_sold.qty) FROM purchases,items_sold
WHERE purchases.trans_id=items_sold.trans_id GROUP BY purchases.cust_id;
Output:
CUST_ID SUM(ITEMS_SOLD.QTY)
C4 49
C5 32
C2 54
C1 51
C3 54
5 rows returned
Apex cuboid
SELECT sum(items_sold.qty) FROM items_sold;
Output:
SUM(ITEMS_SOLD.QTY)
240
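Rolling up along a dimension hierarchy can be sketched on the date dimension, climbing from day to month to year and finally to the apex. This is a minimal sketch, assuming Python's built-in sqlite3 module, dates stored as ISO text (YYYY-MM-DD) so that a prefix of the string identifies the month or year, and hypothetical rows. The names LEVELS and totals are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (trans_id TEXT PRIMARY KEY, date1 TEXT);
CREATE TABLE items_sold (trans_id TEXT, item_id TEXT, qty INTEGER);
-- Hypothetical sample rows, used only for illustration.
INSERT INTO purchases VALUES
  ('T100','2006-01-03'), ('T101','2006-01-20'),
  ('T102','2006-02-01'), ('T103','2007-03-05');
INSERT INTO items_sold VALUES
  ('T100','I1',1), ('T101','I2',2), ('T102','I1',3),
  ('T103','I3',4), ('T103','I1',5);
""")

# Number of leading characters of an ISO date that identify each hierarchy level.
LEVELS = {"day": 10, "month": 7, "year": 4}


def totals(level):
    """Quantity sold per period at the requested level of the date hierarchy."""
    sql = """
        SELECT substr(purchases.date1, 1, ?) AS period, SUM(items_sold.qty)
        FROM purchases, items_sold
        WHERE purchases.trans_id = items_sold.trans_id
        GROUP BY period
        ORDER BY period
    """
    return conn.execute(sql, (LEVELS[level],)).fetchall()


by_day = totals("day")
by_month = totals("month")
by_year = totals("year")
# Apex cuboid: every dimension rolled up into a single total.
apex = conn.execute("SELECT SUM(qty) FROM items_sold").fetchone()[0]

print(by_day, by_month, by_year, apex, sep="\n")
```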

Experiment 5: Dicing – project 2-D view of data
ALGORITHM:
STEP 1: SELECT ITEM_TYPE, PURCHASE_DATE, SOLD_QTY
STEP 2: SELECT BRANCH_CITY, PURCHASE_DATE, SOLD_QTY
STEP 3: EXECUTE THE QUERY GIVEN BELOW AT THE SQL COMMAND PROMPT
a)
SELECT item.item_type ,purchases.date1,sum(items_sold.qty) FROM
item,purchases,items_sold
WHERE item.item_id=items_sold.item_id AND
purchases.trans_id=items_sold.trans_id GROUP BY
item.item_type,purchases.date1;
Output:
ITEM_TYPE DATE1 SUM(ITEMS_SOLD.QTY)
TV 05-MAR-07 1
HAP 03-JAN-07 8
TV 08-APR-09 3
TV 20-JUN-06 6
HAP 01-FEB-09 9
TV 01-FEB-09 3
HAP 20-JUN-08 1
HAP 01-FEB-06 5
TV 20-JUN-09 3
HAP 01-FEB-08 8
HAP 05-MAR-09 6
TV 20-JUN-08 4
TV 01-FEB-07 2
TV 05-MAR-08 2
HAP 03-JAN-08 8
TV 03-JAN-08 4
COMPUTER 08-APR-06 2
60 rows returned
b)
SELECT branch.branch_city,purchases.date1,sum(items_sold.qty) FROM
branch,purchases,items_sold,works_at
WHERE branch.branch_id=works_at.branch_id AND
purchases.trans_id=items_sold.trans_id GROUP BY
branch.branch_city,purchases.date1;
Output:
BRANCH_CITY DATE1 SUM(ITEMS_SOLD.QTY)
GUNTUR 05-MAR-08 26
CHENNAI 08-APR-07 22
CHENNAI 03-JAN-06 12
HYD 01-FEB-07 48
BNG 08-APR-06 28
BNG 20-JUN-09 12
GUNTUR 20-JUN-08 14
GUNTUR 03-JAN-09 12
GUNTUR 20-JUN-07 12
GUNTUR 08-APR-06 28
GUNTUR 20-JUN-06 26
GUNTUR 03-JAN-07 40
GUNTUR 05-MAR-07 20
CHENNAI 05-MAR-08 26
CHENNAI 03-JAN-07 40
CHENNAI 05-MAR-07 20
CHENNAI 01-FEB-06 22
HYD 20-JUN-08 28
HYD 05-MAR-06 52
HYD 03-JAN-09 24
HYD 05-MAR-08 52
HYD 08-APR-08 44
HYD 03-JAN-06 24
BNG 20-JUN-08 14
BNG 08-APR-07 22
80 rows returned
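A dice restricts two or more dimensions at once and so cuts a smaller sub-cube out of the cuboid. This is a minimal sketch, assuming Python's built-in sqlite3 module, cut-down tables, ISO text dates and hypothetical rows. The chosen item types and date range are arbitrary illustrations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_id TEXT PRIMARY KEY, item_type TEXT);
CREATE TABLE purchases (trans_id TEXT PRIMARY KEY, date1 TEXT);
CREATE TABLE items_sold (trans_id TEXT, item_id TEXT, qty INTEGER);
-- Hypothetical sample rows, used only for illustration.
INSERT INTO item VALUES ('I1','TV'), ('I2','COMPUTER'), ('I3','HAP');
INSERT INTO purchases VALUES
  ('T100','2006-01-03'), ('T101','2006-02-01'), ('T102','2007-03-05');
INSERT INTO items_sold VALUES
  ('T100','I1',1), ('T100','I2',2), ('T101','I1',3),
  ('T101','I3',4), ('T102','I1',5), ('T102','I2',6);
""")

# Dice: restrict item_type to a set of values AND date1 to a range.
dice = conn.execute("""
    SELECT item.item_type, purchases.date1, SUM(items_sold.qty)
    FROM item, purchases, items_sold
    WHERE item.item_id = items_sold.item_id
      AND purchases.trans_id = items_sold.trans_id
      AND item.item_type IN ('TV', 'HAP')
      AND purchases.date1 BETWEEN '2006-01-01' AND '2006-12-31'
    GROUP BY item.item_type, purchases.date1
    ORDER BY item.item_type, purchases.date1
""").fetchall()

print(dice)
```

Compared with a slice, which fixes one dimension to a single value, the dice keeps both dimensions but narrows each of them.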
Run the following schema for Experiments 2 to 5:
create table customer(cust_id varchar2(20), cust_name varchar2(20), cust_city varchar2(20),
cust_state varchar2(20), cust_country varchar2(20), cust_age number(3), cust_income
number(9,3), primary key(cust_id));
insert into customer values('C1','MANIDEEP','GUNTUR','AP','INDIA',23,35000);
insert into customer values('C2','MADHU','ONGOLE','AP','INDIA',23,40000);
insert into customer values('C3','ARUNBABU','GUNTUR','AP','INDIA',23,26000);
insert into customer values('C4','RAKESH','BENGALORE','KARNATAKA','INDIA',24,25000);
insert into customer values('C5','SHIRAJ','CHENNAI','TN','INDIA',25,38000);
create table item(item_id varchar2(20), item_name varchar2(20), item_brand varchar2(20),
item_type varchar2(20), primary key (item_id));
insert into item values('I1','HDTV','SAMSUNG','TV');
insert into item values('I2','LAPTOP','DELL','COMPUTER');
insert into item values('I3','MICROWAVE OVEN','LG','HAP');
create table employee(emp_id varchar2(20),emp_name varchar2(20),emp_category
varchar2(30), primary key(emp_id));
insert into employee values('E1','JOHN','HOMEENTERTAIN');
insert into employee values('E2','SMITH','ELECTRONICS');
insert into employee values('E3','MILLER','ELECTRONICS');
insert into employee values('E4','SCOTT','HOUSEELECTRONICS');
insert into employee values('E5','KEVIN','AUTOMOBILE');
insert into employee values('E6','WARNE','HOMEENTERTAIN');
insert into employee values('E7','WATSON','ELECTRONICS');
insert into employee values('E8','HAYES','ELECTRONICS');
insert into employee values('E9','RODES','HOUSEELECTRONICS');
insert into employee values('E10','PETER','AUTOMOBILE');
create table branch(branch_id varchar2(20), branch_name varchar2(20), branch_city
varchar2(20), branch_state varchar2(20), branch_country varchar2(20), primary
key(branch_id));
insert into branch values('B1','CITYSQ','GUNTUR','AP','INDIA');
insert into branch values('B2','POTHIES','CHENNAI','TN','INDIA');
insert into branch values('B3','CMR','HYD','AP','INDIA');
insert into branch values('B4','MCM','BNG','KTK','INDIA');
insert into branch values('B5','GLAND','HYD','AP','INDIA');
create table purchases(trans_id varchar2(20), cust_id varchar2(20), emp_id varchar2(20),
date1 date, primary key(trans_id), foreign key (cust_id) references customer(cust_id),
foreign key (emp_id) references employee(emp_id));
insert into purchases values('T100','C1','E1','03-JAN-06');
insert into purchases values('T101','C2','E2','01-FEB-06');
insert into purchases values('T102','C3','E3','05-MAR-07');
insert into purchases values('T103','C4','E4','08-APR-08');
insert into purchases values('T104','C5','E5','20-JUN-09');
create table items_sold(trans_id varchar2(20), item_id varchar2(20), qty number(10),
foreign key (trans_id) references purchases(trans_id),
foreign key (item_id) references item(item_id));
insert into items_sold values('T100','I1',1);
create table works_at(empl_id varchar2(20), branch_id varchar2(20),
foreign key (empl_id) references employee(emp_id),
foreign key (branch_id) references branch(branch_id));
insert into works_at values('E1','B1');

Experiment 6: Creating Star Schema and Snowflake Schema.
AIM: Implementation of Star Schema and Snowflake Schema
Theory:
Schema Modeling Techniques:
Schemas in Data Warehouses
1. Third Normal Form
2. Star Schemas
3. Optimizing Star Queries
A schema is a collection of database objects, including tables, views, indexes, and synonyms.
Star Schemas
The star schema is the simplest data warehouse schema. It is called a star schema
because the entity-relationship diagram of this schema resembles a star, with points radiating
from a central table. The center of the star consists of a large fact table and the points of the
star are the dimension tables.
A star schema is characterized by one or more very large fact tables that contain the
primary information in the data warehouse, and a number of much smaller dimension tables
(or lookup tables), each of which contains information about the entries for a particular
attribute in the fact table.
A star query is a join between a fact table and a number of dimension tables. Each
dimension table is joined to the fact table using a primary key to foreign key join, but the
dimension tables are not joined to each other. The cost-based optimizer recognizes star
queries and generates efficient execution plans for them.
A typical fact table contains keys and measures. For example, in the sh sample
schema, the fact table, sales, contains the measures quantity_sold, amount, and cost, and
the keys cust_id, time_id, prod_id, channel_id, and promo_id. The dimension tables are
customers, times, products, channels, and promotions. The product dimension table, for
example, contains information about each product number that appears in the fact table.
A star join is a primary key to foreign key join of the dimension tables to a fact table.
The main advantages of star schemas are that they:

1. Provide a direct and intuitive mapping between the business entities being analyzed
by end users and the schema design.
2. Provide highly optimized performance for typical star queries.
3. Are widely supported by a large number of business intelligence tools, which may
anticipate or even require that the data-warehouse schema contain dimension tables.
Star schemas are used for both simple data marts and very large data warehouses.
Figure: Star Schema
Snowflake Schemas
The snowflake schema is a more complex data warehouse model than a star schema,
and is a type of star schema. It is called a snowflake schema because the diagram of the
schema resembles a snowflake.
Snowflake schemas normalize dimensions to eliminate redundancy. That is, the
dimension data has been grouped into multiple tables instead of one large table.
For example, a product dimension table in a star schema might be normalized into a
products table, a product_category table, and a product_manufacturer table in a snowflake
schema. While this saves space, it increases the number of dimension tables and requires
more foreign key joins. The result is more complex queries and reduced query performance.
Figure: Snowflake Schema
Note:
Oracle Corporation recommends you choose a star schema over a snowflake schema
unless you have a clear reason not to.
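The difference between the two schema styles can be sketched with a very small example. This is a minimal sketch, assuming Python's built-in sqlite3 module and cut-down, hypothetical versions of the sales, products and times tables. The tables products_sf and product_category are hypothetical names for the snowflaked product dimension, and all rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star schema: each dimension is one denormalized table.
CREATE TABLE products (prod_id INTEGER PRIMARY KEY, prod_name TEXT, category_name TEXT);
CREATE TABLE times (time_id INTEGER PRIMARY KEY, calendar_year INTEGER);
CREATE TABLE sales (
    prod_id INTEGER REFERENCES products(prod_id),
    time_id INTEGER REFERENCES times(time_id),
    quantity_sold INTEGER,
    amount INTEGER
);
-- Snowflake variant of the product dimension: the category is normalized out.
CREATE TABLE product_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE products_sf (
    prod_id INTEGER PRIMARY KEY,
    prod_name TEXT,
    category_id INTEGER REFERENCES product_category(category_id)
);
-- Hypothetical sample rows.
INSERT INTO products VALUES (1,'HDTV','TV'), (2,'LAPTOP','COMPUTER');
INSERT INTO product_category VALUES (10,'TV'), (20,'COMPUTER');
INSERT INTO products_sf VALUES (1,'HDTV',10), (2,'LAPTOP',20);
INSERT INTO times VALUES (1,2006), (2,2007);
INSERT INTO sales VALUES (1,1,2,1000), (2,1,1,700), (1,2,3,1500);
""")

# Star query: the fact table joins directly to each dimension table.
star = conn.execute("""
    SELECT p.category_name, t.calendar_year, SUM(s.amount)
    FROM sales s
    JOIN products p ON p.prod_id = s.prod_id
    JOIN times t ON t.time_id = s.time_id
    GROUP BY p.category_name, t.calendar_year
    ORDER BY p.category_name, t.calendar_year
""").fetchall()

# Snowflake query: one extra join is needed to reach the category name.
snowflake = conn.execute("""
    SELECT c.category_name, t.calendar_year, SUM(s.amount)
    FROM sales s
    JOIN products_sf p ON p.prod_id = s.prod_id
    JOIN product_category c ON c.category_id = p.category_id
    JOIN times t ON t.time_id = s.time_id
    GROUP BY c.category_name, t.calendar_year
    ORDER BY c.category_name, t.calendar_year
""").fetchall()

print(star)
print(snowflake)
```

Both queries return the same totals; the snowflake version stores each category name once but pays for it with the additional join.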

Experiment 7: Creating Fact Table.
AIM: Implementation of Fact Table
Theory:
Fact Tables
A fact table typically has two types of columns: those that contain numeric facts
(often called measurements), and those that are foreign keys to dimension tables.
A fact table contains either detail-level facts or facts that have been aggregated.
Fact tables that contain aggregated facts are often called summary tables. A fact table
usually contains facts with the same level of aggregation.
Though most facts are additive, they can also be semi-additive or non-additive.
Additive facts can be aggregated by simple arithmetical addition. A common example of this
is sales. Non-additive facts cannot be added at all. An example of this is averages. Semi-
additive facts can be aggregated along some of the dimensions and not along others. An
example of this is inventory levels, where you cannot tell what a level means simply by
looking at it.
Creating a New Fact Table
You must define a fact table for each star schema. From a modeling standpoint, the
primary key of the fact table is usually a composite key that is made up of all of its foreign
keys.
Figure: A common example of a sales fact table and the dimension tables customers,
products, promotions, times, and channels.
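A fact table whose primary key is the composite of its foreign keys can be sketched as follows. This is a minimal sketch, assuming Python's built-in sqlite3 module (with foreign key enforcement switched on), a cut-down sales fact table with three dimensions, and hypothetical rows. The helper name rejected is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
PRAGMA foreign_keys = ON;
CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, cust_name TEXT);
CREATE TABLE products (prod_id INTEGER PRIMARY KEY, prod_name TEXT);
CREATE TABLE times (time_id INTEGER PRIMARY KEY, day TEXT);
-- Fact table: foreign keys to the dimensions plus numeric measures.
CREATE TABLE sales (
    cust_id INTEGER NOT NULL REFERENCES customers(cust_id),
    prod_id INTEGER NOT NULL REFERENCES products(prod_id),
    time_id INTEGER NOT NULL REFERENCES times(time_id),
    quantity_sold INTEGER,
    amount INTEGER,
    PRIMARY KEY (cust_id, prod_id, time_id)   -- composite key of all foreign keys
);
-- Hypothetical sample rows.
INSERT INTO customers VALUES (1,'MADHU'), (2,'RAKESH');
INSERT INTO products VALUES (1,'HDTV'), (2,'LAPTOP');
INSERT INTO times VALUES (1,'2006-01-03'), (2,'2006-02-01');
INSERT INTO sales VALUES (1,1,1,2,200), (1,2,1,1,700), (2,1,2,3,300);
""")


def rejected(row):
    """Return True if the database refuses to insert this fact row."""
    try:
        conn.execute("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", row)
    except sqlite3.IntegrityError:
        return True
    return False


duplicate_rejected = rejected((1, 1, 1, 5, 500))   # same composite key as an existing row
orphan_rejected = rejected((1, 99, 1, 1, 100))     # prod_id 99 is not in products

# amount is an additive fact: it can simply be summed across all dimensions.
total_amount = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(duplicate_rejected, orphan_rejected, total_amount)
```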

Experiment-1: Write a Program to implement Apriori algorithm using WEKA
Procedure:
Step1:- Click the WEKA icon; the WEKA GUI Chooser appears. Choose Explorer and load the
dataset into WEKA.
Step2:- Choose the Associate tab.
Step3:- Select Apriori from “Choose”, then click Start.
Step4:- The output can be viewed in the Associator output frame.
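What the Apriori associator computes can be sketched in a few lines of plain Python. This is a minimal sketch of the level-wise algorithm, not WEKA's implementation. The function name apriori, the market-basket transactions and the minimum support count of 3 are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Count each candidate's support and keep only the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {item for t in transactions for item in t}
    current = count({frozenset([i]) for i in items})
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = count(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical market-basket data.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]

frequent = apriori(TRANSACTIONS, min_support=3)
# Confidence of the rule beer -> diaper = support(beer, diaper) / support(beer).
confidence = frequent[frozenset({"beer", "diaper"})] / frequent[frozenset({"beer"})]

for itemset, support in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
print("confidence(beer -> diaper) =", confidence)
```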
Data set:
Visual output:

Experiment -2: Write a program to implement FP Growth using WEKA
Procedure:
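Step1:- Click the WEKA icon; the WEKA GUI Chooser appears. Choose Explorer and load the
dataset into WEKA.
Step2:- Select the Associate tab.
Step3:- Click “Choose”. Select FPGrowth in associations.
Step4:- Click Start for the output.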
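The idea behind FP-Growth can be sketched in plain Python: build a prefix tree (FP-tree) of the transactions, then mine it recursively through conditional pattern bases instead of generating candidates. This is a minimal sketch, not WEKA's implementation. The names Node, build_tree and fp_growth, the transactions and the minimum support count of 3 are hypothetical.

```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item, a link to its parent, a count and its children."""

    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}


def build_tree(weighted_transactions, min_support):
    """Build an FP-tree; return (header table: item -> nodes, frequent item counts)."""
    counts = defaultdict(int)
    for items, weight in weighted_transactions:
        for item in items:
            counts[item] += weight
    counts = {i: c for i, c in counts.items() if c >= min_support}
    root, header = Node(None, None), defaultdict(list)
    for items, weight in weighted_transactions:
        # Keep frequent items only, most frequent first (ties broken by name).
        ordered = sorted((i for i in items if i in counts), key=lambda i: (-counts[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += weight
    return header, counts


def fp_growth(weighted_transactions, min_support, suffix=frozenset()):
    """Yield (itemset, support count) for every frequent itemset."""
    header, counts = build_tree(weighted_transactions, min_support)
    for item, nodes in header.items():
        itemset = suffix | {item}
        yield itemset, counts[item]
        # Conditional pattern base: the prefix path of every node holding `item`.
        base = []
        for node in nodes:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        yield from fp_growth(base, min_support, itemset)


# Hypothetical market-basket data; each transaction has weight 1.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]

mined = dict(fp_growth([(t, 1) for t in TRANSACTIONS], min_support=3))
for itemset, support in sorted(mined.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

The result is the same set of frequent itemsets that a level-wise search such as Apriori would find for this data, obtained without generating candidate itemsets.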
Data selection:
OUTPUT:

Experiment-3: Write a program to implement DECISION TREE using WEKA
Procedure:
Step1:- Click the WEKA icon; the WEKA GUI Chooser appears. Choose Explorer and load the
dataset into WEKA.
Step2:- Choose the Classify tab.
Step3:- Select “Use training set” in Test options.
Step4:- Click Choose in the Classifier panel.
Step5:- It displays many classifier groups. Select trees among them.
Step6:- Select J48 from the trees group.
Step7:- Click Start.
Step8:- Go to the Result list.
Step9:- The Result list contains an entry such as "11:37:42 - trees.J48"; right-click it and
select Visualize tree.
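How a decision tree learner chooses the attribute to split on can be sketched with entropy, information gain and gain ratio, the criteria used by C4.5 (J48 is WEKA's implementation of C4.5). This is a minimal sketch, not J48 itself. It assumes the 14-instance weather.nominal data that ships with WEKA, since the manual's own data set is not shown, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

ATTRS = ["outlook", "temperature", "humidity", "windy", "play"]
# The weather.nominal data bundled with WEKA (last column is the class).
ROWS = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def split_stats(attr):
    """Return (information gain, gain ratio) for splitting ROWS on attr."""
    col = ATTRS.index(attr)
    groups = {}
    for row in ROWS:
        groups.setdefault(row[col], []).append(row[-1])
    # Weighted entropy of the class labels after the split.
    remainder = sum(len(g) / len(ROWS) * entropy(g) for g in groups.values())
    gain = entropy([row[-1] for row in ROWS]) - remainder
    # Split information: entropy of the attribute's own value distribution.
    split_info = entropy([row[col] for row in ROWS])
    return gain, gain / split_info


stats = {attr: split_stats(attr) for attr in ATTRS[:-1]}
best = max(stats, key=lambda attr: stats[attr][1])

for attr, (gain, ratio) in stats.items():
    print(f"{attr:12s} gain={gain:.3f} gain_ratio={ratio:.3f}")
print("root split:", best)
```

The attribute with the highest gain ratio becomes the root of the tree, and the same computation is repeated on each branch.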
Data view:

Visual Output:

VIVA QUESTIONS
1. What is a data warehouse?
A data warehouse is an electronic store of an organization's historical data for the
purpose of reporting, analysis and data mining or knowledge discovery.
2. What are the benefits of a data warehouse?
A data warehouse helps to integrate data and store it historically so that we can
analyze different aspects of the business, including performance analysis, trends and
prediction over a given time frame, and use the results of our analysis to improve the
efficiency of business processes.
3. What is the difference between OLTP and OLAP?
OLTP is the transaction system that collects business data, whereas OLAP is the
reporting and analysis system on that data. OLTP systems are optimized for INSERT and
UPDATE operations and are therefore highly normalized. On the other hand, OLAP systems
are deliberately denormalized for fast data retrieval through SELECT operations.
4. What is a data mart?
Data marts are generally designed for a single subject area. An organization may have
data pertaining to different departments like Finance, HR, Marketing etc. stored in the data
warehouse, and each department may have a separate data mart. These data marts can be
built on top of the data warehouse.
5. What is a dimension?
A dimension is something that qualifies a quantity (measure).
For example, if I just say "20kg", it does not mean anything. But if I say "20kg of Rice
(product) is sold to Ramesh (customer) on 5th April (date)", then that makes sense. The
product, customer and date are dimensions that qualify the measure, 20kg. Dimensions are
mutually independent. Technically speaking, a dimension is a data element that categorizes
each item in a data set into non-overlapping regions.
6. What is Fact?
A fact is something that is quantifiable (Or measurable). Facts are typically (but not
always) numerical values that can be aggregated.
7. Briefly state the difference between a data warehouse and a data mart.
A data warehouse is made up of many data marts. A data warehouse contains many subject
areas, whereas a data mart generally focuses on one subject area. For example, a bank's data
warehouse can have one data mart for accounts, one for loans, and so on. These are high-level
definitions.
Metadata is data about data. For example, if a data mart receives a file, the metadata
records how many columns the file has, whether it is fixed-width or delimited, the ordering of
the fields, the data type of each field, and so on.
8. What is the difference between dependent data warehouse and independent data
warehouse?
A dependent data mart draws its data from a central data warehouse, whereas an independent
data mart is built directly from operational systems or external files. There is a third type
of data mart called hybrid, which takes its source data from operational systems or external
files and from the central data warehouse as well.
9. What are the storage models of OLAP?
ROLAP, MOLAP and HOLAP
10. What are CUBES?
A data cube stores data in a summarized form, which helps in faster analysis of the data.
The data is stored in such a way that it allows easy reporting.
E.g. using a data cube, a user may want to analyze the weekly or monthly performance of an
employee. Here, month and week could be considered as the dimensions of the cube.
11. What is MODEL in Data mining world?

Models in Data mining help the different algorithms in decision making or pattern
matching. The second stage of data mining involves considering various models and
choosing the best one based on their predictive performance.
12. Explain how to mine an OLAP cube.

A data mining extension can be used to slice the data of the source cube in the order
discovered by data mining. When a cube is mined, the case table is a dimension.
13. Explain how to use DMX, the data mining query language.

Data Mining Extensions (DMX) is based on the syntax of SQL. It is based on relational
concepts and is mainly used to create and manage data mining models. DMX comprises two
types of statements: data definition and data manipulation. Data definition statements
are used to define or create new models and structures. Data manipulation statements are
used to work with existing models, for example to train them and to query them for
predictions.
14. Define Rollup and cube.

Custom rollup operators provide a simple way of controlling how a member is rolled up
into its parent's value. The rollup uses the contents of a column as the custom rollup
operator for each member, and that operator is used when evaluating the value of the
member's parent.
If a cube has multiple custom rollup formulas and custom rollup members, the formulas are
resolved in the order in which the dimensions were added to the cube.
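The idea of a per-member rollup operator can be sketched as follows. The operator
symbols used here ("+" to add, "-" to subtract, "~" to ignore) and the member names are
assumptions made for this illustration; the operators actually available depend on the
OLAP server in use.

```python
def roll_up(children):
    """Combine child members into a parent value.

    children is a list of (name, value, operator) tuples, where the operator
    column says how that member contributes to its parent.
    """
    total = 0
    for name, value, operator in children:
        if operator == "+":
            total += value          # member is added to the parent
        elif operator == "-":
            total -= value          # member is subtracted from the parent
        elif operator == "~":
            continue                # member is ignored in the rollup
        else:
            raise ValueError(f"unknown rollup operator {operator!r} for {name}")
    return total


# Hypothetical account members under a parent called "Net Income".
net_income = roll_up([
    ("Revenue", 1000, "+"),
    ("Expenses", 600, "-"),
    ("Headcount", 25, "~"),
])
print(net_income)
```

Without the operator column every child would simply be summed, which would give a
meaningless parent value for members such as expenses or headcount.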
15. Differentiate between Data Mining and Data warehousing.

Data warehousing is merely extracting data from different sources, cleaning the data and
storing it in the warehouse, whereas data mining aims to examine or explore the data
using queries. These queries can be fired on the data warehouse. Exploring the data
through data mining helps in reporting, planning strategies, finding meaningful patterns,
etc.
E.g., a data warehouse of a company stores all the relevant information about projects
and employees. Using data mining, one can use this data to generate different reports,
such as profits generated.
16. What is Discrete and Continuous data in the data mining world?

Discrete data can be considered as defined or finite data, e.g. mobile numbers, gender.
Continuous data can be considered as data that changes continuously and in an
ordered fashion, e.g. age.
17. What is a Decision Tree Algorithm?

A decision tree is a tree in which every node is either a leaf node or a decision node.
The tree takes an object as input and outputs a decision. Each path from the root node
to a leaf node combines the tests along it using AND, OR, or both. The tree is
constructed using the regularities of the data. The decision tree is not affected by
Automatic Data Preparation.
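The way a finished tree turns an input object into a decision can be sketched as below.
The tree here is written by hand rather than learned from data, and the attribute names
(outlook, humidity, windy) are hypothetical; WEKA builds the tree for you in the lab
experiment.

```python
# A hand-built tree: dictionaries are decision nodes, plain strings are leaf nodes.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity", "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy": {"attribute": "windy", "branches": {True: "no", False: "yes"}},
    },
}


def classify(node, record):
    """Follow one root-to-leaf path chosen by the record's attribute values."""
    while isinstance(node, dict):
        value = record[node["attribute"]]
        node = node["branches"][value]
    return node


decision = classify(tree, {"outlook": "sunny", "humidity": "normal", "windy": False})
print(decision)
```

Each path is a conjunction of tests, for example outlook = sunny AND humidity = normal,
and the leaf at the end of the path is the decision.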
18. What is Naïve Bayes Algorithm?

The Naïve Bayes algorithm is used to generate mining models. These models help to
identify relationships between the input columns and the predictable columns. This
algorithm can be used in the initial stage of exploration. The algorithm calculates the
probability of every state of each input column given each possible state of the
predictable column. After the model is made, the results can be used for exploration and
for making predictions.
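The probability calculation described above can be sketched with simple counting. The
training cases and column names are hypothetical, and this minimal version applies no
smoothing, so an input state never seen with a class gives that class a score of zero.

```python
from collections import Counter, defaultdict

# Hypothetical cases: (input columns, state of the predictable column).
training = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "overcast", "windy": "no"}, "play"),
    ({"outlook": "rainy", "windy": "no"}, "play"),
]


def train(examples):
    """Count how often each input state occurs with each predictable state."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)  # (column, label) -> Counter of input states
    for inputs, label in examples:
        for column, value in inputs.items():
            value_counts[(column, label)][value] += 1
    return class_counts, value_counts


def predict(model, inputs):
    """Score each class as P(class) times the product of P(input state | class)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total
        for column, value in inputs.items():
            score *= value_counts[(column, label)][value] / count
        scores[label] = score
    return max(scores, key=scores.get), scores


model = train(training)
label, scores = predict(model, {"outlook": "sunny", "windy": "no"})
print(label, scores)
```

The "naive" part is the multiplication step: each input column is treated as independent
of the others once the class is known.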
19. Explain clustering algorithm.

A clustering algorithm is used to group data with similar characteristics into sets
called clusters. These clusters help in making faster decisions and in exploring data.
The algorithm first identifies relationships in a dataset, following which it generates
a series of clusters based on those relationships. The process of creating clusters is
iterative. The algorithm redefines the groupings to create clusters that better
represent the data.
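The iterative regrouping can be sketched with k-means on one-dimensional data. K-means
is used here only as one familiar example of such a procedure; the text above does not
name a specific algorithm, and the points and starting centres are hypothetical.

```python
def k_means_1d(points, centers, max_iterations=100):
    """Repeatedly assign points to the nearest centre, then move each centre to its mean."""
    clusters = [[] for _ in centers]
    for _ in range(max_iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centre.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # groupings no longer change
        centers = new_centers
    return centers, clusters


centers, clusters = k_means_1d([1, 2, 3, 10, 11, 12], centers=[1.0, 2.0])
print(centers)
print(clusters)
```

Even with poorly chosen starting centres, the groupings are redefined on each pass until
they stop changing.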
20. Explain Association algorithm in Data mining?

The association algorithm is used for a recommendation engine that is based on market
basket analysis. This engine suggests products to customers based on what they bought
earlier. The model is built on a dataset containing identifiers. These identifiers are
both for individual cases and for the items that the cases contain. A group of items in
a case is called an item set. The algorithm traverses the data set to find items that
appear together in a case. It then groups into item sets any associated items that
appear in at least the number of cases specified by the MINIMUM_SUPPORT parameter.
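The role of a minimum support threshold can be sketched by counting item sets directly.
This brute-force count is not Apriori or FP-Growth (it does no pruning), and the
shopping cases and the frequent_itemsets helper are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical cases: each case is the set of items bought together.
cases = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]


def frequent_itemsets(cases, minimum_support, max_size=2):
    """Return item sets (up to max_size items) found in at least minimum_support cases."""
    counts = Counter()
    for items in cases:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(items), size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= minimum_support}


frequent = frequent_itemsets(cases, minimum_support=2)
print(frequent)
```

Raising the threshold to 3 on this data leaves only the single items, since no pair
appears together in three cases.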
21. What are the goals of data mining?

Prediction, identification, classification and optimization
22. Is data mining an independent subject?
No, it is an interdisciplinary subject. It includes database technology, visualization,
machine learning, pattern recognition, algorithms, etc.
23. What are the different types of databases?
Relational database, data warehouse and transactional database.
24. What are the data mining functionalities?

Mining frequent patterns, association rules, classification and prediction, clustering,
evolution analysis and outlier analysis.
25. What are the issues in data mining?

Issues in mining methodology, performance issues, user interaction issues, issues
arising from different data sources and data types, etc.
26. List some applications of data mining.
Agriculture, biological data analysis, call record analysis, decision support systems
(DSS), business intelligence systems, etc.
27. What do you mean by an interesting pattern?

A pattern is said to be interesting if it is 1. easily understood by humans 2. valid 3.
potentially useful 4. novel.
28. Why do we pre-process the data?

To ensure data quality (accuracy, completeness, consistency, timeliness,
believability, interpretability).
29. What are the steps involved in data pre-processing?

Data cleaning, data integration, data reduction, data transformation.
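Two of these steps, cleaning and transformation, can be sketched on a single column.
The age values are hypothetical, filling with the mean and min-max scaling are only one
possible choice for each step, and the normalization assumes the column contains at
least two distinct values.

```python
raw_ages = [25, None, 35, 45, None]  # None marks a missing value


def fill_missing_with_mean(values):
    """Data cleaning: replace missing entries with the mean of the known ones."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Data transformation: rescale values linearly into the range 0 to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


cleaned = fill_missing_with_mean(raw_ages)
normalized = min_max_normalize(cleaned)
print(cleaned)
print(normalized)
```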
30. What is a distributed data warehouse?

A distributed data warehouse shares data across multiple data repositories for the
purpose of OLAP operations.