Data Engineering Lab
Prepared by
Sk. John Sydulu, Assistant Professor
T. Lavanya, Assistant Professor
COLLEGE VISION:
To emerge as an Institute of Excellence for Engineering and Technology and provide
world-class education and research opportunities to the students, catering to the needs of society.
COLLEGE MISSION:
DEPARTMENT VISION:
To produce professionally competent, research oriented and socially sensitive engineers and
technocrats in the emerging technologies.
DEPARTMENT MISSION:
DM1: State-of-the-art laboratories to meet the needs of continuous change.
DM2: Provide a research environment to address societal issues.
DM3: Facilitating collaborations/MoUs towards emerging technologies.
PEO’s:
PEO-1: Graduates of the computer science program will aim for a successful professional career
and actively engage in applying new ideas/technologies as the field evolves.
PEO-2: Graduates can analyze real life problems and design computing solutions by applying
computer engineering theory and practices followed.
PEO-3: Graduates shall pursue higher studies or do research through quality education.
PO’s:
5. MODERN TOOL USAGE: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modelling to complex engineering
activities with an understanding of the limitations.
8. ETHICS: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
12. LIFE-LONG LEARNING: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological change.
PSO’s :
PSO1: Professional Skills: The ability to understand, analyze and develop computer programs
in the areas related to algorithms, system software, multimedia, web design, big data analytics,
and networking for efficient design of computer-based systems of varying complexity.
PSO2: Problem-Solving Skills: The ability to apply standard practices and strategies in software
project development using open-ended programming environments to deliver a quality product
for business success.
PSO3: Successful Career and Entrepreneurship: The ability to employ modern computer
languages, environments, and platforms in creating innovative career paths to be an
entrepreneur, and a zest for higher studies.
1. The data mining process and important issues around data cleaning, pre-processing and
integration.
2. The principal algorithms and techniques used in data mining, such as clustering,
association mining, classification and prediction.
3. Demonstrate understanding of the functionality of the various web mining and web
search components and appreciate the strengths and limitations of various web mining and
web search models.
4. Able to use the tools and techniques employed in data mining for different application
domains.
CO 1 3 2 3
CO 2 2 2 2 2
CO 3 3 2 3
CO 4 2 2 1
CO 5 1 2 3
Additional Experiments
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
DATA ENGINEERING LABORATORY
INDEX
V SYLLABUS
ADDITIONAL PROGRAMS
EXPERIMENT 1 :
AIM: Implement Cube operations
ALGORITHM:
STEP 1: 1. CREATE A TABLE
2. STORE THE DATA IN THE TABLE
STEP 2: WRITE A CONTROL FILE NAMED WITH THE .CTL EXTENSION
STEP 3: CREATE FILENAME.CSV AND WRITE THE DATA INTO IT
STEP 4: AT THE DOS COMMAND PROMPT, EXECUTE THE COMMAND
STEP 5: AT THE SQL COMMAND PROMPT, EXECUTE THE COMMAND
DOG TAMPA 14
TURTLE TAMPA 4
DOG NAPLES 5
TURTLE NAPLES 1
LOAD DATA
INFILE 'Z:\DELAB\EXPT2\PETS.CSV'
INTO TABLE PETS
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"'
(PET_TYPE,STORE,NO)
CAT,MIAMI,18
DOG,MIAMI,16
DOG, yDFGY,17
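The cube operation named in the aim can be illustrated outside Oracle with a minimal Python sketch. It is not the lab's Oracle procedure: it uses the standard csv and sqlite3 modules, loads the four PETS rows listed in the table output above, and emulates the cube by running one GROUP BY per subset of the dimensions, because SQLite has no CUBE clause. The column name qty (in place of NO) and the helper names load_pets and cube are assumptions made for this sketch. In Oracle itself, GROUP BY CUBE(pet_type, store) produces the same set of subtotals in a single query.

```python
import csv
import io
import sqlite3
from itertools import combinations

# The four PETS rows shown above, written as CSV text (an assumption for
# this sketch; the lab loads its own PETS.CSV through the control file).
CSV_TEXT = """DOG,TAMPA,14
TURTLE,TAMPA,4
DOG,NAPLES,5
TURTLE,NAPLES,1
"""


def load_pets(conn, text):
    """Create the pets table and load comma-separated rows into it."""
    # The source column NO is called qty here (hypothetical name).
    conn.execute("CREATE TABLE pets (pet_type TEXT, store TEXT, qty INTEGER)")
    # Same field rules as the control file: comma-terminated, optional quotes.
    reader = csv.reader(io.StringIO(text), delimiter=",", quotechar='"')
    rows = [(pet.strip(), store.strip(), int(qty)) for pet, store, qty in reader]
    conn.executemany("INSERT INTO pets VALUES (?, ?, ?)", rows)


def cube(conn, table, dims, measure):
    """Return SUM(measure) for every subset of dims; NULL stands for 'all'."""
    rows = []
    for size in range(len(dims), -1, -1):
        for group in combinations(dims, size):
            # Dimensions left out of this grouping are reported as NULL.
            columns = ", ".join(d if d in group else "NULL" for d in dims)
            sql = f"SELECT {columns}, SUM({measure}) FROM {table}"
            if group:
                sql += " GROUP BY " + ", ".join(group)
            rows.extend(conn.execute(sql).fetchall())
    return rows


conn = sqlite3.connect(":memory:")
load_pets(conn, CSV_TEXT)
cube_rows = cube(conn, "pets", ["pet_type", "store"], "qty")
for row in cube_rows:
    print(row)
```

With two dimensions the sketch runs four groupings: the detail rows, one subtotal per pet type, one subtotal per store, and the grand total.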
ALGORITHM:
COMPUTER 05-MAR-07 3
TV 05-MAR-07 1
HAP 03-JAN-07 8
COMPUTER 03-JAN-07 7
TV 08-APR-09 3
TV 20-JUN-06 6
COMPUTER 05-MAR-09 3
HAP 01-FEB-09 9
TV 01-FEB-09 3
HAP 20-JUN-08 1
HAP 01-FEB-06 5
TV 20-JUN-09 3
COMPUTER 20-JUN-06 5
HAP 01-FEB-08 8
HAP 05-MAR-09 6
COMPUTER 03-JAN-09 2
COMPUTER 20-JUN-08 2
TV 20-JUN-08 4
ALGORITHM:
STEP 1. CREATE THE TABLES ITEM, BRANCH, PURCHASES, WORKS_AT AND ITEMS_SOLD
STEP 2. JOIN ON ITEM.ITEM_ID = ITEMS_SOLD.ITEM_ID AND BRANCH.BRANCH_ID = WORKS_AT.BRANCH_ID
STEP 3. JOIN ON PURCHASES.TRANS_ID = ITEMS_SOLD.TRANS_ID
STEP 4. EXECUTE THE QUERY GIVEN BELOW AT THE SQL COMMAND PROMPT
3-dimensional cuboid:
SELECT
item.item_type, branch.branch_city, purchases.date1, sum(items_sold.qty)
FROM item, branch, purchases, items_sold, works_at
WHERE item.item_id=items_sold.item_id AND
branch.branch_id=works_at.branch_id AND
purchases.trans_id=items_sold.trans_id
GROUP BY item.item_type,branch.branch_city,purchases.date1;
Output:
TV CHENNAI 05-MAR-07 2
TV HYD 05-MAR-08 8
TV BNG 05-MAR-08 4
TV GUNTUR 08-APR-09 6
TV HYD 20-JUN-06 24
TV BNG 20-JUN-06 12
TV GUNTUR 20-JUN-06 12
HAP GUNTUR 20-JUN-06 4
TV CHENNAI 03-JAN-08 8
TV BNG 01-FEB-08 8
ALGORITHM:
STEP 1: SELECT ITEM, SOLD_QTY FROM THE ITEMS TABLE CREATED
STEP 2: SELECT PURCHASE_DATE, SOLD_QTY FROM PURCHASES
STEP 3: SELECT PURCHASE_CUST_ID FROM PURCHASES
STEP 4: SELECT THE SUM OF ITEMS_QTY FROM ITEMS_SOLD
STEP 5: EXECUTE THE QUERY GIVEN BELOW AT THE SQL COMMAND PROMPT
1-dimensional cuboid:
i) SELECT item.item_type ,sum(items_sold.qty) FROM item,items_sold
WHERE item.item_id=items_sold.item_id GROUP BY item.item_type;
Output:
ITEM_TYPE SUM(ITEMS_SOLD.QTY)
COMPUTER 65
TV 70
HAP 105
3 rows returned
03-JAN-06 6
05-MAR-09 18
08-APR-07 11
20-JUN-06 13
20-JUN-07 6
20-JUN-08 7
01-FEB-06 11
03-JAN-09 6
03-JAN-07 20
08-APR-09 13
08-APR-06 14
01-FEB-09 14
05-MAR-08 13
05-MAR-07 10
08-APR-08 11
20-JUN-09 6
03-JAN-08 19
01-FEB-08 17
05-MAR-06 13
20 rows returned
Output:
CUST_ID SUM(ITEMS_SOLD.QTY)
C4 49
C5 32
C2 54
C1 51
C3 54
5 rows returned
Apex cuboid
Output:
SUM(ITEMS_SOLD.QTY)
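As a rough consistency check on the cuboids, each 1-dimensional cuboid should roll up to the same grand total that the apex cuboid reports, assuming every row of items_sold matches an item and a purchase. The short Python sketch below only adds up the item-type and customer outputs printed earlier in this experiment; the dictionary names are chosen for this illustration.

```python
# 1-D cuboid by item type, copied from the output listed above.
by_item_type = {"COMPUTER": 65, "TV": 70, "HAP": 105}

# 1-D cuboid by customer, copied from the output listed above.
by_customer = {"C4": 49, "C5": 32, "C2": 54, "C1": 51, "C3": 54}

# Rolling a 1-D cuboid up over its only dimension gives the apex total.
apex_from_items = sum(by_item_type.values())
apex_from_customers = sum(by_customer.values())

print(apex_from_items, apex_from_customers)
print("cuboids agree:", apex_from_items == apex_from_customers)
```

If the two totals differed, it would suggest that some rows are being dropped or duplicated by one of the joins.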
ALGORITHM:
STEP 1: SELECT ITEM_TYPE, PURCHASE_DATE, SOLD_QTY
STEP 2: SELECT BRANCH_CITY, PURCHASE_DATE, SOLD_QTY
STEP 3: EXECUTE THE QUERY GIVEN BELOW AT THE SQL COMMAND PROMPT
2-dimensional cuboid:
a)
SELECT item.item_type ,purchases.date1,sum(items_sold.qty) FROM
item,purchases,items_sold
WHERE item.item_id=items_sold.item_id AND
purchases.trans_id=items_sold.trans_id GROUP BY
item.item_type,purchases.date1;
Output:
ITEM_TYPE DATE1 SUM(ITEMS_SOLD.QTY)
COMPUTER 05-MAR-07 3
TV 05-MAR-07 1
HAP 03-JAN-07 8
COMPUTER 03-JAN-07 7
TV 08-APR-09 3
TV 20-JUN-06 6
COMPUTER 05-MAR-09 3
HAP 01-FEB-09 9
TV 01-FEB-09 3
HAP 20-JUN-08 1
HAP 01-FEB-06 5
TV 20-JUN-09 3
COMPUTER 20-JUN-06 5
HAP 01-FEB-08 8
HAP 05-MAR-09 6
COMPUTER 03-JAN-09 2
COMPUTER 20-JUN-08 2
TV 20-JUN-08 4
TV 01-FEB-07 2
TV 05-MAR-08 2
HAP 03-JAN-08 8
TV 03-JAN-08 4
COMPUTER 08-APR-06 2
60 rows returned
b)
SELECT branch.branch_city,purchases.date1,sum(items_sold.qty) FROM
branch,purchases,items_sold,works_at
WHERE branch.branch_id=works_at.branch_id AND
purchases.trans_id=items_sold.trans_id GROUP BY
branch.branch_city,purchases.date1;
Output:
GUNTUR 05-MAR-08 26
CHENNAI 08-APR-07 22
CHENNAI 08-APR-06 28
CHENNAI 08-APR-08 22
CHENNAI 03-JAN-06 12
HYD 01-FEB-07 48
BNG 08-APR-06 28
BNG 20-JUN-09 12
GUNTUR 20-JUN-08 14
GUNTUR 03-JAN-09 12
GUNTUR 20-JUN-07 12
GUNTUR 08-APR-06 28
GUNTUR 20-JUN-06 26
GUNTUR 03-JAN-07 40
GUNTUR 05-MAR-07 20
CHENNAI 08-APR-09 26
CHENNAI 05-MAR-08 26
CHENNAI 03-JAN-07 40
CHENNAI 05-MAR-07 20
CHENNAI 01-FEB-06 22
HYD 20-JUN-08 28
HYD 05-MAR-06 52
HYD 03-JAN-09 24
HYD 05-MAR-08 52
HYD 08-APR-08 44
HYD 03-JAN-06 24
BNG 20-JUN-08 14
BNG 08-APR-07 22
80 rows returned
insert into items_sold values('T117','I2',2);
insert into items_sold values('T117','I3',8);
insert into items_sold values('T118','I1',3);
insert into items_sold values('T118','I2',2);
insert into items_sold values('T118','I3',6);
insert into items_sold values('T119','I1',4);
insert into items_sold values('T119','I2',2);
insert into items_sold values('T119','I3',1);
Star Schemas
The star schema is the simplest data warehouse schema. It is called a star schema
because the entity-relationship diagram of this schema resembles a star, with points radiating
from a central table. The center of the star consists of a large fact table and the points of the
star are the dimension tables.
A star schema is characterized by one or more very large fact tables that contain the
primary information in the data warehouse, and a number of much smaller dimension tables
(or lookup tables), each of which contains information about the entries for a particular
attribute in the fact table.
A star query is a join between a fact table and a number of dimension tables. Each
dimension table is joined to the fact table using a primary key to foreign key join, but the
dimension tables are not joined to each other. The cost-based optimizer recognizes star
queries and generates efficient execution plans for them.
A typical fact table contains keys and measures. For example, in the sh sample
schema, the fact table, sales, contains the measures quantity_sold, amount, and cost, and
the keys cust_id, time_id, prod_id, channel_id, and promo_id. The dimension tables are
customers, times, products, channels, and promotions. The products dimension table, for
example, contains information about each product number that appears in the fact table.
Implementation of k-means algorithm using ‘c’.
A star join is a primary key to foreign key join of the dimension tables to a fact table.
A schema is a collection of database objects, including tables, views, indexes, and
synonyms.
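A star query of the kind described above can be sketched as follows. This is a minimal Python illustration using the standard sqlite3 module rather than Oracle; the table names sales, products and channels and the measure names follow the text, while the descriptive columns prod_name and channel_desc and all data values are assumptions made for this sketch. Each dimension table is joined to the fact table through its primary key, and the dimension tables are never joined to each other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per product / per channel (illustrative data).
CREATE TABLE products (prod_id INTEGER PRIMARY KEY, prod_name TEXT);
CREATE TABLE channels (channel_id INTEGER PRIMARY KEY, channel_desc TEXT);

-- Fact table: foreign keys to the dimensions plus numeric measures.
CREATE TABLE sales (
    prod_id       INTEGER REFERENCES products(prod_id),
    channel_id    INTEGER REFERENCES channels(channel_id),
    quantity_sold INTEGER,
    amount        REAL
);

INSERT INTO products VALUES (1, 'TV'), (2, 'COMPUTER');
INSERT INTO channels VALUES (10, 'STORE'), (20, 'INTERNET');
INSERT INTO sales VALUES
    (1, 10, 2,  800.0),
    (1, 20, 1,  400.0),
    (2, 10, 3, 2100.0),
    (2, 20, 4, 2800.0);
""")

# Star query: the fact table is joined to each dimension on its key.
star_rows = conn.execute("""
    SELECT p.prod_name, c.channel_desc,
           SUM(s.quantity_sold), SUM(s.amount)
    FROM sales s
    JOIN products p ON p.prod_id = s.prod_id
    JOIN channels c ON c.channel_id = s.channel_id
    GROUP BY p.prod_name, c.channel_desc
    ORDER BY p.prod_name, c.channel_desc
""").fetchall()

for row in star_rows:
    print(row)
```

Only two of the five dimensions are modelled here to keep the sketch short; the remaining ones would be joined in the same way.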
Note:
Oracle Corporation recommends that you choose a star schema over a snowflake schema
unless you have a clear reason not to.
Star schemas are widely supported by a large number of business intelligence tools, which
may anticipate or even require that the data-warehouse schema contain dimension tables.
Star schemas are used for both simple data marts and very large data warehouses.
Figure: Graphical representation of a star schema.
Snowflake Schemas
The snowflake schema is a more complex data warehouse model than a star schema,
and is a type of star schema. It is called a snowflake schema because the diagram of the
schema resembles a snowflake.
For example, a product dimension table in a star schema might be normalized into a
products table, a product_category table, and a product_manufacturer table in a snowflake
schema. While this saves space, it increases the number of dimension tables and requires
more foreign key joins. The result is more complex queries and reduced query performance.
The figure presents a graphical representation of a snowflake schema.
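The extra join that a snowflake schema introduces can be seen in a small Python sketch, again using the standard sqlite3 module rather than Oracle. The products and product_category table names come from the text; the column names and the data are assumptions made for this illustration, and the product_manufacturer table is left out to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The product dimension is normalized: category details move to their own table.
CREATE TABLE product_category (
    category_id   INTEGER PRIMARY KEY,
    category_name TEXT
);
CREATE TABLE products (
    prod_id     INTEGER PRIMARY KEY,
    prod_name   TEXT,
    category_id INTEGER REFERENCES product_category(category_id)
);
CREATE TABLE sales (
    prod_id       INTEGER REFERENCES products(prod_id),
    quantity_sold INTEGER
);

INSERT INTO product_category VALUES (1, 'ELECTRONICS'), (2, 'APPLIANCES');
INSERT INTO products VALUES (1, 'TV', 1), (2, 'COMPUTER', 1), (3, 'WASHER', 2);
INSERT INTO sales VALUES (1, 6), (2, 5), (3, 4), (1, 3);
""")

# Totals per category now need two joins: fact -> products -> product_category.
snowflake_rows = conn.execute("""
    SELECT pc.category_name, SUM(s.quantity_sold)
    FROM sales s
    JOIN products p          ON p.prod_id = s.prod_id
    JOIN product_category pc ON pc.category_id = p.category_id
    GROUP BY pc.category_name
    ORDER BY pc.category_name
""").fetchall()

print(snowflake_rows)
```

In a star schema the category name would be a column of the products table itself, so the same report would need only one join.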
Theory:
Fact Tables
A fact table typically has two types of columns: those that contain numeric facts
(often called measurements), and those that are foreign keys to dimension tables.
A fact table contains either detail-level facts or facts that have been aggregated.
Fact tables that contain aggregated facts are often called summary tables. A fact table
usually contains facts with the same level of aggregation.
Though most facts are additive, they can also be semi-additive or non-additive.
Additive facts can be aggregated by simple arithmetical addition. A common example of this
is sales. Non-additive facts cannot be added at all. An example of this is averages.
Semi-additive facts can be aggregated along some of the dimensions and not along others. An
example of this is inventory levels, which can be added across products or warehouses but not
across time: summing the levels recorded on successive days does not give a meaningful stock figure.
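The three kinds of facts can be illustrated with a few lines of Python. The store names, warehouse names and figures below are invented purely for this sketch.

```python
# Additive fact: sales amounts, listed as (store, amount).
sales = [("S1", 100.0), ("S1", 300.0), ("S2", 50.0)]

per_store = {}
for store, amount in sales:
    per_store.setdefault(store, []).append(amount)

total_sales = sum(amount for _, amount in sales)
store_totals = {store: sum(vals) for store, vals in per_store.items()}
# Summing the per-store totals gives the overall total again.
additive_ok = sum(store_totals.values()) == total_sales

# Non-additive fact: averages. Averaging the per-store averages does not
# reproduce the true average over all sales rows.
store_avgs = {store: sum(vals) / len(vals) for store, vals in per_store.items()}
avg_of_avgs = sum(store_avgs.values()) / len(store_avgs)
true_avg = total_sales / len(sales)

# Semi-additive fact: inventory level, keyed by (warehouse, day).
inventory = {("W1", 1): 10, ("W1", 2): 12, ("W2", 1): 5, ("W2", 2): 7}
# Adding across warehouses on one day is meaningful ...
stock_on_day2 = sum(q for (_, day), q in inventory.items() if day == 2)
# ... but across days the closing level is used rather than a sum.
w1_closing = inventory[("W1", 2)]

print(additive_ok, avg_of_avgs, true_avg, stock_on_day2, w1_closing)
```

This is why summary tables usually store sums and counts rather than averages: the average can then be recomputed correctly at any level of aggregation.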
You must define a fact table for each star schema. From a modeling standpoint, the
primary key of the fact table is usually a composite key that is made up of all of its foreign
keys.
The figure shows a common example of a sales fact table and the dimension tables customers,
products, promotions, times, and channels.
Procedure:
Step1:- Click the WEKA icon; the WEKA GUI Chooser will appear. Then choose and load the dataset
into WEKA.
Data set:
Visual output:
Procedure:
Step1:- Click the WEKA icon; the WEKA GUI Chooser will appear. Then choose and load the dataset
into WEKA.
Data selection:
OUTPUT:
Procedure:
Step1:- Click the WEKA icon; the WEKA GUI Chooser will appear. Then choose and load the dataset
into WEKA.
Step2:- Choose the Classify tab.
Step5:- Click Choose; it displays many classifier groups. Select trees among them.
Step9:- The result list contains an entry such as "11:37:42 - trees.J48". Right-click it and
select Visualize tree.
Data view:
Visual Output:
5. What is a dimension?
A dimension is something that qualifies a quantity (measure).
For example, saying just "20kg" does not mean anything. But saying "20kg of rice (product)
is sold to Ramesh (customer) on 5th April (date)" gives it a meaningful sense. Here product,
customer and date are dimensions that qualify the measure, 20kg. Dimensions are mutually
independent. Technically speaking, a dimension is a data element that categorizes each item
in a data set into non-overlapping regions.
6. What is a fact?
A fact is something that is quantifiable (or measurable). Facts are typically (but not
always) numerical values that can be aggregated.
7. Briefly state the difference between a data warehouse and a data mart.
A data warehouse is made up of many data marts. A data warehouse contains many subject
areas, whereas a data mart generally focuses on one subject area. For example, a bank's data
warehouse can have one data mart for accounts, one for loans, and so on. This is a high-level
definition.
Metadata is data about data, e.g. if we are receiving any file in a data mart, then metadata
8. What is the difference between a dependent data warehouse and an independent data
warehouse?
A dependent data mart draws its data from a central data warehouse, whereas an independent
data mart is loaded directly from operational systems or external files. There is a third
type of data mart called hybrid, which has source data from operational systems or external
files and from the central data warehouse as well.
9. What are the storage models of OLAP?
ROLAP, MOLAP and HOLAP
E.g. a data warehouse of a company stores all the relevant information of projects and
employees. Using Data mining, one can use this data to generate different reports like
profits generated etc.