Chapter 3
An Overview
3.7 Exercises
1. State why, for the integration of multiple heterogeneous information sources, many companies in industry
prefer the update-driven approach (which constructs and uses data warehouses), rather than the query-driven
approach (which applies wrappers and integrators). Describe situations where the query-driven approach is
preferable over the update-driven approach.
Answer:
For decision-making queries and frequently asked queries, the update-driven approach is
preferable. This is because expensive data integration and aggregate computation are done before
query processing time. In order for the data collected in multiple heterogeneous databases to be used in
decision-making processes, data must be integrated and summarized with the semantic heterogeneity
problems among multiple databases analyzed and solved. If the query-driven approach is employed,
these queries will be translated into multiple (often complex) queries for each individual database. The
translated queries will compete for resources with the activities at the local sites, thus degrading their
performance. In addition, these queries will generate a complex answer set, which will require further
filtering and integration. Thus, the query-driven approach is, in general, inefficient and expensive. The
update-driven approach employed in data warehousing is faster and more efficient, since most of the
work that the queries need can be done off-line, before query time.
For queries that are used rarely, reference the most current data, and/or do not require aggregations,
the query-driven approach would be preferable over the update-driven approach. In this case, it may not
be justifiable for an organization to pay the heavy expense of building and maintaining a data warehouse
if only a small number of databases, or only relatively small databases, are used; or if the queries rely on
the current data, since data warehouses do not contain the most current information.
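The trade-off described above can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the two toy "sources" (`source_a`, `source_b`), their field names, and the helper functions are assumptions made for illustration, not part of any real integration system. The query-driven function integrates the sources on every call, while the update-driven function reads a summary that was integrated and aggregated once, ahead of time.

```python
# Hypothetical operational sources that name the same concepts differently.
source_a = [{"region": "east", "amount": 10}, {"region": "west", "amount": 5}]
source_b = [{"area": "east", "total": 7}, {"area": "east", "total": 3}]

def integrate():
    """Resolve the naming differences and return unified (region, amount) pairs."""
    unified = [(r["region"], r["amount"]) for r in source_a]
    unified += [(r["area"], r["total"]) for r in source_b]
    return unified

def query_driven(region):
    # Integration and aggregation happen at query time, on every query,
    # so the answer always reflects the current source data.
    return sum(amount for reg, amount in integrate() if reg == region)

def build_warehouse():
    # Update-driven: integrate and aggregate once, before any query arrives.
    summary = {}
    for reg, amount in integrate():
        summary[reg] = summary.get(reg, 0) + amount
    return summary

warehouse = build_warehouse()

def update_driven(region):
    # A query is a cheap lookup into the precomputed summary.
    return warehouse.get(region, 0)

before = (query_driven("east"), update_driven("east"))

# A new record arrives at a source after the warehouse was built.
source_a.append({"region": "east", "amount": 1})
after = (query_driven("east"), update_driven("east"))
```

In this sketch the two approaches agree until a source changes; afterwards the precomputed summary is stale until it is rebuilt, which mirrors the point that a warehouse does not hold the most current information.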
2. Briefly compare the following concepts. You may use an example to explain your point(s).
(a) Snowflake schema, fact constellation, starnet query model
The snowflake schema and fact constellation are both variants of the star schema model, which consists of
a fact table and a set of dimension tables; the snowflake schema contains some normalized dimension
tables, whereas the fact constellation contains a set of fact tables that share some common dimension
tables. A starnet query model is a query model (not a schema model), which consists of a set of radial lines
emanating from a central point, where each radial line represents one dimension and each point (called
a “footprint”) along the line represents a level of the dimension, and each step going out from the center
represents the stepping down of a concept hierarchy of the dimension. The starnet query model, as
suggested by its name, is used for querying and provides users with a global view of OLAP operations.
(b) Data cleaning, data transformation, refresh
Data cleaning is the process of detecting errors in the data and rectifying them when possible. Data
transformation is the process of converting the data from heterogeneous sources to a unified data
warehouse format or semantics. Refresh is the function propagating the updates from the data sources
to the warehouse.
(c) Discovery-driven cube, multi-feature cube, virtual warehouse
A discovery-driven cube uses precomputed measures and visual cues to indicate data exceptions at all
levels of aggregation, guiding the user in the data analysis process. A multi-feature cube computes
complex queries involving multiple dependent aggregates at multiple granularities (e.g., to find the total
sales for every item having a maximum price, we need to apply the aggregate function SUM to the tuple
set output by the aggregate function MAX). A virtual warehouse is a set of views (containing data
warehouse schema, dimension, and aggregate measure definitions) over operational databases.
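The dependent-aggregate example given for the multi-feature cube (SUM applied to the tuples selected by MAX) can be sketched in a few lines of Python. The toy `rows` data and the function name `total_sales_at_max_price` are hypothetical and chosen only for illustration; the sketch shows the dependency between the two aggregates, not how a cube engine would compute it.

```python
from collections import defaultdict

# Hypothetical sales tuples: (item, price, sales)
rows = [
    ("tv", 500, 10),
    ("tv", 700, 3),
    ("tv", 700, 4),
    ("radio", 50, 20),
    ("radio", 40, 15),
]

def total_sales_at_max_price(rows):
    # First aggregate: MAX(price) for each item.
    max_price = {}
    for item, price, _sales in rows:
        if item not in max_price or price > max_price[item]:
            max_price[item] = price

    # Second, dependent aggregate: SUM(sales) over only those tuples
    # whose price equals the item's maximum price.
    totals = defaultdict(int)
    for item, price, sales in rows:
        if price == max_price[item]:
            totals[item] += sales
    return dict(totals)

result = total_sales_at_max_price(rows)
```

The second pass cannot start until the first has finished, which is what makes the aggregates dependent rather than independent.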
3. Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two
measures count and charge, where charge is the fee that a doctor charges a patient for a visit.
(a) Enumerate three classes of schemas that are popularly used for modeling data warehouses.
(b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a).
(c) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed
in order to list the total fee collected by each doctor in 2004?
(d) To obtain the same list, write an SQL query assuming the data is stored in a relational database with
the schema fee (day, month, year, doctor, hospital, patient, count, charge).
Answer:
(a) Enumerate three classes of schemas that are popularly used for modeling data warehouses.
Three classes of schemas popularly used for modeling data warehouses are the star schema, the snowflake
schema, and the fact constellations schema.
(b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a).
A star schema is shown in Figure 3.1.
(c) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed
in order to list the total fee collected by each doctor in 2004?
The operations to be performed are:
• Roll-up on time from day to year.
• Slice for time = 2004.
• Roll-up on patient from individual patient to all.
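The three operations listed above can be traced on a tiny in-memory cube. This is only a sketch under stated assumptions: the base-cuboid cells in `base`, the ISO-style day strings, and the three helper functions are hypothetical, and each cell holds just the charge measure.

```python
from collections import defaultdict

# Hypothetical base-cuboid cells: (day, doctor, patient) -> charge
base = {
    ("2004-01-05", "Dr. A", "p1"): 100.0,
    ("2004-03-10", "Dr. A", "p2"): 150.0,
    ("2004-07-01", "Dr. B", "p1"): 80.0,
    ("2003-12-31", "Dr. B", "p3"): 60.0,
}

def roll_up_time_to_year(cube):
    # Roll-up on time: replace each day by its year and sum the charges.
    out = defaultdict(float)
    for (day, doctor, patient), charge in cube.items():
        out[(int(day[:4]), doctor, patient)] += charge
    return dict(out)

def slice_year(cube, year):
    # Slice: fix the time dimension to one year and drop it from the key.
    return {
        (doctor, patient): charge
        for (y, doctor, patient), charge in cube.items()
        if y == year
    }

def roll_up_patient_to_all(cube):
    # Roll-up on patient to "all": aggregate the patient dimension away.
    out = defaultdict(float)
    for (doctor, _patient), charge in cube.items():
        out[doctor] += charge
    return dict(out)

fees_2004 = roll_up_patient_to_all(slice_year(roll_up_time_to_year(base), 2004))
```

With this toy data, `fees_2004` maps each doctor to the total fee collected in 2004, and the 2003 visit is excluded by the slice.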
(d) To obtain the same list, write an SQL query assuming the data is stored in a relational database with
the schema fee (day, month, year, doctor, hospital, patient, count, charge).
select doctor, SUM(charge)
from fee
where year = 2004
group by doctor
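The query can be tried against an in-memory SQLite database using Python's standard sqlite3 module. The sample rows, the column types, and the added ORDER BY clause (included only to make the output order deterministic) are assumptions for this sketch; the column name count is quoted because it coincides with the name of an SQL aggregate function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column types are assumed; the exercise gives only the column names.
conn.execute(
    'CREATE TABLE fee (day INTEGER, month INTEGER, year INTEGER, doctor TEXT, '
    'hospital TEXT, patient TEXT, "count" INTEGER, charge REAL)'
)

# Hypothetical sample visits.
rows = [
    (5, 1, 2004, "Dr. A", "H1", "p1", 1, 100.0),
    (10, 3, 2004, "Dr. A", "H1", "p2", 1, 150.0),
    (1, 7, 2004, "Dr. B", "H2", "p1", 1, 80.0),
    (31, 12, 2003, "Dr. B", "H2", "p3", 1, 60.0),
]
conn.executemany("INSERT INTO fee VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# The query from the answer, plus ORDER BY for a stable row order.
result = conn.execute(
    "SELECT doctor, SUM(charge) FROM fee "
    "WHERE year = 2004 GROUP BY doctor ORDER BY doctor"
).fetchall()

conn.close()
```

Here the WHERE clause plays the role of the slice on year, and GROUP BY doctor aggregates over every other column, which corresponds to rolling the remaining dimensions up to all.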
4. Suppose that a data warehouse for Big-University consists of the following four dimensions: student, course,
semester, and instructor, and two measures count and avg grade. When at the lowest conceptual level (e.g.,
for a given student, course, semester, and instructor combination), the avg grade measure stores the
actual course grade of the student. At higher conceptual levels, avg grade stores the average grade for
the given combination.
(a) Draw a snowflake schema diagram for the data warehouse.
(b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up
from semester to year) should one perform in order to list the average grade of CS courses for each
Big-University student?
(c) If each dimension has five levels (including all), such as “student < major < status < university < all”,
how many cuboids will this cube contain (including the base and apex cuboids)?
Answer:
(a) Draw a snowflake schema diagram for the data warehouse.
A snowflake schema is shown in Figure 3.2.
(b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up
from semester to year) should one perform in order to list the average grade of CS courses for each
Big-University student?
The specific OLAP operations to be performed are:
• Roll-up on course from course id to department.
• Roll-up on student from student id to university.
• Dice on course, student with department = “CS” and university = “Big-University”.