

Chapter 3. Data Warehouse and OLAP Technology: An Overview

3.7 Exercises
1. State why, for the integration of multiple heterogeneous information sources, many companies in industry prefer the update-driven approach (which constructs and uses data warehouses) rather than the query-driven approach (which applies wrappers and integrators). Describe situations in which the query-driven approach is preferable to the update-driven approach.
Answer:
For decision-making queries and frequently asked queries, the update-driven approach is preferable. This is because expensive data integration and aggregate computation are done before query processing time. For the data collected in multiple heterogeneous databases to be used in decision-making processes, the data must be integrated and summarized, with the semantic heterogeneity problems among the multiple databases analyzed and solved. If the query-driven approach is employed, these queries will be translated into multiple (often complex) queries for each individual database. The translated queries will compete for resources with the activities at the local sites, thus degrading their performance. In addition, these queries will generate a complex answer set, which will require further filtering and integration. Thus, the query-driven approach is, in general, inefficient and expensive. The update-driven approach employed in data warehousing is faster and more efficient, since most of the work required by the queries can be done off-line.
For queries that are used rarely, reference the most current data, and/or do not require aggregations, the query-driven approach is preferable to the update-driven approach. In such cases, it may not be justifiable for an organization to pay the heavy expense of building and maintaining a data warehouse if only a small number of databases, or relatively small databases, are used, or if the queries rely on current data, since the data warehouse does not contain the most current information.

2. Briefly compare the following concepts. You may use an example to explain your point(s).

(a) Snowflake schema, fact constellation, starnet query model

(b) Data cleaning, data transformation, refresh

(c) Discovery-driven cube, multifeature cube, virtual warehouse


Answer:
(a) Snowflake schema, fact constellation, starnet query model

The snowflake schema and fact constellation are both variants of the star schema model, which consists of a fact table and a set of dimension tables; the snowflake schema contains some normalized dimension tables, whereas the fact constellation contains a set of fact tables that share some common dimension tables. A starnet query model is a query model (not a schema model) that consists of a set of radial lines emanating from a central point, where each radial line represents one dimension, each point (called a "footprint") along a line represents a level of that dimension, and each step outward from the center represents stepping down a concept hierarchy of the dimension. The starnet query model, as suggested by its name, is used for querying and provides users with a global view of OLAP operations.
(b) Data cleaning, data transformation, refresh
Data cleaningis the process of detecting errors in the data and rectifying them when possible. Data
transformationis the process of converting the data from heterogeneous sources to a unified data
warehouse format or semantics.Refresh is the function propagating the updates from the data sources
to the warehouse.
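For illustration only, here is a minimal sketch of an incremental refresh step, assuming a hypothetical source table visit_source, a warehouse fact table fee_fact, and a refresh_log table that records the time of the last refresh (none of these names appear in the exercise):

-- Copy only the source rows that changed since the last recorded refresh
-- into the warehouse fact table.
insert into fee_fact (day, doctor, patient, count, charge)
select day, doctor, patient, 1, charge
from visit_source
where last_modified > (select last_refresh from refresh_log where table_name = 'fee_fact');

-- Record the new refresh time.
update refresh_log set last_refresh = CURRENT_TIMESTAMP where table_name = 'fee_fact';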
(c) Discovery-driven cube, multifeature cube, virtual warehouse
A discovery-driven cube uses precomputed measures and visual cues to indicate data exceptions at all levels of aggregation, guiding the user in the data analysis process. A multifeature cube computes complex queries involving multiple dependent aggregates at multiple granularities (e.g., to find the total sales for every item having a maximum price, we apply the aggregate function SUM to the tuple set output by the aggregate function MAX). A virtual warehouse is a set of views (containing data warehouse schema, dimension, and aggregate measure definitions) over operational databases.
3. Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where charge is the fee that a doctor charges a patient for a visit.
(a) Enumerate three classes of schemas that are popularly used for modeling data warehouses.
(b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a).
(c) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004?
(d) To obtain the same list, write an SQL query assuming the data is stored in a relational database with the schema fee(day, month, year, doctor, hospital, patient, count, charge).
Answer:
(a) Enumerate three classes of schemas that are popularly used for modeling data warehouses.
Three classes of schemas popularly used for modeling data warehouses are the star schema, the snowflake schema, and the fact constellation schema.
(b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a).
A star schema is shown in Figure 3.1.
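Since the figure is not reproduced here, the DDL below sketches one possible star schema for this warehouse; the key and column names (time_key, doctor_key, patient_key, and so on) are illustrative assumptions, not part of the exercise:

-- One dimension table per dimension; the fact table references each of them
-- and carries the two measures count and charge.
create table time_dim    (time_key int primary key, day date, month int, year int);
create table doctor_dim  (doctor_key int primary key, doctor_name varchar(50), hospital varchar(50));
create table patient_dim (patient_key int primary key, patient_name varchar(50));

create table fee_fact (
    time_key    int references time_dim(time_key),
    doctor_key  int references doctor_dim(doctor_key),
    patient_key int references patient_dim(patient_key),
    count       int,
    charge      decimal(10,2)
);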
(c) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004?
The operations to be performed are:
• Roll-up on time from day to year.
• Slice for time = 2004.
• Roll-up on patient from individual patient to all.
(d) To obtain the same list, write an SQL query assuming the data is stored in a relational database with the schema fee(day, month, year, doctor, hospital, patient, count, charge).

select doctor, SUM(charge)
from fee
where year = 2004
group by doctor
4. Suppose that a data warehouse for Big-University consists of the following four dimensions: student, course, semester, and instructor, and two measures count and avg grade. At the lowest conceptual level (e.g., for a given student, course, semester, and instructor combination), the avg grade measure stores the actual course grade of the student. At higher conceptual levels, avg grade stores the average grade for the given combination.
(a) Draw a snowflake schema diagram for the data warehouse.
(b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up from semester to year) should one perform in order to list the average grade of CS courses for each Big-University student?
(c) If each dimension has five levels (including all), such as "student < major < status < university < all", how many cuboids will this cube contain (including the base and apex cuboids)?
Answer:
(a) Draw a snowflake schema diagram for the data warehouse.
A snowflake schema is shown in Figure 3.2.
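Since the figure is not reproduced here, the DDL below sketches one possible snowflake schema for this warehouse, with the student and course dimensions normalized into separate university and department tables. All key and column names are illustrative assumptions:

-- Normalized dimension tables (the "snowflaking").
create table university_dim (university_key int primary key, university_name varchar(80));
create table department_dim (department_key int primary key, department_name varchar(50));

create table student_dim (
    student_key    int primary key,
    student_name   varchar(50),
    major          varchar(30),
    status         varchar(20),
    university_key int references university_dim(university_key)
);
create table course_dim (
    course_key     int primary key,
    course_name    varchar(50),
    department_key int references department_dim(department_key)
);
create table semester_dim   (semester_key int primary key, semester varchar(10), year int);
create table instructor_dim (instructor_key int primary key, instructor_name varchar(50));

-- Fact table with the two measures count and avg_grade.
create table grade_fact (
    student_key    int references student_dim(student_key),
    course_key     int references course_dim(course_key),
    semester_key   int references semester_dim(semester_key),
    instructor_key int references instructor_dim(instructor_key),
    count          int,
    avg_grade      decimal(4,2)
);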
(b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up from semester to year) should one perform in order to list the average grade of CS courses for each Big-University student?
The specific OLAP operations to be performed are:
• Roll-up on course from course id to department.
• Roll-up on student from student id to university.
• Dice on course, student with department = "CS" and university = "Big-University".

• Drill-down on student from university to student name.
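By analogy with Exercise 3(d), the same list could also be obtained with a single SQL query if the data were stored in a flattened relation; the relation and column names below (grade, department, university, student_name) are assumptions made for illustration:

select student_name, AVG(avg_grade)
from grade
where department = 'CS' and university = 'Big-University'
group by student_name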


(c) If each dimension has five levels (including all), such as student < major < status < university < all, how many cuboids will this cube contain (including the base and apex cuboids)?
This cube will contain 5^4 = 625 cuboids.
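More generally, if dimension i has L_i levels not counting the virtual top level all, an n-dimensional cube contains (L_1 + 1) × (L_2 + 1) × ... × (L_n + 1) cuboids, since each cuboid fixes one level (possibly all) per dimension. Here each dimension contributes 4 + 1 = 5 choices, giving 5 × 5 × 5 × 5 = 5^4 = 625.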
5. Suppose that a data warehouse consists of the four dimensions date, spectator, location, and game, and the two measures count and charge, where charge is the fare that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate.
(a) Draw a star schema diagram for the data warehouse.
(b) Starting with the base cuboid [date, spectator, location, game], what specific OLAP operations should one perform in order to list the total charge paid by student spectators at GM Place in 2004?
(c) Bitmap indexing is useful in data warehousing. Taking this cube as an example, briefly discuss advantages and problems of using a bitmap index structure.
Answer:
(a) Draw a star schema diagram for the data warehouse.
A star schema is shown in Figure 3.3.
(b) Starting with the base cuboid [date, spectator, location, game], what specific OLAP operations should one perform in order to list the total charge paid by student spectators at GM Place in 2004?
The specific OLAP operations to be performed are:
• Roll-up on date from date id to year.
• Roll-up on game from game id to all.
• Roll-up on location from location id to location name.
• Roll-up on spectator from spectator id to status.
• Dice with status = "students", location name = "GM Place", and year = 2004.
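As in Exercise 3(d), the same total could be computed with a single SQL query over a flattened relation; the relation name sales and its columns (status, location_name, year, charge) are assumptions made for illustration:

select SUM(charge)
from sales
where status = 'students' and location_name = 'GM Place' and year = 2004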
(c) Bitmap indexing is useful in data warehousing. Taking this cube as an example, briefly discuss advantages and problems of using a bitmap index structure.
Bitmap indexing is advantageous for low-cardinality domains. For example, in this cube, if the dimension location is bitmap indexed, then comparison, join, and aggregation operations over location are reduced to bit arithmetic, which substantially reduces processing time. Furthermore, long strings such as location names are represented by single bits in the index, which leads to a significant reduction in space and I/O. For dimensions with high cardinality, such as date in this example, the vector used to represent the bitmap index could be very long. For example, a 10-year collection of data could result in 3,650 date records, meaning that every tuple in the fact table would require 3,650 bits (approximately 456 bytes) to hold the bitmap index.
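In systems that support bitmap indexes (Oracle, for example), such an index on the low-cardinality location column of the hypothetical sales relation above could be declared as sketched below; the table and column names are assumptions, not part of the exercise:

-- Equality predicates such as location_name = 'GM Place' can then be
-- answered with bitwise AND/OR operations over the stored bit vectors.
create bitmap index sales_location_bix on sales (location_name);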
6. [Tao Cheng] A data warehouse can be modeled by either a star schema or a snowflake schema. Briefly describe the similarities and the differences of the two models, and then analyze their advantages and disadvantages with regard to one another. Give your opinion of which might be more empirically useful and state the reasons behind your answer.
Answer:
They are similar in the sense that both have a fact table as well as some dimension tables. The major difference is that some dimension tables in the snowflake schema are normalized, thereby further splitting the data into additional tables. The advantage of the star schema is its simplicity, which enables efficient processing, although it requires more space. The snowflake schema reduces some redundancy by sharing common tables: the tables are easier to maintain and some space is saved. However, it is less efficient, and the space savings are negligible in comparison with the typical magnitude of the fact table. Therefore, empirically, the star schema is better, simply because nowadays efficiency has higher priority than space, as long as the space requirement is not excessive. Sometimes in industry, to speed up processing, people “denormalize data from a snowflake schema into a star schema” [1]. Another option is that “some practitioners use a snowflake schema to maintain dimensions, and then present users with the same data collapsed into a star” [2].
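As a concrete sketch of this second option, a view can present a normalized (snowflake) dimension to users as a single star-style dimension; the table and column names below reuse the hypothetical snowflake schema from Exercise 4 and are assumptions, not prescribed by the text:

-- Collapse the normalized course/department tables into one dimension view.
create view course_star_dim as
select c.course_key, c.course_name, d.department_name
from course_dim c
join department_dim d on c.department_key = d.department_key;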
Some references for the answer to this question:
[1] Oracle Tip: Understand the difference between star and snowflake schemas in OLAP. http://builder.com.com/5100-6388-5221690.html
[2] Star vs. Snowflake Schemas.
