DWH Int Questions
Explanatory Note:
Non-volatile means that the data, once loaded into the warehouse, will not get deleted or overwritten later.
Time-variant means that the data is recorded with respect to time, so the warehouse keeps the history of how
values change.
Explanatory Note:
In a department store, when we pay at the check-out counter, the salesperson at the counter keys all the data
into a "Point-Of-Sale" machine. That data is transaction data and the related system is an OLTP system. On
the other hand, the manager of the store might want to view a report on out-of-stock materials, so that he can
place purchase orders for them. Such a report will come out of an OLAP system.
What is ER model?
The ER model is the entity-relationship model, which is designed with the goal of normalizing the data.
What is dimensional modeling?
A dimensional model consists of dimension and fact tables. Fact tables store different transactional measurements
along with the foreign keys from the dimension tables that qualify the data. The goal of a dimensional model is not
to achieve a high degree of normalization but to facilitate easy and fast data retrieval.
What is dimension?
A dimension is something that qualifies a quantity (measure).
If I just say “20kg”, it does not mean anything. But “20kg of Rice (product) is sold to Ramesh (customer) on 5th
April (date)” makes sense. Product, customer and date are the dimensions that qualify the measure.
Dimensions are mutually independent.
Technically speaking, a dimension is a data element that categorizes each item in a data set into non-overlapping
regions.
What is fact?
A fact is something that is quantifiable (or measurable). Facts are typically (but not always) numerical values that
can be aggregated.
Semi-additive measures are those where only a subset of aggregation functions can be applied meaningfully. Take
account balance: a sum() of one account's balance over time does not give a useful result, but the max() or min()
balance might be useful. Or consider a price rate or currency rate: sum is meaningless on a rate, but an average
might be useful.
Additive measures can be used with any aggregation function, such as sum() or avg(). Sales quantity is an example.
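As a rough illustration of the difference, here is a minimal Python sketch. The balances, rates and sales
figures are made-up sample values, and the variable names are assumptions chosen for this example only.

```python
# Hypothetical sample data, invented for illustration.
sales_quantities = [20, 5, 10]               # additive: sales quantity per transaction
a1_daily_balances = [100.0, 150.0, 120.0]    # semi-additive: one account's balance on three days
currency_rates = [1.10, 1.20, 1.30]          # semi-additive in the sense used above: a rate

# Additive measure: a plain sum is meaningful.
total_quantity = sum(sales_quantities)

# Account balance: summing the same account over days is not useful,
# but the highest and lowest balance are.
highest_balance = max(a1_daily_balances)
lowest_balance = min(a1_daily_balances)

# Rate: a sum is meaningless, an average can be useful.
average_rate = sum(currency_rates) / len(currency_rates)
```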
What is Star-schema?
This schema is used in data warehouse models where one centralized fact table references a number of dimension
tables, so that the primary keys from all the dimension tables flow into the fact table (as foreign keys), where the
measures are stored. The entity-relationship diagram of such a model looks like a star, hence the name.
Consider a fact table that stores sales quantity for each product and customer at a certain time. Sales quantity will
be the measure here and keys from the customer, product and time dimension tables will flow into the fact table.
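The sales example can be sketched with Python's built-in sqlite3 module. This is only an illustrative layout:
the table names (dim_product, dim_customer, dim_date, fact_sales), the column names and the sample rows are
all assumptions, not a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Three dimension tables, each with its own primary key.
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, product_name  TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, calendar_date TEXT);

-- The central fact table holds the measure plus one foreign key per dimension.
-- (SQLite only enforces REFERENCES when PRAGMA foreign_keys = ON.)
CREATE TABLE fact_sales (
    product_key    INTEGER REFERENCES dim_product(product_key),
    customer_key   INTEGER REFERENCES dim_customer(customer_key),
    date_key       INTEGER REFERENCES dim_date(date_key),
    sales_quantity REAL
);

-- Hypothetical sample rows.
INSERT INTO dim_product  VALUES (1, 'Rice'), (2, 'Wheat');
INSERT INTO dim_customer VALUES (1, 'Ramesh'), (2, 'Sita');
INSERT INTO dim_date     VALUES (1, '2024-04-05'), (2, '2024-04-06');
INSERT INTO fact_sales   VALUES (1, 1, 1, 20), (1, 2, 1, 5), (2, 1, 2, 10);
""")

# Total sales quantity per product: one join from the fact to the dimension.
quantity_by_product = conn.execute("""
    SELECT p.product_name, SUM(f.sales_quantity)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()
```

Every dimension is reached from the fact table in a single join, which is what keeps retrieval simple in this
layout.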
Consider again a fact table that stores sales quantity for each product and customer at a certain time, with keys
from the customer, product and time dimension tables flowing into the fact table. Additionally, the products can be
grouped under different product families stored in a separate table, so that the primary key of the product family
table goes into the product table as a foreign key. Such a construct is called a snow-flake schema, as the product
table is further snow-flaked into product family.
Note:
Snow-flaking increases the degree of normalization in the design.
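A small sqlite3 sketch of the product and product family case follows. The table and column names
(dim_product_family, family_key and so on) and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Product family is split out of the product dimension (the "snow-flake").
CREATE TABLE dim_product_family (family_key INTEGER PRIMARY KEY, family_name TEXT);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    family_key   INTEGER REFERENCES dim_product_family(family_key)
);
CREATE TABLE fact_sales (
    product_key    INTEGER REFERENCES dim_product(product_key),
    sales_quantity REAL
);

-- Hypothetical sample rows.
INSERT INTO dim_product_family VALUES (1, 'Grains'), (2, 'Bakery');
INSERT INTO dim_product VALUES (1, 'Rice', 1), (2, 'Wheat', 1), (3, 'Brown Bread', 2);
INSERT INTO fact_sales VALUES (1, 20), (2, 10), (3, 4);
""")

# Reaching the family now takes two joins: fact -> product -> product family.
quantity_by_family = conn.execute("""
    SELECT pf.family_name, SUM(f.sales_quantity)
    FROM fact_sales f
    JOIN dim_product p         ON p.product_key = f.product_key
    JOIN dim_product_family pf ON pf.family_key = p.family_key
    GROUP BY pf.family_name
    ORDER BY pf.family_name
""").fetchall()
```

The extra join is the price paid for the higher degree of normalization.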
Based on how frequently the data inside a dimension changes, we can further classify dimensions as:
1. Unchanging or static dimension (UCD)
2. Slowly changing dimension (SCD)
3. Rapidly changing Dimension (RCD)
Theoretically, two dimensions which are either identical or strict mathematical subsets of one another are said to be
conformed.
A dimension key such as a transaction number, receipt number or invoice number does not have any more
associated attributes and hence cannot be designed as a dimension table. Such a key is usually called a
degenerate dimension and is kept in the fact table itself.
The attributes in a junk dimension might not be related to one another. The only purpose of this table is to store all
the combinations of the dimensional attributes which you could not otherwise fit into the other dimension tables.
One may want to read an interesting document, De-clutter with Junk (Dimension).
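To make "all the combinations" concrete, here is a minimal Python sketch. The three attributes (payment type,
gift-wrap flag, order channel) and their values are hypothetical, picked only to show the idea.

```python
from itertools import product

# Hypothetical low-cardinality attributes that do not belong to any other dimension.
payment_types = ["cash", "card"]
gift_wrapped = ["Y", "N"]
order_channels = ["store", "web"]

# One junk dimension row per combination, each with its own surrogate key.
junk_dim = [
    {"junk_key": key, "payment_type": p, "gift_wrapped": g, "order_channel": c}
    for key, (p, g, c) in enumerate(
        product(payment_types, gift_wrapped, order_channels), start=1
    )
]
```

The fact table then carries a single junk_key instead of three separate flag columns.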
What is SCD?
SCD stands for slowly changing dimension, i.e. the dimensions where data is slowly changing. These can be of many
types, e.g. Type 0, Type 1, Type 2, Type 3 and Type 6, although Type 1, 2 and 3 are most common.
Type 1:
A type 1 dimension is one where history is not maintained and the table always shows the most recent data. This
effectively means that such a dimension table is always updated with the latest data whenever there is a change,
and because of this update, we lose the previous values.
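A minimal Python sketch of a type 1 update, assuming the dimension is held as a dictionary keyed by customer.
The function name apply_type1_change and the sample values are hypothetical.

```python
# Hypothetical customer dimension keyed by customer id.
customer_dim = {"C1": {"customer": "C1", "group": "G1"}}

def apply_type1_change(dim, customer, new_group):
    """Overwrite the attribute in place; the previous value is not kept anywhere."""
    dim[customer]["group"] = new_group

apply_type1_change(customer_dim, "C1", "G2")
```

After the call there is still exactly one row for C1, and nothing records that the group was ever G1.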
Type 2:
A type 2 dimension table tracks the historical changes by creating separate rows in the table with different surrogate
keys. Consider a customer C1 who is first under group G1 and is later moved to group G2. The dimension table
will then hold two separate records for C1, one with group G1 and one with group G2.
Note that separate surrogate keys are generated for the two records. A NULL end date in the second row denotes
that the record is the current record. Also note that, instead of start and end dates, one could keep a version
number column (1, 2, etc.) to denote the different versions of the record.
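The C1 example can be sketched in Python as follows. The row layout, the function name apply_type2_change
and the choice to close the old row on the day before the change are assumptions of this sketch; the dates
mirror the ones used in the Type 6 table further down.

```python
from datetime import date, timedelta

def apply_type2_change(rows, customer, new_group, change_date):
    """Close the current row for `customer` and append a new current row."""
    for row in rows:
        if row["customer"] == customer and row["end_date"] is None:
            # Assumption: the old version ends the day before the change.
            row["end_date"] = change_date - timedelta(days=1)
    # Assumption: surrogate keys are simply the next integer.
    next_key = max((row["key"] for row in rows), default=0) + 1
    rows.append({
        "key": next_key,
        "customer": customer,
        "group": new_group,
        "start_date": change_date,
        "end_date": None,  # None plays the role of the NULL end date
    })

customer_dim = [
    {"key": 1, "customer": "C1", "group": "G1",
     "start_date": date(2000, 1, 1), "end_date": None},
]
apply_type2_change(customer_dim, "C1", "G2", date(2006, 1, 1))
```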
Type 3:
A type 3 dimension stores the history in a separate column instead of separate rows. So unlike a type 2 dimension,
which grows vertically, a type 3 dimension grows horizontally. For the customer C1 above, a single row would hold
G2 as the current group and G1 as the previous group.
This is only good when you do not need to store many consecutive histories and when the date of change is not
required to be stored.
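A minimal Python sketch of the same idea, assuming one "previous" column. The column names, the function
name apply_type3_change and the third group G3 are hypothetical.

```python
# Hypothetical type 3 row: one column for the current value, one for the previous value.
customer_dim = {
    "C1": {"customer": "C1", "current_group": "G1", "previous_group": None},
}

def apply_type3_change(dim, customer, new_group):
    """Shift the current value into the 'previous' column, then overwrite it."""
    row = dim[customer]
    row["previous_group"] = row["current_group"]
    row["current_group"] = new_group

apply_type3_change(customer_dim, "C1", "G2")
apply_type3_change(customer_dim, "C1", "G3")
```

After the second change, G1 is no longer stored anywhere, which illustrates why this type only suits short
histories.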
Type 6:
A type 6 dimension is a hybrid of type 1, 2 and 3 (1+2+3). It acts very much like type 2, except that you add one
extra column to denote which record is the current record.
Key  Customer  Group  Start Date    End Date       Current Flag
1    C1        G1     1st Jan 2000  31st Dec 2005  N
2    C1        G2     1st Jan 2006  NULL           Y
What is a fact-less-fact?
A fact table that does not contain any measure is called a fact-less fact. Such a table contains only keys from
different dimension tables. It is often used to resolve a many-to-many cardinality issue.
Explanatory Note:
Consider a school, where a single student may be taught by many teachers and a single teacher may have many
students. To model this situation in dimensional model, one might introduce a fact-less-fact table joining teacher and
student keys. Such a fact table will then be able to answer queries like,
1. Who are the students taught by a specific teacher?
2. Which teacher teaches the maximum number of students?
3. Which student has the highest number of teachers? etc.
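The three queries above can be sketched in Python by treating the fact-less fact as a list of (teacher key,
student key) pairs. The keys T1, S1 and so on are made-up sample data.

```python
from collections import Counter

# Hypothetical fact-less fact rows: (teacher_key, student_key), with no measure column.
teaches = [("T1", "S1"), ("T1", "S2"), ("T1", "S3"), ("T2", "S1"), ("T3", "S1")]

# 1. Students taught by a specific teacher.
students_of_t1 = sorted(student for teacher, student in teaches if teacher == "T1")

# 2. Teacher with the most students: count rows per teacher key.
teacher_counts = Counter(teacher for teacher, _ in teaches)
busiest_teacher = teacher_counts.most_common(1)[0][0]

# 3. Student with the most teachers: count rows per student key.
student_counts = Counter(student for _, student in teaches)
most_taught_student = student_counts.most_common(1)[0][0]
```

Since there is no measure, every question is answered by counting rows.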
However, a fact-less fact table cannot answer negative queries, such as which student is not taught by any
teacher. Why not? Because a fact-less fact table only stores the positive scenarios (like a student being taught by
a teacher), so if there is a student who is not being taught by any teacher, that student's key does not appear in
this table, thereby reducing the coverage of the table.
A coverage fact table attempts to answer this, often by adding an extra flag column. Flag = 0 indicates a negative
condition and flag = 1 indicates a positive condition. To understand this better, let's consider a class where there
are 100 students and 5 teachers. The coverage fact table will ideally store 100 x 5 = 500 records (all
combinations), and if a certain teacher is not teaching a certain student, the corresponding flag for that record
will be 0.
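The 100 x 5 example can be sketched in Python as below. The teacher and student keys and the three
"positive" pairs are hypothetical sample data.

```python
from itertools import product

teachers = [f"T{i}" for i in range(1, 6)]     # 5 teachers
students = [f"S{i}" for i in range(1, 101)]   # 100 students

# Hypothetical positive scenarios: which teacher actually teaches which student.
teaches = {("T1", "S1"), ("T1", "S2"), ("T2", "S1")}

# Coverage fact: one row per (teacher, student) combination, flagged 1 or 0.
coverage = [
    {"teacher": t, "student": s, "flag": 1 if (t, s) in teaches else 0}
    for t, s in product(teachers, students)
]

# Negative queries become possible because the flag = 0 rows exist.
taught_students = {row["student"] for row in coverage if row["flag"] == 1}
untaught_students = [s for s in students if s not in taught_students]
teachers_without_students = [
    t for t in teachers
    if all(row["flag"] == 0 for row in coverage if row["teacher"] == t)
]
```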
To understand why data aggregation is needed, let's consider an example from retail business. A certain retail
chain has 500 shops across Europe. All the shops record detail level transactions regarding the products they sell,
and that data is captured in a data warehouse.
Each shop manager can access the data warehouse and they can see which products are sold by whom and in what
quantity on any given date. Thus the data warehouse helps the shop managers with the detail level data that can be
used for inventory management, trend prediction etc.
Now think about the CEO of that retail chain. He does not really care about which salesperson in London sold the
highest number of chopsticks or which shop is the best seller of 'brown breads'. All he is interested in is, perhaps,
the percentage increase of his revenue margin across Europe, or maybe the year-on-year sales growth in East
Europe. Such data is aggregated in nature, because the sales of goods in East Europe are derived by summing up
the individual sales data from each shop in East Europe.
Therefore, to support different levels of data warehouse users, data aggregation is needed.
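A minimal Python sketch of rolling shop-level data up to the regional level the CEO looks at. The shop names,
regions and amounts are invented for the example.

```python
from collections import defaultdict

# Hypothetical detail-level rows: (shop, region, sales amount).
detail_sales = [
    ("London-1", "West Europe", 1200.0),
    ("Paris-1", "West Europe", 800.0),
    ("Warsaw-1", "East Europe", 500.0),
    ("Prague-1", "East Europe", 300.0),
]

# Aggregate: sum the individual shop figures per region.
region_sales = defaultdict(float)
for shop, region, amount in detail_sales:
    region_sales[region] += amount
```

Shop managers would keep querying detail_sales, while the regional totals serve the summary reports.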
What is slicing-dicing?
Slicing means showing a slice of the data for a given dimension (e.g. Product) and value (e.g. Brown Bread),
together with the measures of interest (e.g. sales).
Dicing means viewing the slice with respect to different dimensions and at different levels of aggregation.
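Following the definitions given here, slicing and dicing can be sketched in Python as a filter followed by a
regrouping. The helper names slice_rows and dice, and the sample rows, are assumptions of this sketch.

```python
from collections import defaultdict

# Hypothetical sales rows with three dimensions and one measure.
sales = [
    {"product": "Brown Bread", "region": "East Europe", "year": 2023, "sales": 40},
    {"product": "Brown Bread", "region": "West Europe", "year": 2023, "sales": 60},
    {"product": "Brown Bread", "region": "East Europe", "year": 2024, "sales": 50},
    {"product": "Rice", "region": "East Europe", "year": 2024, "sales": 70},
]

def slice_rows(rows, dimension, value):
    """Slice: keep only the rows where one dimension has a given value."""
    return [row for row in rows if row[dimension] == value]

def dice(rows, dimensions, measure="sales"):
    """Dice: total the measure of a slice, grouped by the chosen dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dimensions)] += row[measure]
    return dict(totals)

brown_bread = slice_rows(sales, "product", "Brown Bread")
by_region = dice(brown_bread, ["region"])                 # coarser aggregation
by_region_year = dice(brown_bread, ["region", "year"])    # finer aggregation
```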
What is drill-through?
Drill through is the process of going from summary data to the detail level data.
Consider the above example on retail shops. If the CEO finds out that sales in East Europe have declined this year
compared to last year, he might want to know the root cause of the decrease. For this, he may start drilling
through his report to a more detailed level and eventually find out that, even though individual shop sales have
actually increased, the overall sales figure has decreased because a certain shop in Turkey has stopped
operating. The detail level data, which the CEO was not much interested in earlier, has this time helped him to
pinpoint the root cause of the declined sales. The method he has followed to obtain the details from the
aggregated data is called drill through.
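The CEO's investigation can be sketched in Python as a summary view plus a detail view over the same rows.
The shop names, figures and helper names (total_by_year, drill_to_shops) are hypothetical.

```python
# Hypothetical detail rows for East Europe: (shop, country, year, sales).
detail = [
    ("Warsaw-1", "Poland", 2023, 100), ("Warsaw-1", "Poland", 2024, 110),
    ("Prague-1", "Czechia", 2023, 80), ("Prague-1", "Czechia", 2024, 90),
    ("Istanbul-1", "Turkey", 2023, 120),  # no 2024 row: the shop stopped operating
]

def total_by_year(rows):
    """Summary level: total sales per year across all shops."""
    totals = {}
    for _, _, year, amount in rows:
        totals[year] = totals.get(year, 0) + amount
    return totals

def drill_to_shops(rows):
    """Detail level: sales per shop and year behind the summary figures."""
    by_shop = {}
    for shop, _, year, amount in rows:
        by_shop.setdefault(shop, {})[year] = amount
    return by_shop

summary = total_by_year(detail)      # the decline is visible here
shops = drill_to_shops(detail)       # the cause is visible here
shops_missing_in_2024 = [shop for shop, years in shops.items() if 2024 not in years]
```

In this sample the yearly total falls even though each shop that is still open sold more, which only shows up
at the shop level.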
1) What is a junk dimension?