Data Warehousing Concepts 2
Yanamala
What is Metadata?
Metadata is information about the structure and content of the data warehouse; in short, data about the data.
10. Why do you use dimensional modelling instead of ER modelling for data warehousing applications?
1) Erwin: Is it possible to reverse engineer two different schemas into a single data model?
2) Suppose there is a star schema in which a fact table has three dimension tables, and the system is in production. Is it possible to add one more dimension table to the fact table? What is the impact at each stage?
This technique is not recommended if you are going to use OLAP tools for your front end, due to speed issues.
Snowflaking allows for easy update and load of data, as redundancy of data is avoided to some extent, but browsing capabilities are greatly compromised. Sometimes, though, it may become a necessary evil.
To add a little to this, snowflaking often becomes necessary when you need data
for which there is a one-to-many relationship with a dimension table. Trying to consolidate this data into the dimension table would necessarily lead to redundancy (a violation of second normal form) and, when joined, to a Cartesian product. This sort of redundancy can cause misleading results in queries, since the count of rows is artificially large. A simple example of such a
situation might be a "customer" dimension for which there is a need to store
multiple contacts. If the contact information is brought into the customer table,
there would be one row for each contact (i.e., one for each customer/contact
combination). In this situation, it is better just to create a "contact" snowflake
table with a FK to the customer. In general, it is better to avoid snowflaking if
possible, but sometimes the consequences of avoiding it are much worse.
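As a minimal sketch of this pattern (table and column names are illustrative, not from the text), the contact snowflake keeps the customer dimension at one row per customer:

```python
import sqlite3

# In-memory sketch of a customer dimension with a contact snowflake table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY,
    customer_name TEXT
);
-- One-to-many: each customer may have several contacts, so contacts
-- live in their own snowflake table instead of widening customer_dim.
CREATE TABLE contact (
    contact_key INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES customer_dim(customer_key),
    contact_name TEXT
);
""")
conn.execute("INSERT INTO customer_dim VALUES (1, 'Acme Corp')")
conn.executemany("INSERT INTO contact VALUES (?, ?, ?)",
                 [(10, 1, 'Alice'), (11, 1, 'Bob')])

# customer_dim still has exactly one row per customer; no artificial
# row inflation from repeating customer data per contact.
rows = conn.execute("SELECT COUNT(*) FROM customer_dim").fetchone()[0]
contacts = conn.execute(
    "SELECT COUNT(*) FROM contact WHERE customer_key = 1").fetchone()[0]
print(rows, contacts)  # 1 2
```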
In star schema, all your dimensions will be linked directly with your fact table. On
the other hand, in a snowflake schema, dimensions may be interlinked or may have
one-to-many relationships with other tables. As previous mails said, this isn't a
desirable situation but you can make best choice once you have gathered all the
requirements.
The snowflake is a design like a star, but with connecting tables among the dimension tables: a table that represents a relationship between two dimensions.
3. Q: Which is better, Star or Snowflake?
A: Strict data warehousing rules would have you use a star schema, but in reality most designs tend to become snowflakes. They each have their pros and cons, but both are far better than trying to use a transactional third-normal-form design.
A: This is one of the absolute worst things you can do. A lot of people initially go
down this road because a tool vendor will support the idea when making their
sales pitch. Many of these attempts will even experience success for a short
period of time. It’s not until your data sets grow and your business questions
begin to be complex that this design mistake will really come out to bite you.
The star schema and OLAP cube are intimately related. Star schemas are most
appropriate for very large data sets. OLAP cubes are most appropriate for
smaller data sets where analytic tools can perform complex data comparisons
and calculations. In almost all OLAP cube environments, it’s recommended that
you originally source data into a star schema structure, and then use wizards to
transform the data into the OLAP cube.
Dimensional modeling divides the world of data into two major types:
Measurements and Descriptions of the context surrounding those
measurements. The measurements, which are typically numeric, are stored in
fact tables, and the descriptions of the context, which are typically textual, are
stored in the dimension tables.
A fact table in a pure star schema consists of multiple foreign keys, each paired
with a primary key in a dimension, together with the facts containing the
measurements.
Every foreign key in the fact table has a match to a unique primary key in the
respective dimension (referential integrity). This allows the dimension table to
possess primary keys that aren’t found in the fact table. Therefore, a product
dimension table might be paired with a sales fact table in which some of the
products are never sold.
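A minimal sketch of such a pure star (names invented for illustration): each foreign key in the fact table references a dimension primary key, and a product that never sells simply has no fact rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE date_dim    (date_key    INTEGER PRIMARY KEY, full_date TEXT);
CREATE TABLE sales_fact (
    product_key INTEGER NOT NULL REFERENCES product_dim(product_key),
    date_key    INTEGER NOT NULL REFERENCES date_dim(date_key),
    dollar_sales REAL,           -- numeric measurements (the facts)
    unit_sales   INTEGER
);
""")
conn.executemany("INSERT INTO product_dim VALUES (?, ?)",
                 [(1, 'Milk'), (2, 'Caviar')])   # Caviar never sells
conn.execute("INSERT INTO date_dim VALUES (20240101, '2024-01-01')")
conn.execute("INSERT INTO sales_fact VALUES (1, 20240101, 3.50, 2)")

# The dimension may hold primary keys never seen in the fact table.
unsold = conn.execute("""
    SELECT p.product_name FROM product_dim p
    LEFT JOIN sales_fact f ON f.product_key = p.product_key
    WHERE f.product_key IS NULL
""").fetchall()
print(unsold)  # [('Caviar',)]
```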
Dimensional models are full-fledged relational models, where the fact table is in
third normal form and the dimension tables are in second normal form.
The main difference between second and third normal form is that repeated
entries are removed from a second normal form table and placed in their own
“snowflake”. Thus the act of removing the context from a fact record and creating
dimension tables places the fact table in third normal form.
Fact tables are mostly very large, and we almost never fetch a single record into our answer set. We fetch a very large number of records on which we then perform aggregation: adding, counting, averaging, or taking the min or max. The most common of these is adding. Applications are simpler if they store facts in an additive format as often as possible. Thus, in the grocery example, we don't need to store the unit price. We compute the unit price by dividing the dollar sales by the unit sales whenever necessary.
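A small sketch of this rule (numbers invented): store only the additive facts and derive the non-additive unit price on demand:

```python
# Store only additive facts; derive the non-additive unit price when needed.
# Values here are invented for illustration.
rows = [
    {"dollar_sales": 10.0, "unit_sales": 4},
    {"dollar_sales": 6.0,  "unit_sales": 2},
]

total_dollars = sum(r["dollar_sales"] for r in rows)  # additive: safe to sum
total_units   = sum(r["unit_sales"] for r in rows)    # additive: safe to sum
unit_price    = total_dollars / total_units           # derived, never stored
print(total_dollars, total_units)   # 16.0 6
print(round(unit_price, 2))         # 2.67
```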
Some facts, like bank balances and inventory levels, represent intensities that are awkward to express in an additive format. We can treat these semi-additive facts as if they were additive, but just before presenting the results to the end user we divide the answer by the number of time periods to get the right result. This technique is called averaging over time.
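Averaging over time can be sketched like this (balances invented): sum the semi-additive fact across periods, then divide by the number of periods:

```python
# Month-end balances for one account (invented numbers).
# Balances are semi-additive: summing across time is meaningless,
# but dividing that sum by the number of periods gives an average.
monthly_balances = [100.0, 120.0, 80.0]

naive_sum = sum(monthly_balances)                  # 300.0 -- not a balance!
avg_over_time = naive_sum / len(monthly_balances)  # 100.0 -- meaningful
print(avg_over_time)  # 100.0
```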
When the enterprise decides to create a set of common labels across all the
sources of data, the separate data mart teams (or, single centralized team) must
sit down to create master dimensions that everyone will use for every data
source. These master dimensions are called Conformed Dimensions.
Two dimensions are conformed if the fields that you use as row headers have the
same domain.
Drilling Across adds more data to an existing row. If drilling down is requesting
ever finer and more granular data from the same fact table, then drilling across is the
process of linking two or more fact tables at the same granularity, or, in other
words, tables with the same set of grouping columns and dimensional
constraints.
A drill across report can be created by using grouping columns that apply to all
the fact tables used in the report.
The new fact table called for in the drill-across operation must share certain
dimensions with the fact table in the original query. All fact tables in a drill-across
query must use conformed dimensions.
If drilling down is adding grouping columns from the dimension tables, then
drilling up is subtracting grouping columns.
The final variant of drilling is drilling around a value circle. This is similar to the
linear value chain that I showed in the previous example, but occurs in a data
warehouse where the related fact tables that share common dimensions are not
arranged in a linear order. The best example is from health care, where as many
as 10 separate entities are processing patient encounters, and are sharing this
information with one another.
E.g. a typical health care value circle with 10 separate entities surrounding the
patient.
When the common dimensions are conformed and the requested grouping
columns are drawn from dimensions that tie to all the fact tables in a given report,
you can generate really powerful drill-around reports by performing separate queries on each fact table and outer joining the answer sets in the client tool.
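This client-side merge can be sketched as follows (product names and numbers invented): run a separate aggregate query against each fact table, then outer join the two answer sets on the conformed row header:

```python
# Two answer sets from separate fact tables, grouped on the same
# conformed dimension value (product). Data is invented.
sales_by_product     = {"Milk": 16.0, "Bread": 9.0}   # from a sales fact
shipments_by_product = {"Milk": 5, "Eggs": 12}        # from a shipments fact

# Outer join in the client tool: keep every product seen in either set.
report = {
    product: (sales_by_product.get(product), shipments_by_product.get(product))
    for product in sorted(sales_by_product.keys() | shipments_by_product.keys())
}
print(report)
# {'Bread': (9.0, None), 'Eggs': (None, 12), 'Milk': (16.0, 5)}
```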
A typical Time dimension table contains attributes such as:
Time_key
Day_of_week
Day_number_in_month
Day_number_overall
Month
Month_number_overall
Quarter
Fiscal_period
Season
Holiday_flag
Weekday_flag
Last_day_in_month_flag
The time stamp in a fact table should be a surrogate key instead of a real date
because:
Q. Why have more than one fact table instead of a single fact table?
We cannot combine all of the business processes into a single fact table
because:
the separate fact tables in the value chain do not share all the
dimensions. You simply can't put the customer ship-to dimension on
the finished goods inventory data
each fact table possesses different facts, and the fact table records
are recorded at different times along the value chain
Q. What is mean by Slowly Changing Dimensions and what are the different
types of SCD’s? (Mascot)
The three fundamental choices for handling a slowly changing dimension are:
Overwriting
Creating another dimension record
Creating a current-value field
A Type 3 SCD adds a new field in the dimension record but does not create a new record. We might change the designation of the customer's sales territory because we redraw the sales territory map, or we arbitrarily change the category of the product from confectionary to candy. In both cases, we augment the original dimension attribute with an "old" attribute so we can switch between these alternate realities.
A surrogate key is useful because the natural primary key (e.g., Customer Number in the Customer table) can change, and this makes updates more difficult.
Another benefit you can get from surrogate keys (SID) is in Tracking the SCD -
Slowly Changing Dimension.
A classical example:
On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that's
what would be in your Employee Dimension). This employee has a turnover
allocated to him on Business Unit 'BU1'. But on the 2nd of June the Employee
'E1' is moved from Business Unit 'BU1' to Business Unit 'BU2'. All the new
turnover has to belong to the new Business Unit 'BU2', but the old turnover should
belong to Business Unit 'BU1'.
If you used the natural business key 'E1' for your employee within your data
warehouse everything would be allocated to Business Unit 'BU2' even what
actually belongs to 'BU1.'
If you use surrogate keys, you could create on the 2nd of June a new record for
the Employee 'E1' in your Employee Dimension with a new surrogate key.
This way, in your fact table, you have your old data (before 2nd of June) with the
SID of the Employee 'E1' + 'BU1.' All new data (after 2nd of June) would take the
SID of the employee 'E1' + 'BU2.'
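The Employee example can be sketched as a Type 2 change (keys and field names invented): the natural key 'E1' stays the same, while a fresh surrogate key starts a new dimension row:

```python
# Type 2 slowly changing dimension, sketched with invented keys.
employee_dim = [
    # surrogate key, natural key, business unit, effective-from date
    {"sid": 1, "emp_id": "E1", "business_unit": "BU1", "from": "2002-01-01"},
]

def apply_type2_change(dim, emp_id, new_bu, date):
    """Record the change by adding a new row with a fresh surrogate key."""
    new_sid = max(row["sid"] for row in dim) + 1
    dim.append({"sid": new_sid, "emp_id": emp_id,
                "business_unit": new_bu, "from": date})
    return new_sid

new_sid = apply_type2_change(employee_dim, "E1", "BU2", "2002-06-02")

# Old fact rows keep surrogate key 1 (BU1); new fact rows use key 2 (BU2).
print(new_sid, len(employee_dim))  # 2 2
```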
Every join between dimension tables and fact tables in a data warehouse
environment should be based on surrogate keys, not natural keys.
Production may reuse keys that it has purged but that you are still
maintaining
Production might legitimately overwrite some part of a product
description or a customer description with new values but not change
the product key or the customer key to a new value. We might be
wondering what to do about the revised attribute values (slowly
changing dimension crisis)
Production may generalize its key format to handle some new situation
in the transaction system, e.g. changing the production keys from
integers to alphanumeric, or the 12-byte keys you are used to may
become 20-byte keys
Acquisition of companies
We can save substantial storage space with integer valued surrogate keys
Eliminate administrative surprises coming from production
Potentially adapt to big surprises like a merger or an acquisition
Have a flexible mechanism for handling slowly changing dimensions
Fact tables which do not have any facts are called factless fact tables. They may
consist of nothing but keys.
There are two kinds of fact tables that do not have any facts at all.
The first type of factless fact table is a table that records an event. Many event-
tracking tables in dimensional data warehouses turn out to be factless.
E.g. A student tracking system that detects each student attendance event each
day.
The second type of factless fact table is called a coverage table. Coverage tables
are frequently needed when a primary fact table in a dimensional data
warehouse is sparse.
E.g. A sales fact table that records the sales of products in stores on particular
days under each promotion condition. The sales fact table does answer many
interesting questions but cannot answer questions about things that did not
happen. For instance, it cannot answer the question, “which products were in
promotion that did not sell?” because it contains only the records of products that
did sell. In this case the coverage table comes to the rescue. A record is placed
in the coverage table for each product in each store that is on promotion in each
time period.
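The "on promotion but did not sell" question can be sketched as a difference set between the coverage table and the sales fact table (tables and data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE promotion_coverage (product TEXT, store TEXT, period TEXT);
CREATE TABLE sales_fact (product TEXT, store TEXT, period TEXT, units INTEGER);
""")
# Three products were on promotion; only two of them sold.
conn.executemany("INSERT INTO promotion_coverage VALUES (?, ?, ?)",
                 [("Milk", "S1", "W1"), ("Bread", "S1", "W1"),
                  ("Eggs", "S1", "W1")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                 [("Milk", "S1", "W1", 10), ("Bread", "S1", "W1", 4)])

# Difference set: on promotion, but absent from the sales fact table.
unsold = conn.execute("""
    SELECT product FROM promotion_coverage
    EXCEPT
    SELECT product FROM sales_fact
""").fetchall()
print(unsold)  # [('Eggs',)]
```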
A causal dimension is a kind of advisory dimension that should not change the
fundamental grain of a fact table.
E.g. why did the customer buy the product? It can be due to a promotion, a sale, etc.
Q. What is slicing and dicing? How can we do it in Impromptu? (We cannot; it is done only in PowerPlay.)
GENERAL
Q. What is the daily data volume (in GB/records)? Or what is the size of the data extracted in the extraction process? (Polaris)
Approximately 900GB.
Q. How many dimension tables did you have in your project, and can you name some dimensions (columns)? (Mascot)
Q. How many Facts & Dimension Tables are there in your Project? (Mascot)
Data warehouses can have many different types of life cycles with independent
data marts. The following is an example of a data warehouse life cycle.
In the life cycle of this example, four important steps are involved.
Q. What are the different Reporting and ETL tools available in the market?
The term data warehousing is often used to describe the process of creating,
managing and using a data warehouse.
A data mart is a selected part of the data warehouse which supports specific
decision support application requirements of a company’s department or
geographical region. It usually contains simple replicates of warehouse partitions
or data that has been further summarized or derived from base warehouse data.
Instead of running ad hoc queries against a huge data warehouse, data marts
allow the efficient execution of predicted queries over a significantly smaller
database.
A data warehouse is for very large databases (VLDBs) and a data mart is for
smaller databases. The difference lies in the scope of the things with which they
deal.
A data mart is an implementation of a data warehouse with a small and more
tightly restricted scope of data and data warehouse functions. A data mart serves
a single department or part of an organization. In other words, the scope of a
data mart is smaller than the data warehouse. It is a data warehouse for a
smaller group of end users.
Q. What is the aim/objective of having a data warehouse? And who needs a data
warehouse? Or what is the use of Data Warehousing? (Polaris)
Data warehousing technology comprises a set of new concepts and tools which
support the executives, managers and analysts with information material for
decision making.
The fundamental reason for building a data warehouse is to improve the quality
of information in the organization.
The main goal of data warehouse is to report and present the information in a
very user friendly form.
A data warehouse system (DWS) comprises the data warehouse and all
components used for building, accessing and maintaining the DWH (illustrated in
Figure 1). The center of a data warehouse system is the data warehouse itself.
The data import and preparation component is responsible for data acquisition. It
includes all programs, applications and legacy systems interfaces that are
responsible for extracting data from operational sources, preparing and loading it
into the warehouse. The access component includes all different applications
(OLAP or data mining applications) that make use of the information stored in the
warehouse.
After the initial load (the first load of the DWH according to the DWH
configuration), during the DWS operation phase, warehouse data must be
regularly refreshed, i.e., modifications of operational data since the last DWH
refreshment must be propagated into the warehouse such that data stored in the
DWH reflect the state of the underlying operational systems. Besides DWH
refreshment, DWS operation includes further tasks like archiving and purging of
DWH data or DWH monitoring.
The need to access historical data (i.e., histories of warehouse data over a
prolonged period of time) is one of the primary incentives for adopting the data
warehouse approach. Historical data are necessary for business trend analysis
which can be expressed in terms of understanding the differences between
several views of the real-time data (e.g., profitability at the end of each month).
Maintaining historical data means that periodical snapshots of the corresponding
operational data are propagated and stored in the warehouse without overriding
previous warehouse states. However, the potential volume of historical data and
the associated storage costs must always be considered in relation to their
potential business benefits.
Finally, a data warehouse usually contains additional data, not explicitly stored in
the operational sources, but derived through some process from operational data
(also called derived data). For example, operational sales data could be stored in
several aggregation levels (weekly, monthly, quarterly sales) in the warehouse.
Q. How often should data be loaded into a data warehouse from transaction
processing and other source systems?
It all depends on the needs of the users, how fast data changes and the volume
of information that is to be loaded into the data warehouse. It is common to
schedule daily, weekly or monthly dumps from operational data stores during
periods of low activity (for example, at night or on weekends). The longer the gap
between loads, the longer the processing times for the load when it does run. A
technical IS/IT staffer should make some calculations and consult with potential
users to develop a schedule to load new data.
Some of the potential benefits of putting data into a data warehouse include:
1. Improving turnaround time for data access and reporting;
2. Standardizing data across the organization so there will be one view of
the "truth";
3. Merging data from various source systems to create a more
comprehensive information source;
4. Lowering costs to create and distribute information and reports;
5. Sharing data and allowing others to access and analyze the data;
6. Encouraging and improving fact-based decision making.
The major limitations associated with data warehousing are related to user
expectations, lack of data and poor data quality. Building a data warehouse
creates some unrealistic expectations that need to be managed. A data
warehouse doesn't meet all decision support needs. If needed data is not
currently collected, transaction systems need to be altered to collect the data. If
data quality is a problem, the problem should be corrected in the source system
before the data warehouse is built. Software can provide only limited support for
cleaning and transforming data. Missing and inaccurate data can not be "fixed"
using software. Historical data can be collected manually, coded and "fixed", but
at some point source systems need to provide quality data that can be loaded
into the data warehouse without manual clerical intervention.
Build one! The easiest way to get started with data warehousing is to analyze
some existing transaction processing systems and see what type of historical
trends and comparisons might be interesting to examine to support decision
making. See if there is a "real" user need for integrating the data. If there is, then
IS/IT staff can develop a data model for a new schema and load it with some
current data and start creating a decision support data store using a database
management system (DBMS). Find some software for query and reporting and
build a decision support interface that's easy to use. Although the initial data
warehouse/data-driven DSS may seem to meet only limited needs, it is a "first
step". Start small and build more sophisticated systems based upon experience
and successes.
Q. Why should the OLTP database be different from the data warehouse database?
A data warehouse usually contains historical data that is derived from transaction data, but it can include data from other sources. Having separate databases separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources.
Q. What are the data modeling tools you have used? (Polaris)
Q. What is a Physical data model?
During the physical design process, you convert the data gathered during the
logical design phase into a description of the physical database, including tables
and constraints.
A logical design is a conceptual and abstract design. We do not deal with the
physical implementation details yet; we deal only with defining the types of
information that we need.
The process of logical design involves arranging data into a series of logical
relationships called entities and attributes.
Entity-Relationship.
Q. How do you extract data from different data sources explain with an example?
(Polaris)
Q. What are the reporting tools you have used? What is the difference between
them? (Polaris)
An example would be to break down the Time dimension and create tables for each level: years, quarters, months, weeks, days. These additional branches on the ERD create more of a snowflake shape than a star.
Data mining can be defined as "a decision support process in which we search
for patterns of information in data." This search may be done just by the user, i.e.
just by performing queries, in which case it is quite hard and in most of the cases
not comprehensive enough to reveal intricate patterns. Data mining uses
sophisticated statistical analysis and modeling techniques to uncover such
patterns and relationships hidden in organizational databases – patterns that
ordinary methods might miss. Once found, the information needs to be presented
in a suitable form, with graphs, reports, etc.
Q. What are the Different types of OLAP's? What are their differences? (Mascot)
ROLAP, MOLAP and HOLAP are specialized OLAP (Online Analytical Processing) applications.
ROLAP stands for Relational OLAP. Users see their data organized in cubes with dimensions, but the data is really stored in a relational database (RDBMS) like Oracle. Because the RDBMS stores data at a fine-grained level, response times are usually slow.
MOLAP stands for Multidimensional OLAP. Users see their data organized in cubes with dimensions, but the data is stored in a multidimensional database (MDBMS) like Oracle Express Server. In a MOLAP system many queries have precomputed answers, and performance is usually fast.
DOLAP stands for Desktop OLAP.
The terms data warehousing and OLAP are often used interchangeably. As the
definitions suggest, warehousing refers to the organization and storage of data
from a variety of sources so that it can be analyzed and retrieved easily. OLAP
deals with the software and the process of analyzing data, managing
aggregations, and partitioning information into cubes for in-depth analysis,
retrieval and visualization. Some vendors are replacing the term OLAP with the
terms analytical software and business intelligence.
Q. Aggregate navigation
Q. How do I set the log level higher for more detailed information within Data
Warehouse Center 7.2?
Within DWC, the log level can be set from 0 to 4. There is also a log level 5, but it cannot be turned on using the GUI; it must be turned on manually. A command line trace can be used for any trace level, and this is the only way to turn on a level 5 trace:
Be sure to reset the trace level to 0 using the command line when you are done:
db2 => update iwh.configuration set value_int = 0
       where name = 'TRACELVL' and (component = '<component name>')
When you run a trace, the Data Warehouse Center writes information to text
files. Data Warehouse Center programs that are called from steps also write any
trace information to this directory. These files are located in the directory
specified by the VWS_LOGGING environment variable.
The Data Warehouse Center supports a wide variety of relational and non
relational data sources. You can populate your Data Warehouse Center
warehouse with data from the following databases and files:
Any DB2 family database
Oracle
Sybase
Informix
Microsoft SQL Server
IBM DataJoiner
Multiple Virtual Storage (OS/390), Virtual Machine (VM), and local area network
(LAN) files
IMS and Virtual Storage Access Method (VSAM) (with Data Joiner Classic
Connect)
When you install the warehouse server, the warehouse control database that you
specify during installation is initialized. Initialization is the process in which the
Data Warehouse Center creates the control tables that are required to store Data
Warehouse Center metadata. If you have more than one warehouse control
database, you can use the Data Warehouse Center -->
Control Database Management window to initialize the second warehouse
control database. However, only one warehouse control database can be active
at a time.
Q. What databases need to be registered as system ODBC data sources for the
Data Warehouse Center?
1. What was the original business problem that led you to do this project?
A consultant who asks this question knows not to make any assumptions
about how much progress you’ve made. She probably also understands
that you might be wrong. There are plenty of clients who have begun
application development without having gathered requirements.
Understanding where the client thinks he is is just as important as
understanding where he wants to be. It also helps the consultant in
making improvement suggestions or recommendations for additional skills
or technologies.
3. How long do you see this position being filled by an external resource?
The consultant who doesn’t ask about deliverables is the consultant who
expects to sit around giving advice. Beware of the "ivory tower"
consultants, who are too light for heavy work and too heavy for light work.
Every consultant you talk to should expect to produce some sort of
deliverable, be it a requirements document, a data model, HTML, a project
plan, test procedures or a mission statement.
The fact that a consultant would offer references is testimony that she
knows her stuff. Many do not. Those consultants who hide behind
nondisclosures for not giving references should be avoided. While it’s
often valid to deny prospective clients work samples because of
confidentiality agreements, there’s no good reason not to offer the name
and phone number of someone who will sing the consultant’s praises.
Don’t be satisfied with a reference for the entire firm. Many good firms can
employ below-average consultants. Ask to talk to someone who’s worked
with the person or team you’re considering. Once you’ve hired that
consultant and are happy with his work, offer to be a reference. It comes
around.
Fact And Fact Table Types
Types of Facts
• Additive: Additive facts are facts that can be summed up through all of
the dimensions in the fact table.
• Semi-Additive: Semi-additive facts are facts that can be summed up for
some of the dimensions in the fact table, but not the others.
• Non-Additive: Non-additive facts are facts that cannot be summed up for
any of the dimensions present in the fact table.
Let us use examples to illustrate each of the three types of facts. The first
example assumes that we are a retailer, and we have a fact table with the
following columns:
Date
Store
Product
Sales_Amount
The purpose of this table is to record the sales amount for each product in each
store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is
an additive fact, because you can sum up this fact along any of the three
dimensions present in the fact table -- date, store, and product. For example, the
sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week.
The second example assumes that we are a bank, and we have a fact table with the following columns:
Date
Account
Current_Balance
Profit_Margin
The purpose of this table is to record the current balance for each account at the
end of each day, as well as the profit margin for each account for each day.
Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-
additive fact, as it makes sense to add them up for all accounts (what's the total
current balance for all accounts in the bank?), but it does not make sense to add
them up through time (adding up all current balances for a given account for
each day of the month does not give us any useful information). Profit_Margin is
a non-additive fact, for it does not make sense to add them up for the account
level or the day level.
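The bank example can be sketched with invented numbers: balances sum meaningfully across accounts but not across days, and margins should be recomputed rather than summed:

```python
# Daily snapshots (invented): (date, account, balance, margin)
rows = [
    ("D1", "A1", 100.0, 0.10),
    ("D1", "A2", 200.0, 0.20),
    ("D2", "A1", 100.0, 0.10),
    ("D2", "A2", 200.0, 0.20),
]

# Semi-additive: summing balances across accounts on one day is meaningful.
total_d1 = sum(bal for d, a, bal, m in rows if d == "D1")  # 300.0

# Summing the same account across days is NOT a balance: A1 held 100 on
# both days, yet the sum says 200, which describes nothing real.
bogus = sum(bal for d, a, bal, m in rows if a == "A1")     # 200.0

# Non-additive: summing margins (0.10 + 0.10 = 0.20) is not the margin
# of anything; it must be recomputed from its underlying components.
print(total_d1, bogus)  # 300.0 200.0
```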
Based on the above classifications, there are two types of fact tables:
• Cumulative: This type of fact table describes what has happened over a
period of time. For example, this fact table may describe the total sales by
product by store by day. The facts for this type of fact tables are mostly
additive facts. The first example presented here is a cumulative fact table.
• Snapshot: This type of fact table describes the state of things in a
particular instance of time, and usually includes more semi-additive and
non-additive facts. The second example presented here is a snapshot fact
table.
A coverage fact table is required when the primary fact table cannot answer certain questions. For example, a sales fact table cannot answer the question "which items were not sold?" A coverage fact table keeps information on all items that were in promotion, irrespective of whether they were sold or not. The answer to the above question is the difference set between the coverage and primary fact tables.
A fact table with only foreign keys and no facts is called a factless fact table. Fact tables are used to record events, such as Web page clicks and employee or student attendance. Events do not always result in facts. So, if we are interested in handling event-based scenarios where there are no facts, we use event fact tables that consist of either pseudo facts or factless facts.
Some of these considerations are: event-based fact tables (event fact tables) typically have pseudo facts or no facts at all; pseudo facts can be helpful in counting; the factless fact event table has only foreign keys and no facts, and the foreign keys can be used for counting purposes.
Factless fact tables can be categorized in two ways: fact tables that record events (also termed event fact tables) and coverage fact tables.
Consider an example of a hospital which is very busy with patients for the entire
day. Figure 6-76 shows a simple star schema that can be used to track the daily
revenue of the hospital. The grain of the star schema is a single patient visiting
the hospital in a single day. The following are the dimension tables in the star
schema:
Insurance dimension: Contains one row for each insurance the patient has
Date dimension: Contains the date at the daily level
Hospital dimension: A single row for each hospital
Customer dimension: One row for each customer
If a patient visits a hospital but does not have insurance, then the fact row points to the 'Does not have insurance' row in the insurance dimension. However, if a patient visits a hospital and has insurance, then the insurance row specifying the type of insurance is attached to the fact table row. What this means is that there
may be several patients that have insurance, but they are visible inside the fact
table only when they visit the hospital. Once they visit the hospital, a row appears
for these patients inside the fact table HOSPITAL_DAILY_REVENUE.
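Counting against such a factless fact table can be sketched as follows (schema simplified and data invented): each row is one patient visit on one day, so counting rows answers "how many visits per day?":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Factless fact table: only foreign keys, no numeric facts.
conn.execute("""CREATE TABLE hospital_daily_revenue
                (date_key INTEGER, hospital_key INTEGER,
                 customer_key INTEGER, insurance_key INTEGER)""")
conn.executemany("INSERT INTO hospital_daily_revenue VALUES (?, ?, ?, ?)",
                 [(1, 1, 10, 5), (1, 1, 11, 0), (2, 1, 10, 5)])

# Each row records an event; COUNT(*) over the keys gives visits per day.
visits = conn.execute("""
    SELECT date_key, COUNT(*) FROM hospital_daily_revenue
    GROUP BY date_key ORDER BY date_key
""").fetchall()
print(visits)  # [(1, 2), (2, 1)]
```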