Data Mining and Warehousing (Chapter 3)
Introduction:
Data warehouse (DW) data modeling is the process of designing and creating
a conceptual representation of data for analytical and reporting purposes. It involves
structuring data in a way that optimizes query performance, data integrity, and user
understanding.
Star Schema
The Sales table is the fact table, and the Products, Location, and Time tables are the
dimension tables. The schema is organized this way because the analysis needs to
determine the sales made at various locations, at various times, and for different
products. With this structure, queries can be run against the fact table and relevant
findings can be drawn from the results.
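As a rough illustration, the fact and dimension tables described above could be sketched
with Python's built-in sqlite3 module. The column names (product_id, city, amount, and so
on) and the sample rows are assumptions made up for this example; they do not come from
the original schema.

```python
import sqlite3

# In-memory database; all column names and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE Location (location_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE Time     (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- Fact table: one foreign key per dimension plus a numeric measure.
CREATE TABLE Sales (
    product_id  INTEGER REFERENCES Products(product_id),
    location_id INTEGER REFERENCES Location(location_id),
    time_id     INTEGER REFERENCES Time(time_id),
    amount      REAL
);

INSERT INTO Products VALUES (1, 'Laptop'), (2, 'Phone');
INSERT INTO Location VALUES (1, 'City A'), (2, 'City B');
INSERT INTO Time     VALUES (1, 2023, 1), (2, 2023, 2);
INSERT INTO Sales    VALUES (1, 1, 1, 1000.0), (2, 1, 1, 500.0),
                            (1, 2, 2, 1200.0), (2, 2, 2, 400.0);
""")

# Total sales per city: join the fact table to one dimension and aggregate.
sales_by_city = conn.execute("""
    SELECT l.city, SUM(s.amount)
    FROM Sales AS s
    JOIN Location AS l ON l.location_id = s.location_id
    GROUP BY l.city
    ORDER BY l.city
""").fetchall()
print(sales_by_city)
```

Each query of this kind needs only one join per dimension it uses, which is the main
practical appeal of keeping the dimension tables flat.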
Snowflake Schema
Consider a case where we need to know not only the products sold but also the
categories, and in turn the subcategories, that those products belong to. The Products
table would then need to be extended: separate tables for category names and
subcategory names would be created to aid in classifying the products. In this case,
the database would be modified further into this:
Simply put, the snowflake schema is an extension of the star schema. In this case, the
dimension tables are further restructured or normalized into sub-dimensions, which
reduces redundancy in the dimension data at the cost of additional joins in queries.
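A minimal sketch of this normalization, again using sqlite3, is shown here. It assumes
that each product points to a subcategory and each subcategory points to a category; the
table layout, column names, and rows are hypothetical and only illustrate the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The product dimension is normalized into two further sub-dimension tables.
CREATE TABLE Category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE Subcategory (
    subcategory_id   INTEGER PRIMARY KEY,
    subcategory_name TEXT,
    category_id      INTEGER REFERENCES Category(category_id)
);
CREATE TABLE Products (
    product_id     INTEGER PRIMARY KEY,
    product_name   TEXT,
    subcategory_id INTEGER REFERENCES Subcategory(subcategory_id)
);
CREATE TABLE Sales (product_id INTEGER REFERENCES Products(product_id), amount REAL);

INSERT INTO Category    VALUES (1, 'Electronics'), (2, 'Furniture');
INSERT INTO Subcategory VALUES (1, 'Computers', 1), (2, 'Phones', 1), (3, 'Chairs', 2);
INSERT INTO Products    VALUES (1, 'Laptop', 1), (2, 'Smartphone', 2),
                               (3, 'Office chair', 3);
INSERT INTO Sales       VALUES (1, 1000.0), (2, 500.0), (3, 150.0), (1, 900.0);
""")

# Reaching the category now takes a chain of joins through the sub-dimensions.
sales_by_category = conn.execute("""
    SELECT c.category_name, SUM(s.amount)
    FROM Sales AS s
    JOIN Products    AS p  ON p.product_id = s.product_id
    JOIN Subcategory AS sc ON sc.subcategory_id = p.subcategory_id
    JOIN Category    AS c  ON c.category_id = sc.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(sales_by_category)
```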
OLAP (online analytical processing) and data warehousing use multidimensional
databases, which present multiple dimensions of the data to users.
A multidimensional database represents data in the form of data cubes. A data cube
allows data to be modeled and viewed from many dimensions and perspectives. It is
defined by dimensions and facts, and is represented by a fact table. Facts are numerical
measures; the fact table contains the names of the facts (the measures) together with
keys to each of the related dimension tables.
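To make the idea of viewing one measure from several dimensions concrete, here is a small
sketch in plain Python. The fact rows, the dimension names, and the helper aggregate()
are all invented for illustration; a real OLAP engine would do this far more efficiently.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, location, quarter, sales_amount).
facts = [
    ("Laptop", "City A", "Q1", 1000.0),
    ("Laptop", "City B", "Q1", 1200.0),
    ("Phone", "City A", "Q1", 500.0),
    ("Phone", "City A", "Q2", 700.0),
]
DIMENSIONS = ("product", "location", "quarter")


def aggregate(rows, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]  # the last field is the numeric measure
    return dict(totals)


by_product = aggregate(facts, ["product"])
by_product_location = aggregate(facts, ["product", "location"])
grand_total = aggregate(facts, [])
print(by_product)
print(by_product_location)
print(grand_total)
```

Each call corresponds to one view of the same cube: keeping fewer dimensions gives a more
summarized view, and keeping none gives the grand total.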
Every project that builds a multidimensional data model should go through the
following stages:
Stage 1 : Assembling data from the client : In the first stage, the correct data is
collected from the client. Typically, software professionals explain to the client the
range of data that can be obtained with the selected technology, and then collect the
complete data in detail.
Stage 2 : Grouping different segments of the system : In the second stage, all the data
is identified and classified into the section it belongs to, which makes the model
straightforward to apply step by step.
Stage 3 : Identifying the dimensions : The third stage forms the basis of the system's
design. In this stage, the main factors are identified from the user’s point of view.
These factors are known as “dimensions”.
Stage 4 : Preparing the factors and their respective qualities : In the fourth stage,
the factors identified in the previous stage are used to identify their related
qualities. These qualities are known as “attributes” in the database.
Stage 5 : Finding the facts among the factors listed previously and their qualities :
In the fifth stage, the facts are separated and distinguished from the factors collected
so far. These facts play a significant role in the arrangement of a multidimensional
data model.
Stage 6 : Building the schema to hold the data : In the sixth stage, a schema is built
on the basis of the information collected in the preceding stages.
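The later stages can be sketched in code under some simplifying assumptions: the
dimensions and their attributes are written down as a dictionary, the fact table and its
measures are named, and a hypothetical helper build_schema() turns them into CREATE TABLE
statements. All names here are invented for the example, and every attribute is stored as
TEXT only to keep the sketch short.

```python
import sqlite3

# Dimensions and their attributes (hypothetical names).
dimensions = {
    "Products": ["product_name"],
    "Location": ["city"],
    "Time": ["year", "month"],
}
# The fact table and its numeric measures.
fact_table, measures = "Sales", ["amount"]


def build_schema(dimensions, fact_table, measures):
    """Return CREATE TABLE statements for a simple star schema."""
    statements = []
    fact_columns = []
    for dim, attributes in dimensions.items():
        key = f"{dim.lower()}_id"
        cols = [f"{key} INTEGER PRIMARY KEY"] + [f"{a} TEXT" for a in attributes]
        statements.append(f"CREATE TABLE {dim} ({', '.join(cols)})")
        # The fact table gets one foreign key per dimension.
        fact_columns.append(f"{key} INTEGER REFERENCES {dim}({key})")
    fact_columns += [f"{m} REAL" for m in measures]
    statements.append(f"CREATE TABLE {fact_table} ({', '.join(fact_columns)})")
    return statements


ddl = build_schema(dimensions, fact_table, measures)

# Run the generated statements to confirm they form a valid schema.
conn = sqlite3.connect(":memory:")
for statement in ddl:
    conn.execute(statement)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```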
Additional Tips:
1. Involve business stakeholders.
2. Document requirements.
3. Use data profiling.
4. Iterate and refine.
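Kimball's four-step dimensional design process (select the business process, declare
the grain, identify the dimensions, identify the facts) helps ensure that a data
warehouse meets business needs.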
A Slowly Changing Dimension (SCD) is a dimension that stores and manages both
current and historical data over time in a data warehouse. It is considered and
implemented as one of the most critical ETL tasks in tracking the history of dimension
records.
There are three types of SCDs, and you can use Warehouse Builder to define, deploy,
and load all three.
Type 1 SCDs (Overwriting): In a Type 1 SCD, the new data overwrites the existing data.
Thus the existing data is lost as it is not stored anywhere else. This is the default type of
dimension you create. You do not need to specify any additional information to create a
Type 1 SCD.
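A minimal sketch of Type 1 behaviour in plain Python follows. The customer dimension, its
attributes, and the helper apply_type1() are assumptions for illustration only and are
not part of Warehouse Builder.

```python
# Hypothetical customer dimension keyed by a business key.
customer_dim = {101: {"name": "Customer X", "city": "City A"}}


def apply_type1(dim, key, changes):
    """Type 1: overwrite attributes in place; the old values are lost."""
    dim[key].update(changes)


apply_type1(customer_dim, 101, {"city": "City B"})
print(customer_dim[101])
```

After the update, nothing in the dimension records that the customer was ever in City A.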
Type 2 SCDs (Historical tracking): A Type 2 SCD retains the full history of values.
When the value of a chosen attribute changes, the current record is closed. A new
record is created with the changed data values and this new record becomes the
current record. Each record contains the effective time and expiration time to identify the
time period between which the record was active.
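The close-and-insert behaviour of Type 2 can be sketched as follows. The row layout
(surrogate_key, effective_date, expiration_date) and the helper apply_type2() are
hypothetical; here an expiration_date of None is assumed to mark the current record.

```python
from datetime import date

# Hypothetical dimension rows; expiration_date None marks the current record.
customer_dim = [
    {"surrogate_key": 1, "customer_id": 101, "city": "City A",
     "effective_date": date(2023, 1, 1), "expiration_date": None},
]


def apply_type2(rows, customer_id, new_city, change_date):
    """Type 2: close the current record and append a new current record."""
    for row in rows:
        if row["customer_id"] == customer_id and row["expiration_date"] is None:
            if row["city"] == new_city:
                return  # nothing changed, keep the current record open
            row["expiration_date"] = change_date  # close the old record
    rows.append({
        "surrogate_key": max((r["surrogate_key"] for r in rows), default=0) + 1,
        "customer_id": customer_id,
        "city": new_city,
        "effective_date": change_date,
        "expiration_date": None,
    })


apply_type2(customer_dim, 101, "City B", date(2024, 6, 1))
current = [r for r in customer_dim if r["expiration_date"] is None]
print(current)
```

Both rows stay in the table, so a query can recover the value that was active on any
given date by comparing it with the effective and expiration dates.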
Type 3 SCDs (Current and previous values): A Type 3 SCD stores two versions of
values for certain selected level attributes. Each record stores the previous value and
the current value of the selected attribute. When the value of any of the selected
attributes changes, the current value is stored as the old value and the new value
becomes the current value.
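A short sketch of Type 3 behaviour, assuming a single tracked attribute stored in two
hypothetical columns, current_city and previous_city:

```python
# Hypothetical dimension with one tracked attribute kept in two columns.
customer_dim = {101: {"current_city": "City A", "previous_city": None}}


def apply_type3(dim, key, new_city):
    """Type 3: move the current value to the 'previous' column, then overwrite."""
    row = dim[key]
    if row["current_city"] != new_city:
        row["previous_city"] = row["current_city"]
        row["current_city"] = new_city


apply_type3(customer_dim, 101, "City B")
apply_type3(customer_dim, 101, "City C")
print(customer_dim[101])
```

After the second change only City C and City B remain; City A is gone, which shows that
this approach keeps one previous value rather than a full history.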
Differences between Type 1, Type 2, and Type 3:
Type 1 – The old current value is overwritten with the new current value. No history is
maintained.
Type 2 – The current and the historical records are kept and maintained in the same
file or table.
Type 3 – The current data and historical data are kept in the same record.