Ab Initio
Inmon: “A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process”
OR
Kimball: “A copy of the transactional data specially structured for reporting and analysis”
How Organizations Use Data Warehousing
Retail: Customer Loyalty, Market Planning
Financial: Risk Management, Fraud Detection
Manufacturing: Cost Reduction, Logistics Management
Utilities: Asset Management, Resource Management
Airlines: Route Profitability, Yield Management
Data Warehouse – Subject Oriented
Organized around major subjects, such as Customer, Sales, and Account.
Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Provides a simple and concise view of particular subject issues by excluding data that are not useful in the decision-support process.
[Diagram: application-oriented operational data (Accounts Receivable, Billing, Order Processing) reorganized into subject-oriented warehouse data (Customer, Account, Sales)]
Data Warehouse – Integrated
[Diagram: data from separate Savings Account, Loans Account, and Checking Account systems integrated into a single subject: Account]
Data Warehouse – Non-Volatile
[Diagram: operational Order Processing data allows create, insert, update, delete, and access; warehouse Sales Data allows only the initial load and read access]
Data Warehouse – Time Variant
The time horizon for the data warehouse is significantly longer than that of operational
systems
Operational database: current value data
Data warehouse data: provide information from a historical perspective (e.g., past 5-10
years)
Every key structure in the data warehouse contains an element of time, but the key of operational data may or may not contain a “time element”.
[Diagram: a Deposit System holds 60–90 days of current data; the warehouse’s Customer Data holds 5–10 years of history]
Data Warehouse – OLTP vs. OLAP
[Diagram: a Staging Area sits between the OLTP source systems and the OLAP warehouse]
Data Warehouse vs. Data Mart
Star Schema: “A modeling technique used to map multidimensional decision-support data into a relational database for the purpose of performing advanced data analysis”
OR
“A relational database schema organized around a central table (the fact table) joined to a few smaller tables (dimension tables) using foreign-key references”
1) Fact table
2) Dimension table
Fact Table :
“Fact table contains the transactional data generated out of business transactions”
Dimension Table :
“Dimension table contains master data or referential data used to analyze transactional
data”
A fact table contains two types of columns:
1) Measures
2) Key section (foreign keys to the dimension tables)
[Diagram: a fact table whose key section contains Date, Prod_id, and Cust_id]
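As a sketch of the star-schema idea above, the following uses Python’s built-in sqlite3 module to build a hypothetical sales fact table joined to Product and Customer dimension tables via foreign keys (all table and column names are illustrative, not from any specific warehouse):

```python
import sqlite3

# In-memory database; table and column names are illustrative only.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dimension tables: master/reference data used to analyze transactions.
cur.execute("CREATE TABLE product (prod_id INTEGER PRIMARY KEY, prod_name TEXT)")
cur.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, cust_name TEXT)")

# Fact table: key section (date + foreign keys) plus a measure (amount).
cur.execute("""CREATE TABLE sales_fact (
                   sale_date TEXT,
                   prod_id INTEGER REFERENCES product(prod_id),
                   cust_id INTEGER REFERENCES customer(cust_id),
                   amount REAL)""")

cur.execute("INSERT INTO product VALUES (1, 'Widget')")
cur.execute("INSERT INTO customer VALUES (10, 'Acme')")
cur.execute("INSERT INTO sales_fact VALUES ('2005-01-15', 1, 10, 250.0)")

# Analysis query: join the central fact table to its dimensions.
row = cur.execute("""SELECT p.prod_name, c.cust_name, SUM(f.amount)
                     FROM sales_fact f
                     JOIN product p ON f.prod_id = p.prod_id
                     JOIN customer c ON f.cust_id = c.cust_id
                     GROUP BY p.prod_name, c.cust_name""").fetchone()
print(row)  # ('Widget', 'Acme', 250.0)
```

The fact table stays narrow (keys plus measures) while descriptive attributes live in the dimensions, which is what makes the star layout efficient for analysis queries.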
Data Warehouse – 3 Types of Measures (in the Fact Table)
Additive measures:
“Measures that can participate in calculations in order to derive new measures”; they can be summed across all dimensions (e.g., sales amount).
Non-additive measures:
“Measures that cannot participate in calculations” (e.g., ratios and percentages).
Semi-additive measures:
“Measures that can participate in calculations depending on the context”; they can be added across some dimensions but not others (e.g., an account balance, which adds across accounts but not across time).
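A minimal sketch of the measure types in plain Python; the sample fact rows are invented for illustration. Sales amount is additive across every dimension, while account balance is semi-additive: it adds across accounts within one month, but not across months.

```python
# Hypothetical fact rows: (month, account, balance, sales_amount)
rows = [
    ("Jan", "A", 100, 40),
    ("Jan", "B", 200, 60),
    ("Feb", "A", 150, 40),
    ("Feb", "B", 250, 60),
]

# Additive: sales_amount may be summed across all dimensions.
total_sales = sum(r[3] for r in rows)  # 40 + 60 + 40 + 60 = 200

# Semi-additive: balances add across accounts within a single month...
jan_balance = sum(r[2] for r in rows if r[0] == "Jan")  # 100 + 200 = 300
feb_balance = sum(r[2] for r in rows if r[0] == "Feb")  # 150 + 250 = 400

# ...but summing balances across months is meaningless; a different
# aggregate (such as the average over time) is used instead.
avg_monthly_balance = (jan_balance + feb_balance) / 2  # 350.0

# Non-additive: a ratio such as price-per-unit cannot be summed at all;
# it must be recomputed from its additive components after aggregation.
```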
Types of Star Schema
Star Schema:
“Fact tables exist in normalized form, whereas dimension tables exist in denormalized form”
Snowflake Schema:
“Both fact and dimension tables exist in normalized form”
Factless Fact Table:
“Transaction events can occur without measures, resulting in a fact table without measures”
Example of Star Schema
Example of Snowflake Schema
Data Warehouse – Slowly Changing Dimensions
Slowly Changing Dimensions:
Dimensions that change over time are called Slowly Changing Dimensions. For instance, a product’s price changes over time, people change their names, and country and state names may change.
Type 1: In 2005, if the price of the product changes to $250, the old values of the “Year” and “Product Price” columns are simply overwritten with the new values. With Type 1 there is no way to find out the old 2004 price of “Product1”, since the table now contains only the new price and year.
Type 2: each change adds a new row to the dimension, tracked with effective and expiry timestamps:

PRODUCT
Product ID (PK) | Effective Date time (PK) | Year | Product Name | Price | Expiry Date time
With Type 3, only the latest change to the values can be seen. The example below illustrates how new columns are added to keep track of the change, so that both the current price and the previous price of Product1 are visible.
The problem with the Type 3 approach is that, over the years, if the product price changes continuously, the complete history is not stored; only the latest change is. For example, if in 2006 Product1’s price changes to $350, we can no longer see the 2004 price, since the old values would have been overwritten with the 2005 product information.
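The three SCD approaches above can be sketched in plain Python over a hypothetical product dimension; the names, dates, and prices follow the Product1 example and are illustrative only.

```python
# SCD Type 1: overwrite in place -- the 2004 history is lost.
type1 = {"product": "Product1", "year": 2004, "price": 150}
type1.update({"year": 2005, "price": 250})  # old year/price overwritten

# SCD Type 2: add a new row with effective/expiry dates -- full history kept.
type2 = [
    {"product": "Product1", "price": 150,
     "effective": "2004-01-01", "expiry": "2004-12-31"},
    {"product": "Product1", "price": 250,
     "effective": "2005-01-01", "expiry": None},  # current row
]

# SCD Type 3: an extra column holds the previous value -- one step of history.
type3 = {"product": "Product1", "price": 250, "old_price": 150}
# A further change to $350 pushes 250 into old_price; the 2004 price is lost.
type3["old_price"], type3["price"] = type3["price"], 350
```

Type 2 is the only variant that preserves the complete price history, at the cost of a growing dimension table.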
Transform
• The transform stage applies a series of rules or functions to the extracted source data to derive the data for loading into the end target. Some data sources require very little or even no manipulation; in other cases, one or more of the following transformation types may be required to meet the business and technical needs of the target database:
• Generating surrogate-key values
• Looking up and validating the relevant data from tables or reference files for slowly changing dimensions
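Surrogate-key generation, mentioned above, can be sketched as a lookup table that assigns a new sequential key the first time a natural key is seen. This is a simplified illustration, not the implementation of any particular ETL tool:

```python
import itertools

def make_surrogate_lookup():
    """Return a function mapping each natural key to a stable surrogate key."""
    counter = itertools.count(1)   # surrogate keys start at 1
    mapping = {}                   # natural key -> surrogate key

    def lookup(natural_key):
        if natural_key not in mapping:
            mapping[natural_key] = next(counter)  # assign on first sight
        return mapping[natural_key]

    return lookup

sk = make_surrogate_lookup()
keys = [sk("CUST-ACME"), sk("CUST-GLOBEX"), sk("CUST-ACME")]
print(keys)  # [1, 2, 1] -- a repeated natural key reuses its surrogate
```

In a real warehouse the mapping would be persisted (and, for SCD Type 2, a natural key can map to several surrogate keys over time), but the lookup pattern is the same.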
Load
• The load phase loads the data into the end target, usually the data warehouse
(DW). Depending on the requirements of the organization, this process varies
widely.
• As the load phase interacts with a database, the constraints defined in the
database schema — as well as in triggers activated upon data load — apply (for
example, uniqueness, referential integrity, mandatory fields), which also
contribute to the overall data quality performance of the ETL process.
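The constraint behaviour described above can be sketched with Python’s built-in sqlite3 module: a primary-key constraint in the target schema rejects a duplicate row during load, which the load process must catch and report. The schema and rows are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Target table with a uniqueness constraint and a mandatory (NOT NULL) field.
cur.execute("""CREATE TABLE dim_customer (
                   cust_id INTEGER PRIMARY KEY,
                   cust_name TEXT NOT NULL)""")

rows = [(1, "Acme"), (2, "Globex"), (1, "Acme")]  # third row violates the PK
loaded, rejected = 0, 0
for row in rows:
    try:
        cur.execute("INSERT INTO dim_customer VALUES (?, ?)", row)
        loaded += 1
    except sqlite3.IntegrityError:
        rejected += 1   # constraint violation: route to an error/audit table
print(loaded, rejected)  # 2 1
```

Rejected rows are typically written to an audit or reject table rather than silently dropped, feeding the audit-report step of the ETL cycle.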
Real-life ETL cycle
The typical real-life ETL cycle consists of the following execution steps:
• Cycle initiation
• Build reference data
• Extract (from sources)
• Validate
• Transform (clean, apply business rules, check for data integrity, create
aggregates or disaggregates)
• Stage (load into staging tables, if used)
• Audit reports (for example, on compliance with business rules. Also, in case
of failure, helps to diagnose/repair)
• Publish (to target tables)
• Archive
• Clean up
A recent development in ETL software is the implementation of parallel processing.
This has enabled a number of methods to improve overall performance of ETL
processes when dealing with large volumes of data.
ETL applications implement three main types of parallelism:
• Data: splitting a single sequential file into smaller data files to provide parallel access.
• Pipeline: running several components simultaneously on the same data stream, for example looking up a value on record 1 at the same time as adding two fields on record 2.
• Component: running multiple processes simultaneously on different data streams in the same job, for example sorting one input file while removing duplicates on another.
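Component parallelism, the last type above, can be sketched with Python’s standard concurrent.futures: two independent “components” (sorting one input while de-duplicating another) run at the same time on their own data streams. The component functions and inputs are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def sort_stream(records):
    return sorted(records)                 # component 1: sort one input

def dedupe_stream(records):
    return list(dict.fromkeys(records))    # component 2: drop duplicates, keep order

file_a = [3, 1, 2]
file_b = ["x", "y", "x"]

# Run both components simultaneously on different data streams.
with ThreadPoolExecutor(max_workers=2) as pool:
    fut_sorted = pool.submit(sort_stream, file_a)
    fut_unique = pool.submit(dedupe_stream, file_b)
    sorted_a, unique_b = fut_sorted.result(), fut_unique.result()

print(sorted_a, unique_b)  # [1, 2, 3] ['x', 'y']
```

Data parallelism would instead split one large file into chunks and submit the same function once per chunk; pipeline parallelism chains stages on the same stream via queues.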
AB INITIO INTRODUCTION
A data processing tool from Ab Initio Software Corporation (http://www.abinitio.com).
“Ab initio” is Latin for “from the beginning”.