DATA WAREHOUSING FUNDAMENTALS

Definition of Data Warehouse

• Inmon:

"A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decisions."

OR

A data warehouse is an informational environment that

• Provides an integrated and total view of the enterprise
• Makes the enterprise's current and historical information easily available for decision making
• Makes decision-support transactions possible without hindering operational systems
• Renders the organization's information consistent
• Presents a flexible and interactive source of strategic information

OR

"A copy of transactional data specifically structured for reporting and analysis."

How Organizations Use Data Warehousing

• Retail
  Customer Loyalty
  Market Planning

• Financial
  Risk Management
  Fraud Detection

• Manufacturing
  Cost Reduction
  Logistics Management

• Utilities
  Asset Management
  Resource Management

• Airlines
  Route Profitability
  Yield Management
Data Warehouse – Subject-Oriented

• Organized around major subjects, such as Customer, Sales, and Account.
• Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
• Provides a simple and concise view around particular subject issues by excluding data that is not useful in the decision-support process.

[Diagram: operational systems such as Accounts Receivable, Billing, Order Processing, and Sales feed subject-oriented Customer Data and Account Data in the data warehouse]
Data Warehouse – Integrated

• Constructed by integrating multiple, heterogeneous data sources
  - Relational or other databases, flat files, external data
• Data cleaning and data integration techniques are applied
  - Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among the different data sources
• When data is moved into the warehouse, it is converted

[Diagram: Savings Account, Loans Account, and Checking Account systems are integrated into a single Account subject in the data warehouse]
Data Warehouse – Non-Volatile

• A physically separate store of data transformed from the operational environment
• Operational updates of data do not occur in the data warehouse environment
• Does not require transaction processing, recovery, or concurrency control mechanisms
• Requires only two operations: loading of data and access to data

[Diagram: operational systems such as Order Processing use create, insert, update, and delete; the data warehouse (Sales Data) uses only load and access]
Data Warehouse – Time-Variant

• The time horizon for the data warehouse is significantly longer than that of operational systems
  - Operational database: current-value data
  - Data warehouse: provides information from a historical perspective (e.g., the past 5-10 years)
• Every key structure in the data warehouse contains an element of time, whereas the key of operational data may or may not contain a time element

[Diagram: a Deposit System holds 60-90 days of data; Customer Data in the data warehouse holds 5-10 years]
Data Warehouse – OLTP vs. OLAP

OLTP (On-line Transaction Processing)
• holds current data
• useful for end users
• stores detailed data
• data is dynamic
• repetitive processing (one record processed at a time)
• high level of transaction throughput
• predictable pattern of usage
• transaction driven
• application oriented
• supports day-to-day decisions
• response time is very quick
• serves a large number of operational users

OLAP (On-line Analytical Processing)
• holds historic and integrated data
• useful for EIS and DSS
• stores detailed and summarized data
• data is largely static (a group of records processed in a batch)
• ad-hoc, unstructured, and heuristic processing
• medium or low level of transaction throughput
• unpredictable pattern of usage
• analysis driven
• subject oriented
• supports strategic decisions
• response time is optimum
• serves a relatively small number of managerial users
Data Warehouse Architecture

[Architecture diagram: data flows from the source systems through a staging area into the data warehouse]
Data Warehouse vs. Data Mart

Data Warehouse
• Corporate / enterprise-wide
• Union of all data marts
• Data received from the staging area
• Structured for a corporate view of the data
• Queries on the presentation resource
• Organized on an E-R model

Data Mart
• Departmental
• A single business process
• Star join (facts & dimensions)
• Technology optimal for data access and analysis
• Structured to suit the departmental view of the data
To Meet Requirements within the Data Warehouse

• The data is organized differently in the data warehouse (e.g., multidimensionally)
  - Star schema
  - Snowflake schema

• The data is viewed differently

• The data is stored differently
  - Vector (array) storage

• The data is indexed differently (a small bitmap-index sketch follows this list)
  - Bitmap indexes
  - Join indexes
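Purely as an illustration of the bitmap-index idea (not tied to any particular warehouse product), the following Python sketch builds one bit-vector per distinct value of a low-cardinality column; the column name and sample rows are invented for the example.

# Minimal bitmap-index sketch: one bit per row, one bitmap per distinct value.
# The "region" column and the sample rows are illustrative only.
rows = ["EAST", "WEST", "EAST", "NORTH", "WEST", "EAST"]

bitmaps = {}
for position, value in enumerate(rows):
    bitmaps.setdefault(value, 0)
    bitmaps[value] |= 1 << position          # set the bit for this row

# A predicate such as  region = 'EAST'  becomes a scan of one bitmap.
east_rows = [i for i in range(len(rows)) if bitmaps["EAST"] >> i & 1]
print(east_rows)                             # -> [0, 2, 5]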
Star Schema

• Star Schema: "A modeling technique used to map multidimensional decision-support data into a relational database for the purpose of performing advanced data analysis"

OR

"A relational database schema organized around a central table (the fact table) joined to a few smaller tables (dimension tables) using foreign key references"

Types of star schema

1) Basic star schema, or star schema
2) Extended star schema, or snowflake schema
Multidimensional Modeling

• Multidimensional modeling is based on the concept of the star schema.

A star schema consists of two types of tables:

1) Fact table
2) Dimension table

Fact Table:
"The fact table contains the transactional data generated by business transactions"

Dimension Table:
"A dimension table contains the master or reference data used to analyze the transactional data"
• The fact table contains two types of columns:
  1) Measures
  2) Key section

[Example fact table: key section = Date, Prod_id, Cust_id; measures = Sales_revenue, Tot_quantity, Unit_cost, Sale_price]

The data warehouse supports 3 types of measures:

1) Additive measures
2) Non-additive measures
3) Semi-additive measures
Additive measures:
"Measures that can be involved in calculations in order to derive new measures"

Non-additive measures:
"Measures that cannot participate in calculations"

Semi-additive measures:
"Measures that can participate in calculations depending on the context", i.e., measures that can be added across some dimensions but not others.
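The key section and measures described above can be sketched as a small star schema. The Python/sqlite3 example below is only an illustration; the table and column names (sales_fact, product_dim, customer_dim) are assumed for the example, not a prescribed design.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables hold the master / reference data.
cur.execute("CREATE TABLE product_dim (prod_id INTEGER PRIMARY KEY, prod_name TEXT)")
cur.execute("CREATE TABLE customer_dim (cust_id INTEGER PRIMARY KEY, cust_name TEXT)")

# The fact table holds the key section (date plus foreign keys) and the measures.
cur.execute("""
    CREATE TABLE sales_fact (
        sale_date     TEXT,
        prod_id       INTEGER REFERENCES product_dim(prod_id),
        cust_id       INTEGER REFERENCES customer_dim(cust_id),
        sales_revenue REAL,   -- additive
        tot_quantity  REAL,   -- additive
        unit_cost     REAL,   -- non-additive in most contexts
        sale_price    REAL    -- non-additive in most contexts
    )""")

cur.execute("INSERT INTO product_dim VALUES (1, 'Product1')")
cur.execute("INSERT INTO customer_dim VALUES (10, 'Customer1')")
cur.execute("INSERT INTO sales_fact VALUES ('2004-01-15', 1, 10, 150.0, 1, 100.0, 150.0)")

# Analysis joins the fact table to its dimensions through the foreign keys.
cur.execute("""
    SELECT p.prod_name, SUM(f.sales_revenue)
    FROM sales_fact f JOIN product_dim p ON p.prod_id = f.prod_id
    GROUP BY p.prod_name""")
print(cur.fetchall())    # [('Product1', 150.0)]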
Types of Star Schema

The data warehouse supports two types of star schema:

1) Basic star schema, or star schema
2) Extended star schema, or snowflake schema

Star Schema:
"Fact tables exist in normalized format, whereas dimension tables exist in denormalized format"

Snowflake Schema:
"Both fact and dimension tables exist in normalized format"
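To make the normalization difference concrete, the sketch below contrasts a denormalized product dimension (basic star schema) with the same dimension split into product and category tables (snowflake schema); all table and column names are invented for illustration.

# Star schema: the dimension is denormalized, so category data repeats per product.
star_product_dim = """
CREATE TABLE product_dim (
    prod_id       INTEGER PRIMARY KEY,
    prod_name     TEXT,
    category_name TEXT,      -- repeated for every product in the category
    category_mgr  TEXT
)"""

# Snowflake schema: the same dimension is normalized into two related tables.
snowflake_category_dim = """
CREATE TABLE category_dim (
    category_id   INTEGER PRIMARY KEY,
    category_name TEXT,
    category_mgr  TEXT
)"""
snowflake_product_dim = """
CREATE TABLE product_dim (
    prod_id     INTEGER PRIMARY KEY,
    prod_name   TEXT,
    category_id INTEGER REFERENCES category_dim(category_id)
)"""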

Factless fact table, or coverage table:

"Events or transactions can occur without measures, resulting in a fact table without measures"
Example of Star Schema
Example Of Snow Flake Schema
Data Warehouse – Slowly Changing Dimensions

Slowly Changing Dimensions:

Dimensions that change over time are called slowly changing dimensions (SCDs). For instance, a product's price changes over time, people change their names, and country and state names may change. These are a few examples of slowly changing dimensions, since changes happen to them over a period of time.

Type 1: Overwriting the old values
Type 2: Creating an additional record
Type 3: Creating new fields
SCD Type 1

• Type 1: Overwriting the old values

• Product price in 2004:

  Product ID (PK) | Year | Prod Name | Price
  1               | 2004 | Product1  | 150

• In the year 2005, if the price of the product changes to $250, then the old values of the "Year" and "Price" columns are updated and replaced with the new values. With Type 1, there is no way to find out the old 2004 value for "Product1", since the table now contains only the new price and year information.

• Product after the update:

  Product ID (PK) | Year | Prod Name | Price
  1               | 2005 | Product1  | 250
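A minimal Python sketch of a Type 1 change, assuming the row layout of the table above (the dictionary stands in for the dimension row): the update simply overwrites, so the 2004 values are lost.

# SCD Type 1: overwrite in place -- history is not preserved.
product_dim = {1: {"year": 2004, "prod_name": "Product1", "price": 150}}

def scd_type1_update(dim, product_id, year, price):
    dim[product_id]["year"] = year       # old value is simply overwritten
    dim[product_id]["price"] = price

scd_type1_update(product_dim, 1, 2005, 250)
print(product_dim[1])   # {'year': 2005, 'prod_name': 'Product1', 'price': 250}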
SCD Type 2

Type 2: Creating an additional record.

PRODUCT
  Product ID (PK) | Effective Date/Time (PK) | Year | Product Name | Price | Expiry Date/Time
  1               | 01-01-2004 12:00 AM      | 2004 | Product1     | 150   | 12-31-2004 11:59 PM
  1               | 01-01-2005 12:00 AM      | 2005 | Product1     | 250   |
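A Type 2 change can be sketched by expiring the current row and appending a new one; the effective/expiry columns follow the table above, and the in-memory list is only a stand-in for the dimension table.

# SCD Type 2: expire the current row and append a new one -- full history is kept.
product_dim = [
    {"product_id": 1, "effective": "2004-01-01 00:00", "expiry": None,
     "year": 2004, "prod_name": "Product1", "price": 150},
]

def scd_type2_update(dim, product_id, change_date, year, price):
    current = next(r for r in dim if r["product_id"] == product_id and r["expiry"] is None)
    current["expiry"] = change_date              # close out the current version
    dim.append({"product_id": product_id, "effective": change_date, "expiry": None,
                "year": year, "prod_name": current["prod_name"], "price": price})

scd_type2_update(product_dim, 1, "2004-12-31 23:59", 2005, 250)
# Both the expired 2004 row and the new 2005 row are now present.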
SCD Type 3

• Type 3: Creating new fields

• With Type 3, only the latest change to the tracked values can be seen. The example below shows how to add new columns and keep track of the changes; from it we can see the current price and the previous price of the product Product1.

  Product ID (PK) | Current Year | Product Name | Current Product Price | Old Product Price | Old Year
  1               | 2005         | Product1     | 250                   | 150               | 2004

• The problem with the Type 3 approach is that, over the years, if the product price changes repeatedly, the complete history is not stored; only the latest change is. For example, if Product1's price changes to $350 in 2006, we can no longer see the 2004 price, since the old-value columns have been overwritten with the 2005 information.

  Product ID (PK) | Year | Product Name | Product Price | Old Product Price | Old Year
  1               | 2006 | Product1     | 350           | 250               | 2005
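A Type 3 change can be sketched with one extra "old" column per tracked attribute, which is exactly why only the most recent change survives:

# SCD Type 3: keep the current value plus one previous value in extra columns.
product_dim = {1: {"current_year": 2004, "prod_name": "Product1",
                   "current_price": 150, "old_price": None, "old_year": None}}

def scd_type3_update(dim, product_id, year, price):
    row = dim[product_id]
    row["old_price"], row["old_year"] = row["current_price"], row["current_year"]
    row["current_price"], row["current_year"] = price, year

scd_type3_update(product_dim, 1, 2005, 250)   # the 2004 values move to the "old" columns
scd_type3_update(product_dim, 1, 2006, 350)   # ...and now the 2004 history is gone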


ETL (Extract, Transform, Load)

• Extract, transform, and load (ETL) is a process in database usage, and especially in data warehousing, that involves:
  • Extracting data from outside sources
  • Transforming it to fit operational needs (which can include quality levels)
  • Loading it into the end target (database or data warehouse)
Extract

• The first part of an ETL process involves extracting the data from the source systems.
• Most data warehousing projects consolidate data from different source systems. Common data source formats are relational databases and flat files, but sources may also include non-relational database structures such as Information Management System (IMS), other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even data fetched from outside sources through web spidering or screen scraping. Extraction converts the data into a format suitable for transformation.
Transform

• The transform stage applies a series of rules or functions to the extracted data to derive the data for loading into the end target. Some data sources require very little or even no manipulation. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the target database (a brief sketch of two of them follows the list):

• Generating surrogate-key values
• Transposing or pivoting (turning multiple columns into multiple rows, or vice versa)
• Splitting a column into multiple columns (e.g., turning a comma-separated list stored as a string in one column into individual values in separate columns)
• Disaggregating repeating columns into a separate detail table (e.g., moving a series of addresses in one record into single addresses in a set of records in a linked address table)
• Looking up and validating the relevant data from tables or reference files for slowly changing dimensions
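Two of the transformation types above (surrogate-key generation and splitting a delimited column) can be sketched in a few lines of Python; the record layout is made up for the example.

import itertools

# Source records as extracted (layout is illustrative).
extracted = [
    {"natural_key": "CUST-001", "phones": "555-0100,555-0101"},
    {"natural_key": "CUST-002", "phones": "555-0200"},
]

surrogate = itertools.count(start=1)     # surrogate-key generator

transformed = []
for record in extracted:
    transformed.append({
        "customer_sk": next(surrogate),                      # generated surrogate key
        "natural_key": record["natural_key"],
        # Splitting one comma-separated column into separate columns.
        "phone_1": record["phones"].split(",")[0],
        "phone_2": (record["phones"].split(",") + [None])[1],
    })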
Load

• The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely.

• Some data warehouses may overwrite existing information with cumulative information; updates of extracted data are frequently done on a daily, weekly, or monthly basis. Other DWs (or even other parts of the same DW) may add new data in a historicized form, for example hourly.

• As the load phase interacts with a database, the constraints defined in the database schema, as well as in triggers activated upon data load, apply (for example, uniqueness, referential integrity, mandatory fields), and these also contribute to the overall data quality performance of the ETL process.
Real-life ETL cycle

The typical real-life ETL cycle consists of the following execution steps (a toy end-to-end sketch follows the list):

• Cycle initiation
• Build reference data
• Extract (from sources)
• Validate
• Transform (clean, apply business rules, check for data integrity, create
aggregates or disaggregates)
• Stage (load into staging tables, if used)
• Audit reports (for example, on compliance with business rules; in case of failure, these also help to diagnose and repair)
• Publish (to target tables)
• Archive
• Clean up
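Purely to illustrate the ordering of these steps (not any vendor's API), the toy driver below calls one placeholder function per step of the cycle:

# Placeholder step functions -- each real step would be far more involved.
def build_reference_data():   return {"valid_codes": {"A", "B"}}
def extract(sources):         return [r for s in sources for r in s]
def validate(rows, ref):      return [r for r in rows if r["code"] in ref["valid_codes"]]
def transform(rows):          return [{**r, "amount": round(r["amount"], 2)} for r in rows]
def stage(rows):              pass          # load into staging tables, if used
def audit(rows):              print(f"{len(rows)} rows passed validation")
def publish(rows, target):    target.extend(rows)
def archive(rows):            pass
def cleanup():                pass

def run_cycle(sources, target):
    ref = build_reference_data()             # cycle initiation + reference data
    rows = validate(extract(sources), ref)
    rows = transform(rows)                   # clean, apply business rules, aggregate
    stage(rows); audit(rows); publish(rows, target)
    archive(rows); cleanup()

warehouse = []
run_cycle([[{"code": "A", "amount": 10.005}, {"code": "X", "amount": 5.0}]], warehouse)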
Parallel Processing

A recent development in ETL software is the implementation of parallel processing. This has enabled a number of methods to improve the overall performance of ETL processes when dealing with large volumes of data.

ETL applications implement three main types of parallelism:

• Data: splitting a single sequential file into smaller data files to provide parallel access (a small sketch of this follows the list).
• Pipeline: allowing the simultaneous running of several components on the same data stream, for example looking up a value on record 1 at the same time as adding two fields on record 2.
• Component: the simultaneous running of multiple processes on different data streams in the same job, for example sorting one input file while removing duplicates on another file.
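Data parallelism in particular can be sketched with the Python standard library: split the input into partitions and run the same transform on each partition in a pool of worker processes (the data and the transform are invented for the example).

from multiprocessing import Pool

def transform_chunk(chunk):
    """Placeholder transform applied to one partition of the data."""
    return [line.upper() for line in chunk]

if __name__ == "__main__":
    # Pretend this list was read from a large sequential source file.
    records = [f"record-{i}" for i in range(1_000)]

    # Data parallelism: split the input into smaller partitions...
    partitions = [records[i::4] for i in range(4)]

    # ...and run the same transform on each partition in parallel.
    with Pool(processes=4) as pool:
        results = pool.map(transform_chunk, partitions)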
AB INITIO INTRODUCTION

• Data processing tool from Ab Initio Software Corporation (http://www.abinitio.com)
• "Ab initio" is Latin for "from the beginning"
• Designed to support the largest and most complex business applications
• Graphical, intuitive, and "fits the way your business works"
Importance of Ab Initio Compared to Other ETL Tools

1) Able to process huge amounts of data in a short span of time.
2) Easy to write complex and custom ETL logic, especially for banking and financial applications (e.g., amortization).
3) Ab Initio supports all three types of parallelism that an ETL tool needs to handle.
4) Ab Initio's data parallelism is one feature that distinguishes it from other ETL tools.
5) When handling complex logic, you can write custom code, as it is Pro C-based.
