Data Warehouse / ETL Testing:
Reasons for building a Data Warehouse:
1) Data is scattered at different places.
2) Data inconsistency.
3) OLTP data is volatile (data keeps on changing), so a separate non-volatile store is needed.
[Diagram: an OLTP system is application oriented; through extraction, transformation/integration, and load its data becomes subject oriented in the data warehouse. Examples: a Student Admission application loads into a Student subject; Current, Saving, and Checking account applications load into an Account subject; Order, Shipment, and Payment applications load into a Sales subject. The data warehouse itself is time variant, non-volatile, subject oriented, and integrated.]
Characteristic features of a Data Warehouse:
1) Time-Variant: A data warehouse is a time-variant database which supports the business needs of end users in comparing and analyzing the business across different time periods. This is also known as time series analysis.
[Chart: sales amount plotted per quarter and product (Q1 vs Q2) to show the variance of the data over time.]
2) Non-volatile:
i. Once data enters the data warehouse it is not changed or deleted.
ii. The data is only loaded and read; history is preserved for analysis.
3) Subject oriented: A data warehouse is a subject-oriented database which supports the business needs of department-specific users.
Ex: Sales, accounts, HR, students, loans, etc.
A subject is derived from multiple OLTP applications, organizing the data to meet a specific business functionality.
4) Integrated: A data warehouse is an integrated database which collects data from multiple OLTP databases.
[Diagram: data from an OLTP database in the U.K. and an OLTP database in India is extracted, integrated, and loaded into the SALES subject of the data warehouse.]
A data warehouse is a container that stores the business data.
Data warehousing: Data warehousing is the process of building a data warehouse. The process includes
i. Business requirement collection
ii. Database designing (dimensional modeling)
iii. ETL development (data acquisition)
iv. Report development (OLAP)
A business analyst and an onsite technical coordinator collect the business requirements and the technical requirements.
BRS (Business Requirement Specification): A BRS contains the business requirements collected by the business analyst.
SRS (Software Requirement Specification): An SRS contains the software and hardware requirements collected by senior technical people.
The process of designing the database is called data modeling or dimensional modeling. A database architect or data modeler designs the warehouse as a set of tables.
OLAP (Online Analytical Processing): OLAP is a technology which supports business managers in querying the data warehouse. An OLAP tool provides the gateway between the users and the data warehouse.
A data warehouse is also known as an OLAP database.
Ex: Cognos, Business Objects (BO)
Difference between OLTP database and Data Warehouse (DWH):
OLTP:
1) It is designed to support transaction processing
2) Volatile data
3) Current data
4) Detailed data
5) Designed for running the business
6) Normalization
7) Application oriented data
8) Designed for critical (day-to-day) operations
9) ER modeling
DWH:
1) It is designed to support decision making
2) Non-volatile data
3) Historical data
4) Summary data
5) Designed for analyzing the business
6) De-normalization
7) Subject oriented data
8) Designed for managerial operations
9) Dimensional modeling
Enterprise Data Warehouse objects: A relational database is defined as a collection of objects such as tables, views, procedures, macros, triggers, etc.
Table: A table is a two-dimensional object where data is stored in the form of rows and columns.
View: A view is like a window into one or more tables. It provides customized access to the base table(s) and allows:
1) Restricting which columns are visible from base tables
2) Restricting which rows are visible from base tables.
3) Combining rows and columns from several base tables.
It may be defined as a subset of rows of a table.
It may be defined as a subset of columns of a table.
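For example, a view can restrict both the rows and the columns of a base table (a minimal sketch; the customer table and its columns are illustrative):

-- A view exposing only selected columns and rows of a base table
-- (table and column names are illustrative)
CREATE VIEW v_male_customers AS
SELECT cid,
       cfname,
       clname
FROM   customer
WHERE  gender = 'M';

-- Users query the view exactly like a table
SELECT * FROM v_male_customers;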
Data warehouse RDBMS: The following relational databases can be used to build a data warehouse:
i. Oracle
ii. SQL Server
iii. IBM DB2
iv. Teradata
v. Greenplum
vi. Netezza
vii. Sybase
viii. Red Brick
ix. Informix
One of the best RDBMS for storing massive historical information with parallel storage and parallel retrieval is Teradata.
Data acquisition: It is the process of extracting the relevant business information, transforming the data into the required business format, and loading it into the target system.
Data acquisition is defined by the following processes:
1) Data extraction
2) Data transformation
3) Data loading
There are two types of ETL used to build data acquisition:
1) Code-based ETL
2) GUI-based ETL
Code-based ETL:
An ETL application can be developed using a programming language such as SQL or PL/SQL.
Ex: SAS Base, SAS/ACCESS, Teradata ETL utilities
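A minimal sketch of code-based ETL written in plain SQL, assuming a source table src_sales and a target table dwh_sales (both names are illustrative): the SELECT performs extraction, the expressions perform transformation, and the INSERT performs loading.

-- Extract from the source, transform, and load into the target in one statement
-- (src_sales and dwh_sales are assumed, illustrative tables)
INSERT INTO dwh_sales (sale_id, product, sale_amount)
SELECT sale_id,
       UPPER(product),          -- simple transformation
       qty * price              -- derived business value
FROM   src_sales
WHERE  sale_date >= DATE '2010-01-01';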
Teradata ETL utilities:
i. BTEQ
ii. FastLoad
iii. MultiLoad
iv. TPump
GUI-based ETL: An ETL application can be designed through a simple graphical user interface using point-and-click techniques.
Ex: Informatica, DataStage, Ab Initio, ODI (Oracle Data Integrator), Data Services, Data Manager, SSIS (SQL Server Integration Services).
Data extraction:
It is the process of reading the data from various types of source systems. The following types of sources are used for extraction:
1) ERP sources
i. SAP
ii. Oracle Applications
iii. J.D. Edwards
iv. PeopleSoft
2) File sources
i. XML files
ii. Flat files
3) Relational sources
i. Oracle
ii. SQL Server
iii. DB2
iv. Sybase
4) Legacy sources
i. Mainframes
ii. AS/400
iii. COBOL files, etc.
Data transformation:
It is the process of transforming and cleaning the data into the required business format.
The following data transformation activities take place in the staging area:
i. Data merging
ii. Data cleansing
iii. Data scrubbing
iv. Data aggregation
Staging:
Staging is a temporary storage area where the following activities take place:
i. Data merging
ii. Data cleansing
iii. Data scrubbing
iv. Data aggregation
Data cleansing:
It is the process of correcting inconsistencies and inaccuracies in the data.
[Diagram: in the staging area, source sales amounts ($1.10, $2.00, $3.76, $4.0) are rounded as part of data cleansing before being loaded into the target.]
Examples:
1) Removing duplicate records is data cleansing.
2) Handling records which contain nulls.
3) Removing unwanted spaces.
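A sketch of these cleansing rules in SQL, assuming an Oracle-style staging table stg_sales (the table and column names are illustrative):

-- 1) Remove duplicates: keep a single row per sale_id (uses Oracle ROWID)
DELETE FROM stg_sales s
WHERE  s.rowid NOT IN (SELECT MIN(s2.rowid)
                       FROM   stg_sales s2
                       GROUP  BY s2.sale_id);

-- 2) Handle records which contain nulls: replace a NULL amount with zero
UPDATE stg_sales
SET    sale_amount = 0
WHERE  sale_amount IS NULL;

-- 3) Remove unwanted leading/trailing spaces from the product name
UPDATE stg_sales
SET    product = TRIM(product);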
Data scrubbing:
It is the process of deriving new attributes.
An attribute is nothing but a table column.
[Diagram: the OLTP source Sales table (Sale id, Product, Price, QTY, Discount) is scrubbed into the DWH Sales Info table, which adds the derived attributes Sale amount = QTY * Price, Sales tax = QTY * Price * 0.15, and Profit.]
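A sketch of this derivation in SQL, following the figure (the staging and target table names are illustrative; the 15% tax rate comes from the figure):

-- Derive new attributes (sale_amount, sales_tax) from existing columns
INSERT INTO dwh_sales_info (sale_id, product, price, qty, discount,
                            sale_amount, sales_tax)
SELECT sale_id,
       product,
       price,
       qty,
       discount,
       qty * price        AS sale_amount,   -- derived attribute
       qty * price * 0.15 AS sales_tax      -- derived attribute (15% tax)
FROM   stg_sales;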
Data aggregation:
It is the process of summarizing detailed data into summary data in the staging area using aggregate functions such as SUM() or MAX().
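A sketch of data aggregation in SQL, assuming the same illustrative staging table stg_sales and a summary target dwh_sales_summary:

-- Summarize detailed staging rows into one summary row per product
INSERT INTO dwh_sales_summary (product, total_qty, total_revenue, max_sale)
SELECT product,
       SUM(qty),
       SUM(qty * price),
       MAX(qty * price)
FROM   stg_sales
GROUP  BY product;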
Data merging:
It is the process of integrating the data from multiple source systems.
There are two types of data merge operations that take place in the staging area:
1. Horizontal merging
2. Vertical merging
Horizontal merging:
It is the process of merging the records horizontally using joins.
[Diagram: source table S1 (Empno, Ename, Sal, Deptno; e.g. 2355, smith, 4000, 10) is joined with source table S2 (Deptno, Dname, Loc; e.g. 10, sales, texas) on Deptno in the staging area.]
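A sketch of horizontal merging in SQL, using the employee and department structures from the figure (src_emp and src_dept are illustrative table names):

-- Horizontal merge: combine employee and department records on deptno
SELECT e.empno,
       e.ename,
       e.sal,
       d.deptno,
       d.dname,
       d.loc
FROM   src_emp  e
JOIN   src_dept d ON e.deptno = d.deptno;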
Vertical merging:
It is the process of merging the records vertically when the two sources have the same metadata (UNION).
Metadata here means the data structure (the column names of the two tables are the same).
[Diagram: source table S1 (Empno, Ename, Sal, Deptno; e.g. 2355, smith, 4000, 10) and source table S2 with the same structure (e.g. 3255, allen, 2140) are combined with a UNION in the staging area.]
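A sketch of vertical merging in SQL, assuming the two sources S1 and S2 have the same structure (src_emp_s1 and src_emp_s2 are illustrative table names):

-- Vertical merge: stack the rows of two sources with the same metadata
SELECT empno, ename, sal, deptno FROM src_emp_s1
UNION
SELECT empno, ename, sal, deptno FROM src_emp_s2;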
Data loading:
It is the process of inserting the data into a target system. There are two types of data loads:
1) Initial load or full load
2) Incremental load or delta load
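A sketch contrasting the two load types in SQL, assuming illustrative staging and target tables and a sale_date column used to detect new rows:

-- Initial (full) load: move everything from staging to the target the first time
INSERT INTO dwh_sales
SELECT * FROM stg_sales;

-- Incremental (delta) load: move only rows newer than the last loaded date
INSERT INTO dwh_sales
SELECT *
FROM   stg_sales s
WHERE  s.sale_date > (SELECT MAX(t.sale_date) FROM dwh_sales t);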
ETL client server technology:
An ETL plan defines extraction, transformation, and loading. An ETL plan is designed with the following types of metadata:
1) Source definition: the structure of the source table from which data is extracted.
2) Target definition: the structure of the target table to which data is loaded.
3) Transformation rule: the business logic used for processing the data.
Metadata:
It defines the structure of the data, represented as column name, data type, precision, scale, and keys (primary key, foreign key).
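For example, the metadata of the Customer table shown in the diagram below can be written as a table definition (a sketch using the figure's own column names and Oracle data types):

-- Metadata: column names, data types, precision, and keys
CREATE TABLE customer (
    cid     NUMBER(4)    PRIMARY KEY,   -- precision 4, primary key
    cfname  VARCHAR2(6),
    clname  VARCHAR2(6),
    gender  VARCHAR2(6)
);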
[Diagram: an ETL plan stored in the repository maps the source definition Customer (CID number(4) PK, CFName varchar2(6), CLName varchar2(6), Gender varchar2(6)) through business logic (B-Logic) to the target definition T-Customer (CID number(4), CName varchar2(6), Gender varchar2(6)); the ETL server executes the plan.]
ETL client:
An ETL client is a graphical user interface component where an ETL developer designs ETL plans.
ETL repository:
An ETL repository is the brain of an ETL system, where metadata such as ETL plans is stored.
ETL server: It is the ETL engine which performs extraction, transformation, and loading.
In Informatica an ETL plan is called a mapping.
In DataStage it is called a job.
In Ab Initio it is called a graph.
Extra:
Initial load or full load:
When data is extracted from the source system and loaded into the target system for the first time, the records are inserted directly into the target system.
[Diagram: the OLTP CUSTOMER table (CID, CNAME) is loaded into the DWH target table (TCID, ...).]
The data type length in the target system must be the same as or greater than the source length; it must not be less than the source length.
Data warehouse database design:
A data warehouse is designed with the following types of schema:
1. Star schema
2. Snowflake schema
3. Galaxy schema (also called constellation schema, integrated schema, hybrid schema, or multi-star schema)
The process of designing the database is known as data modeling.
A database architect (or data modeler) creates database schemas using a GUI-based database design tool called ERwin, a product of Computer Associates.
1) Star schema:
A star schema is a database design which contains a centrally located fact table surrounded by dimension tables.
Since the database design looks like a star, it is called a star schema design.
In a data warehouse, facts are numeric. A fact table contains facts.
Not every numeric value is a fact; only numeric values which are key performance indicators are known as facts.
Facts are business measures which are used to evaluate the performance of an enterprise.
A fact table contains the facts at the lowest level of granularity.
The fact granularity determines the level of detail.
A dimension is descriptive data which describes the key performance indicators known as facts.
A dimension table contains de-normalized data.
A fact table contains normalized data.
A fact table contains a composite key in which each component key is a foreign key to a dimension table.
A dimension provides answers to the following business questions: 1) Who 2) What 3) When 4) Where.
[Diagram: star schema example. The Sale-Transaction fact table (Transaction_ID PK, Date_Key FK, Market_Key FK, Product_Key FK, Customer_Key FK, QTY, Revenue, Profit) is surrounded by Dim-Customer (Customer_Key PK, Name, Address, Phone), Dim-Time (Date_Key PK, Year, Quarter, Month, Week, Day), Dim-Product (Product_Key PK, Category, Sub-category, Product), and Dim-Market (Market_Key PK, Market Code, Market Name).]
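A sketch of part of the star schema in the figure as table definitions (abridged to two dimensions; the data type sizes are illustrative):

-- Dimension tables: de-normalized descriptive data
CREATE TABLE dim_customer (
    customer_key NUMBER(10)    PRIMARY KEY,
    name         VARCHAR2(50),
    address      VARCHAR2(100),
    phone        VARCHAR2(20)
);

CREATE TABLE dim_product (
    product_key  NUMBER(10)    PRIMARY KEY,
    category     VARCHAR2(30),
    sub_category VARCHAR2(30),
    product      VARCHAR2(50)
);

-- Fact table: facts (qty, revenue, profit) plus foreign keys to the dimensions
CREATE TABLE sale_transaction (
    transaction_id NUMBER(10) PRIMARY KEY,
    customer_key   NUMBER(10) REFERENCES dim_customer (customer_key),
    product_key    NUMBER(10) REFERENCES dim_product (product_key),
    qty            NUMBER(10),
    revenue        NUMBER(12,2),
    profit         NUMBER(12,2)
);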
2) Snowflake schema:
[Diagram: a fact table surrounded by dimensions D1, D2, D3, D4, where dimension D4 is further split (snowflaked) into D41 and D42.]
Dimension D4 is split into dimension D41 and dimension D42; D42 is a parent table and the others are child tables.
Extra:
A schema is nothing but an arrangement of tables.
KPI means Key Performance Indicator. Ex: QTY, Revenue, Profit, Gross Profit.
A numeric value which plays a key role in analyzing and estimating the business of an enterprise in the data warehouse is a fact.
A fact table contains facts. Facts are numeric.
Not every numeric value is a fact.
Facts are business measures because they estimate the business performance.
A dimension is descriptive data.
Facts are analyzed by descriptive data; that descriptive data is called a dimension.
A mapping is an ETL plan which defines extraction, transformation, and loading.
[Example fact table row: Customer = BSR, City = HYD, Date = 3/11/2010, Product = LG LED are descriptive data; QTY = 1, Revenue = $250, Profit = $50 are facts.]
[Diagram: through dimensional modeling, the OLTP Customer table (Customer_Id, c-name, c-address, c-phone) and Employee table (Emp_Id, Emp-name, Emp-address, Emp-phone) become the Dim-Customer and Dim-Employee dimension tables in the DWH.]
Extra: Difference between star and snowflake schema:
A star schema is de-normalized, which means duplicate records are maintained.
In a snowflake schema the split tables are normalized, which means duplicate records are not maintained.
A snowflake schema is used to reduce table space.
Galaxy schema:
An integration of star and snowflake schemas is called a galaxy schema.
Common (conformed, reusable) dimensions shared by multiple fact tables are called conformed dimensions.
A constellation is formed by joining two fact tables.
A galaxy is a schema which contains multiple schemas.
A key performance indicator is called a fact.
A fact table without any facts is called a factless fact table.
Dimensions which cannot be used to describe facts are called junk dimensions.
[Diagram: a fact table (Customer FK, Employee FK, QTY, Revenue) analyzed through the Customer dimension (c-id, c-name), with an example junk dimension table holding c-id = 101, c-name = bsr, address = s.r.nagar, Product = t.v, QTY = 10, Revenue = 7500.]
Slowly changing dimension:
A slowly changing dimension is a dimension whose attribute values change slowly over time.
[Diagram: in Dim-Customer, customer 101 (name BSR) changes address over time (DSNR, rjnr, kp); each change is stored as a new row with a new surrogate Customer_Key (1, 2, 3) while the natural key C-ID 101 repeats.]
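A sketch of storing such a change as a Type 2 slowly changing dimension in SQL (the current_flag column is an assumption; the figure only shows the surrogate key Customer_Key and the repeated natural key C-ID):

-- Close the current row for customer 101 (assumed current_flag column)
UPDATE dim_customer
SET    current_flag = 'N'
WHERE  c_id = 101
AND    current_flag = 'Y';

-- Insert a new row with a new surrogate key and the changed address
INSERT INTO dim_customer (customer_key, c_id, c_name, c_address, current_flag)
VALUES (3, 101, 'BSR', 'kp', 'Y');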
Types of OLAP:
1. DOLAP (Desktop OLAP):
An OLAP which queries data from a database built using desktop databases like dBase, FoxPro, Clipper, etc.
XML files, txt files, and Excel files are also desktop data sources.
2. ROLAP (Relational OLAP):
ROLAP is used to query the data from relational sources like SQL Server, Oracle, Sybase, and Teradata.
3. MOLAP (Multidimensional OLAP):
It is used to query the data from multidimensional sources like cubes and DMR.
4. HOLAP (Hybrid OLAP):
It is a combination of ROLAP and MOLAP.
Data mart and types of data mart:
A data mart is a subject-oriented database which supports the business needs of department-specific business managers.
Or:
A data mart is a subset of an enterprise data warehouse. A data mart is also known as a high performance query structure (HPQS).
There are two types of data marts:
1. Dependent data marts
2. Independent data marts
The enterprise is an integration of various departments; an integration of multiple data marts forms the enterprise data warehouse.
Difference between enterprise data warehouse and data mart:
EDW:
1) It is an integration of multiple subjects
2) It stores enterprise-specific business information
3) Designed for top management (CEO, board of directors)
Data mart:
1) It defines a single subject
2) It stores department-specific information
3) Designed for middle management users
[Diagram: an EDW feeding multiple data marts (DM).]
Dependent data mart:
In a top-down approach, data mart development depends on the enterprise data warehouse; hence such data marts are known as dependent data marts.
Independent data mart:
In a bottom-up approach, data mart development is independent of the enterprise data warehouse; hence such data marts are known as independent data marts.
ODS (Operational Data Store):
Similarity with a data warehouse: it is also an integrated database.