
DATAWAREHOUSE / ETL TESTING

Reasons for building a data warehouse:

1) Data is scattered at different places.
2) Data inconsistency.
3) Operational data is volatile (it keeps on changing); analysis needs a non-volatile store.
4) Surrogate keys (artificial keys) are needed.

Data warehouse: Ralph Kimball

A data warehouse is a relational database management system which is specifically designed for business analysis and decision making to achieve the business goals.
A data warehouse is designed to support the decision-making process, hence it is called a decision support system (DSS).
A data warehouse is a historical database which stores the historical business information required for analysis.
A data warehouse is a read-only database which supports business managers in querying the data required for analysis, but not in business transaction processing.
A data warehouse is an integrated database which stores the data in an integrated format; the data is collected from multiple OLTP source systems.
Data warehouse: W. H. Inmon

A data warehouse is a
1) Time-variant
2) Non-volatile
3) Subject-oriented
4) Integrated database

[Diagram] OLTP system (application-oriented) to data warehouse (subject-oriented):
Student admission, Student fee details, Student examination, Student subjects
-> Extraction -> Transformation (integration) -> Load -> Student subject
Ex: Our business application

[Diagram] OLTP system: Current, Saving, Checking accounts
-> Extraction -> Transformation (integration) -> Load -> Account subject
Ex: Our sales application

[Diagram] OLTP system (application-oriented): Order application, Shipment application, User payments
-> Extraction -> Transformation (integration) -> Load -> Data warehouse: Sales subject
Characteristic features of a data warehouse:

1) Time-Variant: A data warehouse is a time-variant database which supports the business needs of end users in comparing and analyzing the business across different time periods. This is also known as time-series analysis.
[Diagram] Time dimension hierarchy: Year -> Quarter -> Month -> Week -> Day; e.g. sales amount plotted by quarter and product, showing the variance between Q1 and Q2.
2) Non-volatile:
i. A data warehouse is non-volatile.
ii. Once data has entered the data warehouse, it does not reflect the changes which take place in the operational database. Hence the data in the data warehouse is static.

3) Subject-oriented: A data warehouse is a subject-oriented database which supports the business needs of department-specific users.
Ex: sales, accounts, HR, students, loans, etc.
A subject is derived from multiple OLTP applications, which organize the data to meet a specific business functionality.
4) Integrated: A data warehouse is an integrated database which collects the data from multiple OLTP databases.

[Diagram] OLTP DB at UK and OLTP DB at India -> Extraction -> Integration -> Load -> DWH (Sales)

A data warehouse is a container that stores the business data.
Data warehousing: Data warehousing is the process of building a data warehouse. The process includes:
i. Business requirement analysis
ii. Database design
iii. ETL development and testing
iv. Report development and testing

A business analyst and an onsite technical coordinator gather the business requirements and the technical requirements.
BRS (Business Requirement Specification): A BRS contains the business requirements, which are collected by an analyst.
An SRS contains the software and hardware requirements, which are collected by senior technical people.
The process of designing the database is called data modeling or dimensional modeling. A database architect or data modeler designs the warehouse as a set of tables.
OLAP (Online Analytical Processing): OLAP is a technology which supports business managers in querying the data warehouse. OLAP provides the gateway between the users and the data warehouse.
A data warehouse is also known as an OLAP database.
Ex: Cognos, BO (Business Objects)

Differences between OLTP database and data warehouse:

OLTP                                          DWH
1) Designed to support business               1) Designed to support decision-making
   transaction processing                        processing
2) Volatile data                              2) Non-volatile data
3) Current data                               3) Historical data
4) Detailed data                              4) Summary data
5) Designed for running the business          5) Designed for analyzing the business
6) Normalization                              6) De-normalization
7) Application-oriented data                  7) Subject-oriented data
8) Designed for critical operations           8) Designed for managerial operations
9) ER modeling                                9) Dimensional modeling

Enterprise data warehousing objects: A relational database is defined as a collection of objects such as tables, views, procedures, macros, triggers, etc.
Table: A table is a two-dimensional object where the data is stored in the form of rows and columns.
View: A view is like a window into one or more tables; it provides customized access to the base tables by:
1) Restricting which columns are visible from the base tables.
2) Restricting which rows are visible from the base tables.
3) Combining rows and columns from several base tables.
A view may be defined on a subset of the rows of a table.
A view may be defined on a subset of the columns of a table.
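The three view capabilities above can be sketched in SQLite (the `emp` table, its columns, and its rows here are hypothetical illustration data, not from the notes):

```python
import sqlite3

# In-memory database with a hypothetical base table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (empno INTEGER, ename TEXT, sal INTEGER, deptno INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?, ?)",
                 [(2355, "smith", 4000, 10), (3255, "allen", 2140, 20)])

# A view restricting both the visible columns (salary hidden)
# and the visible rows (department 10 only).
conn.execute("CREATE VIEW dept10_emp AS SELECT empno, ename FROM emp WHERE deptno = 10")

rows = conn.execute("SELECT * FROM dept10_emp").fetchall()
print(rows)
```

Querying the view returns only the department-10 employee, without the salary column.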
Data warehouse RDBMS: The following relational databases can be used to build a data warehouse:
i. Oracle
ii. SQL Server
iii. IBM DB2
iv. Teradata
v. Greenplum
vi. Netezza
vii. Sybase
viii. Red Brick
ix. Informix

One of the best RDBMSs for storing massive historical information, with parallel storage and parallel retrieval, is Teradata.
Data acquisition: It is the process of extracting the relevant business information, transforming the data into the required business format, and loading it into the target system.
Data acquisition is defined with the following types of processes:

1) Data extraction
2) Data transformation
3) Data loading

There are two types of ETL used to build data acquisition:
1) Code-based ETL
2) GUI-based ETL
Code-based ETL:
An ETL application can be developed using a programming language such as SQL or PL/SQL.
Ex: SAS Base, SAS Access, Teradata ETL utilities
Teradata ETL utilities:
i. BTEQ
ii. FastLoad
iii. MultiLoad
iv. TPump

GUI-based ETL: An ETL application can be designed with a simple graphical user interface, using point-and-click techniques.
Ex: Informatica, DataStage, Ab Initio, ODI (Oracle Data Integrator), Data Services, Data Manager, SSIS (SQL Server Integration Services).
Data extraction:
It is the process of reading the data from various types of source systems. The following types of sources are used in extraction:
1) ERP sources
   i. SAP
   ii. Oracle Applications
   iii. J.D. Edwards
   iv. PeopleSoft
2) File sources
   i. XML files
   ii. Flat files
3) Relational sources
   i. Oracle
   ii. SQL Server
   iii. DB2
   iv. Sybase
4) Legacy sources
   i. Mainframes
   ii. AS/400
   iii. COBOL files, etc.

Data transformation:
It is the process of transforming and cleaning the data into the required business format.
The following data transformation activities take place in staging:
i. Data merging
ii. Data cleansing
iii. Data scrubbing
iv. Data aggregation

Staging:
Staging is a temporary storage area where the following activities take place:
i. Data merging
ii. Data cleansing
iii. Data scrubbing
iv. Data aggregation

Data cleansing:
It is the process of correcting inconsistencies and inaccuracies in the data.

Or

It is the process of removing unwanted data in staging.


[Example] Data cleansing in staging with ROUND() (two-decimal precision):

Source Sales Amount    Target Sales Amount
$ 1.016                $ 1.02
$ 2.00                 $ 2.00
$ 3.765                $ 3.77
$ 4.0                  $ 4.00
Data cleansing examples:
1) Removing duplicate records
2) Removing records which contain nulls
3) Removing unwanted spaces
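The three cleansing examples above, plus the rounding step, can be sketched as a minimal staging pass (the record layout and the sample values are hypothetical):

```python
# A minimal data-cleansing pass over staged records:
# drop nulls, strip spaces, round amounts, remove duplicates.
staged = [
    {"cust": " smith ", "amount": 1.016},
    {"cust": "allen",   "amount": 3.765},
    {"cust": None,      "amount": 2.00},   # null key -> rejected
    {"cust": " smith ", "amount": 1.016},  # duplicate -> removed
]

cleaned, seen = [], set()
for rec in staged:
    if rec["cust"] is None:           # 2) drop records which contain nulls
        continue
    cust = rec["cust"].strip()        # 3) remove unwanted spaces
    amount = round(rec["amount"], 2)  # round to the required precision
    if (cust, amount) in seen:        # 1) remove duplicates
        continue
    seen.add((cust, amount))
    cleaned.append({"cust": cust, "amount": amount})

print(cleaned)
```

In a real ETL tool these rules would be transformation logic; here they are plain Python for illustration.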
Data scrubbing:
It is the process of deriving new attributes.
An attribute is nothing but a table column.
[Diagram] Source system (OLTP) Sales: sale_id, product, price, qty, discount
-> DWH Sales_Info: sale_id, product, price, sale_amount, qty, discount, sale_tax, profit
Derived attributes: sale_amount = qty * price; sale_tax = qty * price * 0.15.
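The derivation above can be sketched directly; the sample values and the profit rule are hypothetical, while the sale_amount and sale_tax formulas follow the notes:

```python
# Data scrubbing: deriving new attributes (columns) from existing ones.
source = {"sale_id": 1, "product": "LG LED", "price": 250.0, "qty": 2, "discount": 0.0}

target = dict(source)
target["sale_amount"] = source["qty"] * source["price"]        # derived: qty * price
target["sale_tax"] = source["qty"] * source["price"] * 0.15    # derived: qty * price * 0.15
target["profit"] = target["sale_amount"] - target["sale_tax"]  # hypothetical profit rule

print(target["sale_amount"], target["sale_tax"])
```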

Data aggregation:
It is the process of summarizing detailed data into summary data, for example using aggregate functions such as SUM() or MAX() in staging.

[Diagram] Detailed data -> SUM() or MAX() (staging) -> Summary data
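A minimal sketch of detailed-to-summary aggregation, grouping hypothetical rows by product and computing SUM() and MAX():

```python
# Data aggregation: summarizing detailed staging rows into summary rows.
detailed = [
    ("tv", 100.0), ("tv", 250.0), ("fridge", 400.0),
]

summary = {}
for product, amount in detailed:
    total, peak = summary.get(product, (0.0, 0.0))
    summary[product] = (total + amount, max(peak, amount))  # SUM() and MAX() per group

print(summary)
```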
Data merging:
It is the process of integrating the data from multiple source systems.
There are two types of data merge operations that take place in staging:
1. Horizontal merging
2. Vertical merging
Horizontal merging:
It is the process of merging the records horizontally, using joins.

[Diagram] Horizontal merging (join) in staging:
S1 (empno, ename, sal, deptno): 2355, smith, 4000, 10
S2 (deptno, dname, loc): 10, sales, texas
S1 JOIN S2 -> (empno, ename, sal, deptno, dname, loc): 2355, smith, 4000, 10, sales, texas
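The S1/S2 join sketched above can be reproduced in SQLite (the column types are assumptions):

```python
import sqlite3

# Horizontal merging: join S1 and S2 on the common deptno key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE s1 (empno INTEGER, ename TEXT, sal INTEGER, deptno INTEGER)")
conn.execute("CREATE TABLE s2 (deptno INTEGER, dname TEXT, loc TEXT)")
conn.execute("INSERT INTO s1 VALUES (2355, 'smith', 4000, 10)")
conn.execute("INSERT INTO s2 VALUES (10, 'sales', 'texas')")

# Records merged horizontally: S1 columns followed by S2 columns.
merged = conn.execute("""
    SELECT s1.empno, s1.ename, s1.sal, s1.deptno, s2.dname, s2.loc
    FROM s1 JOIN s2 ON s1.deptno = s2.deptno
""").fetchall()
print(merged)
```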

Vertical merging:
It is the process of merging the records vertically, when the two sources have the same metadata (a union).
Metadata means the data structure (the tables have the same column names and types).
[Diagram] Vertical merging (union) in staging:
S1 (empno, ename, sal, deptno): 2355, smith, 4000, 10
S2 (empno, ename, sal, deptno): 3255, allen, 2140, 20
S1 UNION S2 -> (empno, ename, sal, deptno): 2355, smith, 4000, 10; 3255, allen, 2140, 20
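Because S1 and S2 share the same metadata, their rows can be stacked with UNION, as in this SQLite sketch (column types assumed):

```python
import sqlite3

# Vertical merging: S1 and S2 have identical structure, so UNION applies.
conn = sqlite3.connect(":memory:")
for t in ("s1", "s2"):
    conn.execute(f"CREATE TABLE {t} (empno INTEGER, ename TEXT, sal INTEGER, deptno INTEGER)")
conn.execute("INSERT INTO s1 VALUES (2355, 'smith', 4000, 10)")
conn.execute("INSERT INTO s2 VALUES (3255, 'allen', 2140, 20)")

unioned = conn.execute(
    "SELECT * FROM s1 UNION SELECT * FROM s2 ORDER BY empno"
).fetchall()
print(unioned)
```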

Data loading:
It is the process of inserting the data into a target system. There are two types of data load:
1) Initial load or full load
2) Incremental load or delta load
ETL client-server technology:
An ETL plan defines extraction, transformation, and loading. An ETL plan is designed with the following types of metadata:
1) Source definition: the structure of the source data or source table from which data is extracted.
2) Target definition: the structure of the target table into which data is loaded.
3) Transformation rule: the business logic used for processing the data.
Metadata:
Metadata defines the structure of the data, represented as column name, data type, precision, scale, and keys (primary key, foreign key).
[Diagram] ETL plan (stored in the repository):
Source definition Customer: CID number(4) PK, CFName varchar2(6), CLName varchar2(6), Gender varchar2(6)
-> business logic (CFName and CLName combined into CName) ->
Target definition T_Customer: CID number(4), CName varchar2(6), Gender varchar2(6)
ETL client:
An ETL client is a graphical user interface component where an ETL developer designs ETL plans.
ETL repository:
An ETL repository is the brain of an ETL system, where metadata such as ETL plans is stored.
ETL server: the ETL engine which performs extraction, transformation, and loading.
In Informatica an ETL plan is called a mapping.
In DataStage it is called a job.
In Ab Initio it is called a graph.
Extra:
Initial load (full load):
When we extract data from the source system and load it into the target system for the first time, all records are inserted directly into the target system.

[Diagram] OLTP CUSTOMER (CID, CNAME) -> DWH target table (TCID)
Incremental load (or delta load):

After the first load, we extract data from the source system and load into the target system only the newly entered records, as well as updates to existing records.
To design an ETL plan we need only metadata.
The data definition is what we call metadata.
With GUI-based tools we can extract the data from different databases (Ex: SQL, SQL Server, Sybase, Oracle) located at different source systems.
The length of a target column's data type must be the same as or greater than the source's; it must not be less than the source length.
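The incremental-load behavior described above (insert new records, update changed ones) can be sketched with hypothetical key/value records:

```python
# Incremental (delta) load sketch: insert new source records into the target
# and update records whose non-key attributes changed.
target = {101: "BSR", 102: "KP"}                 # cid -> name already loaded
source = {101: "BSR", 102: "KPR", 103: "SMITH"}  # 102 changed, 103 is new

inserted, updated = 0, 0
for cid, name in source.items():
    if cid not in target:
        target[cid] = name   # new record -> insert
        inserted += 1
    elif target[cid] != name:
        target[cid] = name   # changed record -> update
        updated += 1

print(inserted, updated)
```

An initial (full) load is the degenerate case where the target starts empty, so every source record is an insert.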
Data warehouse database design:
A data warehouse is designed with the following types of schema:
1. Star schema
2. Snowflake schema
3. Galaxy schema (constellation schema, integrated schema, hybrid schema, or multi-star schema)
The process of designing the database is known as data modeling.
A database architect (or data modeler) creates database schemas using a GUI-based database design tool called ERwin, a product of Computer Associates.
1) Star schema:
A star schema is a database design which contains a centrally located fact table surrounded by dimension tables.
Since the database design looks like a star, it is called a star schema.
In a data warehouse, facts are numeric. A fact table contains facts.
Not every numeric value is a fact, but numeric values which are key performance indicators are known as facts.
Facts are business measures which are used to evaluate the performance of an enterprise.
A fact table contains the facts at the lowest level of granularity.
The fact granularity determines the level of detail.
A dimension is descriptive data which describes the key performance indicators known as facts.
A dimension table contains de-normalized data.
A fact table contains normalized data.
A fact table contains a composite key, in which each component is a foreign key to a dimension table.
A dimension provides answers to the following business questions: 1) who 2) what 3) when 4) where

[Diagram] Star schema example (Dim means dimension):
Fact table Sale_Transaction: Transaction_ID (PK), Date_Key (FK), Market_Key (FK), Product_Key (FK), Customer_Key (FK), QTY, Revenue, Profit
Dim_Customer: Customer_Key (PK), Name, Address, Phone
Dim_Time: Date_Key (PK), Year, Quarter, Month, Week, Day
Dim_Product: Product_Key (PK), Category, Sub-category, Product
Dim_Market: Market_Key (PK), Market Code, Market Name
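A minimal slice of the star schema above can be built and queried in SQLite; only two of the four dimensions are included here, and all data values are hypothetical:

```python
import sqlite3

# A tiny star schema: one fact table with foreign keys into two dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product TEXT);
    CREATE TABLE dim_market  (market_key  INTEGER PRIMARY KEY, market_name TEXT);
    CREATE TABLE sale_transaction (
        transaction_id INTEGER PRIMARY KEY,
        product_key INTEGER REFERENCES dim_product(product_key),
        market_key  INTEGER REFERENCES dim_market(market_key),
        qty INTEGER, revenue REAL
    );
    INSERT INTO dim_product VALUES (1, 'LG LED');
    INSERT INTO dim_market  VALUES (1, 'HYD');
    INSERT INTO sale_transaction VALUES (1, 1, 1, 2, 500.0), (2, 1, 1, 1, 250.0);
""")

# Facts (qty, revenue) analyzed by descriptive dimension attributes.
report = conn.execute("""
    SELECT p.product, m.market_name, SUM(f.qty), SUM(f.revenue)
    FROM sale_transaction f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_market  m ON f.market_key  = m.market_key
    GROUP BY p.product, m.market_name
""").fetchall()
print(report)
```

The GROUP BY query is the typical star-schema access pattern: join the fact table out to its dimensions, then aggregate the facts.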


2) Snowflake schema: In a snowflake schema, a large dimension table is split into one or more tables along its hierarchy.
In a snowflake schema a dimension may have a parent table.
A large dimension table is split into one or more normalized tables (a decomposed dimension).

[Diagram] Snowflake: a fact table surrounded by dimensions D1, D2, D3, D4; D4 is split further.

Dimension D4 is split into dimension D41 and dimension D42. D42 is the parent table and the others are child tables.
Extra:
A schema is an arrangement of tables.
KPIs (key performance indicators): qty, revenue, profit, gross profit.
A numeric value which plays a key role in analyzing and estimating the business of the enterprise in the data warehouse is a fact.
A fact table contains facts. Facts are numeric.
Not every numeric value is a fact.
Facts are business measures, because they estimate the business performance.
A dimension is the descriptive data by which facts are analyzed.
Mapping is an ETL plan covering extraction, transformation, and loading.

[Example] Fact table row:

Customer   City   Date        Product   QTY   Revenue   Profit
BSR        HYD    3/11/2010   LG LED    1     $250      $50

Customer, city, date, and product are descriptive data (dimensions); QTY, revenue, and profit are facts.

Data warehouse - Galaxy schema:

A data warehouse can be designed as an integration of multiple star schemas, snowflake schemas, or both. A galaxy schema is also known as a hybrid schema or a constellation schema.
Fact constellation: the joining of two fact tables.
Conformed dimensions: a dimension table which can be shared by multiple fact tables is known as a conformed dimension.
Factless fact table:
A fact table without any facts is known as a factless fact table.
Junk dimensions:
A dimension which cannot be used to describe key performance indicators is known as a junk dimension.

Ex: phone number, fax number, customer address, etc.


Slowly changing dimensions:
A dimension which changes over a period of time is known as a slowly changing dimension (SCD). There are three types:
1. Type 1 dimension:
A Type 1 dimension stores only the current data in the target; it does not maintain any history.
2. Type 2 dimension:
A Type 2 dimension maintains the full history in the target; for each update it inserts a new record into the target.
3. Type 3 dimension:
A Type 3 dimension maintains partial history (current and previous information).
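The Type 2 behavior above (each update inserts a new row rather than overwriting) can be sketched with a hypothetical customer dimension:

```python
# Type 2 SCD sketch: every change inserts a new version row with a fresh
# surrogate key, so the full history is preserved.
dim_customer = []  # rows: (c_key, c_id, c_name, c_address)
next_key = 1

def apply_change(c_id, c_name, c_address):
    """Insert a new version row instead of overwriting the old one."""
    global next_key
    dim_customer.append((next_key, c_id, c_name, c_address))
    next_key += 1

apply_change(101, "BSR", "DSNR")   # initial load
apply_change(101, "BSR", "rjnr")   # address change -> new row, history kept
apply_change(101, "BSR", "KP")     # another change -> another row

history = [row for row in dim_customer if row[1] == 101]
print(len(history))
```

A Type 1 load would instead overwrite the single existing row; a Type 3 load would keep only the current and previous values in extra columns.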
Dirty dimensions:
In a dimension table, if a record exists more than once with a change to a non-key attribute, it is known as a dirty dimension.

Fact constellation:

[Diagram] Dimensional modeling from OLTP:
OLTP DB: Customer (customer_id, c_name, c_address, c_phone) and Employee (emp_id, emp_name, emp_address, emp_phone)
-> DWH dimensional modeling: Dim_Customer (customer_id, c_name, c_address, c_phone) and Dim_Employee (employee_id, e_name, e_address, e_phone)

Extra: Differences between star and snowflake schema:
A star schema is de-normalized, meaning duplicated records are maintained.
In a snowflake schema the split tables are normalized, meaning duplicate records are not maintained.
A snowflake schema is used to reduce table space.
Galaxy schema:
An integration of star and snowflake schemas is called a galaxy schema.
Common, conformed, or reusable dimensions are dimension tables shared by multiple fact tables.
A constellation is the joining of two fact tables.
A galaxy is a schema which can contain multiple schemas.
A key performance indicator is called a fact.
A fact table without any facts is called a factless fact table.
Dimensions which cannot be used to describe facts are junk dimensions.
[Diagram] Junk dimension example:
Fact table: Customer (FK), Employee (FK), QTY, Revenue - analyzed by Dim_Customer (c_id, c_name).
Main analysis columns, e.g. (product, c_name, QTY, Revenue) = (t.v, bsr, 10, 7500).
Junk dimension table: (c_id, address) = (101, s.r.nagar).
If a column is used in the analysis it belongs to the main dimension; otherwise it goes to a junk dimension.

A junk dimension provides additional information to the main dimension.


Common, conformed, or reusable dimensions:
Dimension tables which are shared by multiple fact tables are called conformed, reusable, or common dimensions.
Slowly changing dimension example:

[Diagram] Source rows for customer_id 101 (c_name BSR, c_phone kp) with a changing address: DSNR, then rjnr.
In the target Dim_Customer, a Type 2 load keeps one row per change with a surrogate Customer_Key (C-Key 1, 2, 3), all rows carrying the same natural key C-ID 101.

A surrogate key is an artificial key that is treated as a primary key.

A surrogate key is a system-generated sequential number that is treated as the primary key.
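A sketch of assigning system-generated sequential surrogate keys alongside the natural (business) keys; the natural keys and names here are hypothetical:

```python
import itertools

# Surrogate key sketch: a sequential number used as the primary key
# in place of the natural/business key.
seq = itertools.count(1)  # system-generated sequence

rows = []
for natural_key, name in [("C-101", "smith"), ("C-102", "allen")]:
    rows.append({"surrogate_key": next(seq), "natural_key": natural_key, "name": name})

print(rows)
```

In a real warehouse this sequence would typically be a database identity column or sequence object rather than application code.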
OLAP:
OLAP is a set of specifications or technologies which allows client applications to retrieve data from the data warehouse.

Or

OLAP is an interface or gateway between the user and the database.

Types of OLAP:
1. DOLAP (Desktop OLAP):
An OLAP which queries data from a database built using desktop databases such as dBase, FoxPro, Clipper, etc.
XML files, txt files, and Excel files are desktop data sources.
2. ROLAP (Relational OLAP):
ROLAP is used to query data from relational sources such as SQL Server, Oracle, Sybase, and Teradata.
3. MOLAP (Multidimensional OLAP):
MOLAP is used to query data from multidimensional sources such as cubes and DMR.
4. HOLAP (Hybrid OLAP):
HOLAP is a combination of ROLAP and MOLAP.
Data mart and types of data marts:
A data mart is a subject-oriented database which supports the business needs of department-specific business managers.

Or

A data mart is a subset of the enterprise data warehouse. A data mart is also known as a high-performance query structure (HPQS).
There are two types of data marts:
1. Dependent data marts
2. Independent data marts
An enterprise is an integration of various departments; an integration of multiple data marts forms the enterprise.
Difference between enterprise data warehouse and data mart:

EDW                                          Data mart
1) It is an integration of multiple          1) It defines a single subject
   subjects
2) It stores enterprise-specific             2) It stores department-specific
   business information                         information
3) Designed for top management (CEO,         3) Designed for middle-management
   board of directors)                          users

Top-down data warehousing approach (W. H. Inmon):

According to Inmon, first build an enterprise data warehouse (EDW); from the EDW, design subject-oriented, department-specific databases known as data marts.

[Diagram] EDW -> DM, DM
Bottom-up data warehousing approach (Ralph Kimball):

According to Kimball, design department-specific, subject-oriented databases as data marts, then integrate the data marts to define the enterprise data warehouse.

[Diagram] DM, DM -> EDW
Dependent data mart:
In the top-down approach, data mart development depends on the enterprise data warehouse; hence such data marts are known as dependent data marts.
Independent data mart:
In the bottom-up approach, data mart development is independent of the enterprise data warehouse; hence such data marts are known as independent data marts.
ODS (Operational Data Store) vs DWH (decision support system):

Similarity: both are integrated databases.

Differences:
ODS: 1) volatile data 2) current data 3) detailed data
DWH: 1) non-volatile data 2) historical data 3) summary data