Data Warehouse / ETL Testing:
Reasons for building a Data Warehouse:
1) Data is scattered at different places.
2) Data inconsistency.
3) OLTP data is volatile (data keeps on changing), so a separate non-volatile store is needed.
[Diagram: an OLTP system is application oriented; through extraction, transformation/integration, and load its data becomes subject oriented in the data warehouse. Examples: a Student Admission application loads into a Student subject; Current, Saving, and Checking account applications load into an Account subject; Order, Shipment, and Payment applications load into a Sales subject. The data warehouse itself is time variant, non-volatile, subject oriented, and integrated.]
Characteristic features of a Data Warehouse:
1) Time-Variant: A data warehouse is a time-variant database which supports the business needs of end users in comparing and analyzing the business across different time periods. This is also known as time series analysis.
[Chart: sales amount plotted per quarter and product (Q1 vs Q2) to show the variance of the data over time.]
2) Non-volatile:
i. Once data enters the data warehouse it is not changed or deleted.
ii. The data is only loaded and read; history is preserved for analysis.
3) Subject oriented: A data warehouse is a subject-oriented database which supports the business needs of department-specific users.
Ex: Sales, accounts, HR, students, loans, etc.
A subject is derived from multiple OLTP applications, organizing the data to meet a specific business functionality.
4) Integrated: A data warehouse is an integrated database which collects data from multiple OLTP databases.
[Diagram: data from an OLTP database in the U.K. and an OLTP database in India is extracted, integrated, and loaded into the SALES subject of the data warehouse.]
A data warehouse is a container that stores the business data.
Data warehousing: Data warehousing is the process of building a data warehouse. The process includes
i. Business requirement collection
ii. Database designing (dimensional modeling)
iii. ETL development (data acquisition)
iv. Report development (OLAP)
A business analyst and an onsite technical coordinator collect the business requirements and the technical requirements.
BRS (Business Requirement Specification): A BRS contains the business requirements collected by the business analyst.
SRS (Software Requirement Specification): An SRS contains the software and hardware requirements collected by senior technical people.
The process of designing the database is called data modeling or dimensional modeling. A database architect or data modeler designs the warehouse as a set of tables.
OLAP (Online Analytical Processing): OLAP is a technology which supports business managers in querying the data warehouse. An OLAP tool provides the gateway between the users and the data warehouse.
A data warehouse is also known as an OLAP database.
Ex: Cognos, Business Objects (BO)
Difference between OLTP database and Data Warehouse (DWH):
OLTP:
1) It is designed to support transaction processing
2) Volatile data
3) Current data
4) Detailed data
5) Designed for running the business
6) Normalization
7) Application oriented data
8) Designed for critical (day-to-day) operations
9) ER modeling
DWH:
1) It is designed to support decision making
2) Non-volatile data
3) Historical data
4) Summary data
5) Designed for analyzing the business
6) De-normalization
7) Subject oriented data
8) Designed for managerial operations
9) Dimensional modeling
Enterprise Data Warehouse objects: A relational database is defined as a collection of objects such as tables, views, procedures, macros, triggers, etc.
Table: A table is a two-dimensional object where data is stored in the form of rows and columns.
View: A view is like a window into one or more tables. It provides customized access to the base table(s) and allows:
1) Restricting which columns are visible from base tables
2) Restricting which rows are visible from base tables.
3) Combining rows and columns from several base tables.
It may be defined as a subset of rows of a table.
It may be defined as a subset of columns of a table.
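For example, a view can restrict both the rows and the columns of a base table (a minimal sketch; the customer table and its columns are illustrative):

-- A view exposing only selected columns and rows of a base table
-- (table and column names are illustrative)
CREATE VIEW v_male_customers AS
SELECT cid,
       cfname,
       clname
FROM   customer
WHERE  gender = 'M';

-- Users query the view exactly like a table
SELECT * FROM v_male_customers;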
Data warehouse RDBMS: The following relational databases can be used to build a data warehouse:
i. Oracle
ii. SQL Server
iii. IBM DB2
iv. Teradata
v. Greenplum
vi. Netezza
vii. Sybase
viii. Red Brick
ix. Informix
One of the best RDBMS for storing massive historical information with parallel storage and parallel retrieval is Teradata.
Data acquisition: It is the process of extracting the relevant business information, transforming the data into the required business format, and loading it into the target system.
Data acquisition is defined by the following processes:
1) Data extraction
2) Data transformation
3) Data loading
There are two types of ETL used to build data acquisition:
1) Code-based ETL
2) GUI-based ETL
Code-based ETL:
An ETL application can be developed using a programming language such as SQL or PL/SQL.
Ex: SAS Base, SAS/ACCESS, Teradata ETL utilities
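A minimal sketch of code-based ETL written in plain SQL, assuming a source table src_sales and a target table dwh_sales (both names are illustrative): the SELECT performs extraction, the expressions perform transformation, and the INSERT performs loading.

-- Extract from the source, transform, and load into the target in one statement
-- (src_sales and dwh_sales are assumed, illustrative tables)
INSERT INTO dwh_sales (sale_id, product, sale_amount)
SELECT sale_id,
       UPPER(product),          -- simple transformation
       qty * price              -- derived business value
FROM   src_sales
WHERE  sale_date >= DATE '2010-01-01';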
Teradata ETL utilities:
i. BTEQ
ii. FastLoad
iii. MultiLoad
iv. TPump
GUI-based ETL: An ETL application can be designed through a simple graphical user interface using point-and-click techniques.
Ex: Informatica, DataStage, Ab Initio, ODI (Oracle Data Integrator), Data Services, Data Manager, SSIS (SQL Server Integration Services).
Data extraction:
It is the process of reading the data from various types of source systems. The following types of sources are used for extraction:
1) ERP sources
i. SAP
ii. Oracle Applications
iii. J.D. Edwards
iv. PeopleSoft
2) File sources
i. XML files
ii. Flat files
3) Relational sources
i. Oracle
ii. SQL Server
iii. DB2
iv. Sybase
4) Legacy sources
i. Mainframes
ii. AS/400
iii. COBOL files, etc.
Data transformation:
It is the process of transforming and cleaning the data into the required business format.
The following data transformation activities take place in the staging area:
i. Data merging
ii. Data cleansing
iii. Data scrubbing
iv. Data aggregation
Staging:
Staging is a temporary storage area where the following activities take place:
i. Data merging
ii. Data cleansing
iii. Data scrubbing
iv. Data aggregation
Data cleansing:
It is the process of correcting inconsistencies and inaccuracies in the data.
[Diagram: in the staging area, source sales amounts ($1.10, $2.00, $3.76, $4.0) are rounded as part of data cleansing before being loaded into the target.]
Examples:
1) Removing duplicate records is data cleansing.
2) Handling records which contain nulls.
3) Removing unwanted spaces.
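A sketch of these cleansing rules in SQL, assuming an Oracle-style staging table stg_sales (the table and column names are illustrative):

-- 1) Remove duplicates: keep a single row per sale_id (uses Oracle ROWID)
DELETE FROM stg_sales s
WHERE  s.rowid NOT IN (SELECT MIN(s2.rowid)
                       FROM   stg_sales s2
                       GROUP  BY s2.sale_id);

-- 2) Handle records which contain nulls: replace a NULL amount with zero
UPDATE stg_sales
SET    sale_amount = 0
WHERE  sale_amount IS NULL;

-- 3) Remove unwanted leading/trailing spaces from the product name
UPDATE stg_sales
SET    product = TRIM(product);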
Data scrubbing:
It is the process of deriving new attributes.
An attribute is nothing but a table column.
[Diagram: the OLTP source Sales table (Sale id, Product, Price, QTY, Discount) is scrubbed into the DWH Sales Info table, which adds the derived attributes Sale amount = QTY * Price, Sales tax = QTY * Price * 0.15, and Profit.]
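A sketch of this derivation in SQL, following the figure (the staging and target table names are illustrative; the 15% tax rate comes from the figure):

-- Derive new attributes (sale_amount, sales_tax) from existing columns
INSERT INTO dwh_sales_info (sale_id, product, price, qty, discount,
                            sale_amount, sales_tax)
SELECT sale_id,
       product,
       price,
       qty,
       discount,
       qty * price        AS sale_amount,   -- derived attribute
       qty * price * 0.15 AS sales_tax      -- derived attribute (15% tax)
FROM   stg_sales;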
Data aggregation:
It is the process of summarizing detailed data into summary data in the staging area using aggregate functions such as SUM() or MAX().
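A sketch of data aggregation in SQL, assuming the same illustrative staging table stg_sales and a summary target dwh_sales_summary:

-- Summarize detailed staging rows into one summary row per product
INSERT INTO dwh_sales_summary (product, total_qty, total_revenue, max_sale)
SELECT product,
       SUM(qty),
       SUM(qty * price),
       MAX(qty * price)
FROM   stg_sales
GROUP  BY product;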
Data merging:
It is the process of integrating the data from multiple source systems.
There are two types of data merge operations that take place in the staging area:
1. Horizontal merging
2. Vertical merging
Horizontal merging:
It is the process of merging the records horizontally using joins.
[Diagram: source table S1 (Empno, Ename, Sal, Deptno; e.g. 2355, smith, 4000, 10) is joined with source table S2 (Deptno, Dname, Loc; e.g. 10, sales, texas) on Deptno in the staging area.]
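A sketch of horizontal merging in SQL, using the employee and department structures from the figure (src_emp and src_dept are illustrative table names):

-- Horizontal merge: combine employee and department records on deptno
SELECT e.empno,
       e.ename,
       e.sal,
       d.deptno,
       d.dname,
       d.loc
FROM   src_emp  e
JOIN   src_dept d ON e.deptno = d.deptno;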
Vertical merging:
It is the process of merging the records vertically when the two sources have the same metadata (UNION).
Metadata here means the data structure (the column names of the two tables are the same).
[Diagram: source table S1 (Empno, Ename, Sal, Deptno; e.g. 2355, smith, 4000, 10) and source table S2 with the same structure (e.g. 3255, allen, 2140) are combined with a UNION in the staging area.]
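A sketch of vertical merging in SQL, assuming the two sources S1 and S2 have the same structure (src_emp_s1 and src_emp_s2 are illustrative table names):

-- Vertical merge: stack the rows of two sources with the same metadata
SELECT empno, ename, sal, deptno FROM src_emp_s1
UNION
SELECT empno, ename, sal, deptno FROM src_emp_s2;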
Data loading:
It is the process of inserting the data into a target system. There are two types of data loads:
1) Initial load or full load
2) Incremental load or delta load
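A sketch contrasting the two load types in SQL, assuming illustrative staging and target tables and a sale_date column used to detect new rows:

-- Initial (full) load: move everything from staging to the target the first time
INSERT INTO dwh_sales
SELECT * FROM stg_sales;

-- Incremental (delta) load: move only rows newer than the last loaded date
INSERT INTO dwh_sales
SELECT *
FROM   stg_sales s
WHERE  s.sale_date > (SELECT MAX(t.sale_date) FROM dwh_sales t);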
ETL client server technology:
An ETL plan defines extraction, transformation, and loading. An ETL plan is designed with the following types of metadata:
1) Source definition: the structure of the source table from which data is extracted.
2) Target definition: the structure of the target table to which data is loaded.
3) Transformation rule: the business logic used for processing the data.
Metadata:
It defines the structure of the data, represented as column name, data type, precision, scale, and keys (primary key, foreign key).
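For example, the metadata of the Customer table shown in the diagram below can be written as a table definition (a sketch using the figure's own column names and Oracle data types):

-- Metadata: column names, data types, precision, and keys
CREATE TABLE customer (
    cid     NUMBER(4)    PRIMARY KEY,   -- precision 4, primary key
    cfname  VARCHAR2(6),
    clname  VARCHAR2(6),
    gender  VARCHAR2(6)
);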
[Diagram: an ETL plan stored in the repository maps the source definition Customer (CID number(4) PK, CFName varchar2(6), CLName varchar2(6), Gender varchar2(6)) through business logic (B-Logic) to the target definition T-Customer (CID number(4), CName varchar2(6), Gender varchar2(6)); the ETL server executes the plan.]
ETL client:
An ETL client is a graphical user interface component where an ETL developer designs ETL plans.
ETL repository:
An ETL repository is the brain of an ETL system, where metadata such as ETL plans is stored.
ETL server: It is the ETL engine which performs extraction, transformation, and loading.
In Informatica an ETL plan is called a mapping.
In DataStage it is called a job.
In Ab Initio it is called a graph.
Extra:
Initial load or full load:
When data is extracted from the source system and loaded into the target system for the first time, the records are inserted directly into the target system.
[Diagram: the OLTP CUSTOMER table (CID, CNAME) is loaded into the DWH target table (TCID, ...).]
The data type length in the target system must be the same as or greater than the source length; it must not be less than the source length.
Data warehouse database design:
A data warehouse is designed with the following types of schema:
1. Star schema
2. Snowflake schema
3. Galaxy schema (also called constellation schema, integrated schema, hybrid schema, or multi-star schema)
The process of designing the database is known as data modeling.
A database architect (or data modeler) creates database schemas using a GUI-based database design tool called ERwin, a product of Computer Associates.
1) Star schema:
A star schema is a database design which contains a centrally located fact table surrounded by dimension tables.
Since the database design looks like a star, it is called a star schema design.
In a data warehouse, facts are numeric. A fact table contains facts.
Not every numeric value is a fact; only numeric values which are key performance indicators are known as facts.
Facts are business measures which are used to evaluate the performance of an enterprise.
A fact table contains the facts at the lowest level of granularity.
The fact granularity determines the level of detail.
A dimension is descriptive data which describes the key performance indicators known as facts.
A dimension table contains de-normalized data.
A fact table contains normalized data.
A fact table contains a composite key in which each component key is a foreign key to a dimension table.
A dimension provides answers to the following business questions: 1) Who 2) What 3) When 4) Where.
[Diagram: star schema example. The Sale-Transaction fact table (Transaction_ID PK, Date_Key FK, Market_Key FK, Product_Key FK, Customer_Key FK, QTY, Revenue, Profit) is surrounded by Dim-Customer (Customer_Key PK, Name, Address, Phone), Dim-Time (Date_Key PK, Year, Quarter, Month, Week, Day), Dim-Product (Product_Key PK, Category, Sub-category, Product), and Dim-Market (Market_Key PK, Market Code, Market Name).]
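A sketch of part of the star schema in the figure as table definitions (abridged to two dimensions; the data type sizes are illustrative):

-- Dimension tables: de-normalized descriptive data
CREATE TABLE dim_customer (
    customer_key NUMBER(10)    PRIMARY KEY,
    name         VARCHAR2(50),
    address      VARCHAR2(100),
    phone        VARCHAR2(20)
);

CREATE TABLE dim_product (
    product_key  NUMBER(10)    PRIMARY KEY,
    category     VARCHAR2(30),
    sub_category VARCHAR2(30),
    product      VARCHAR2(50)
);

-- Fact table: facts (qty, revenue, profit) plus foreign keys to the dimensions
CREATE TABLE sale_transaction (
    transaction_id NUMBER(10) PRIMARY KEY,
    customer_key   NUMBER(10) REFERENCES dim_customer (customer_key),
    product_key    NUMBER(10) REFERENCES dim_product (product_key),
    qty            NUMBER(10),
    revenue        NUMBER(12,2),
    profit         NUMBER(12,2)
);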
2) Snowflake schema:
[Diagram: a fact table surrounded by dimensions D1, D2, D3, D4, where dimension D4 is further split (snowflaked) into D41 and D42.]
Dimension D4 is split into dimension D41 and dimension D42; D42 is a parent table and the others are child tables.
Extra:
A schema is nothing but an arrangement of tables.
KPI means Key Performance Indicator. Ex: QTY, Revenue, Profit, Gross Profit.
A numeric value which plays a key role in analyzing and estimating the business of an enterprise in the data warehouse is a fact.
A fact table contains facts. Facts are numeric.
Not every numeric value is a fact.
Facts are business measures because they estimate the business performance.
A dimension is descriptive data.
Facts are analyzed by descriptive data; that descriptive data is called a dimension.
A mapping is an ETL plan which defines extraction, transformation, and loading.
[Example fact table row: Customer = BSR, City = HYD, Date = 3/11/2010, Product = LG LED are descriptive data; QTY = 1, Revenue = $250, Profit = $50 are facts.]
[Diagram: through dimensional modeling, the OLTP Customer table (Customer_Id, c-name, c-address, c-phone) and Employee table (Emp_Id, Emp-name, Emp-address, Emp-phone) become the Dim-Customer and Dim-Employee dimension tables in the DWH.]
Extra: Difference between star and snowflake schema:
A star schema is de-normalized, which means duplicate records are maintained.
In a snowflake schema the split tables are normalized, which means duplicate records are not maintained.
A snowflake schema is used to reduce table space.
Galaxy schema:
An integration of star and snowflake schemas is called a galaxy schema.
Common (conformed, reusable) dimensions shared by multiple fact tables are called conformed dimensions.
A constellation is formed by joining two fact tables.
A galaxy is a schema which contains multiple schemas.
A key performance indicator is called a fact.
A fact table without any facts is called a factless fact table.
Dimensions which cannot be used to describe facts are called junk dimensions.
[Diagram: a fact table (Customer FK, Employee FK, QTY, Revenue) analyzed through the Customer dimension (c-id, c-name), with an example junk dimension table holding c-id = 101, c-name = bsr, address = s.r.nagar, Product = t.v, QTY = 10, Revenue = 7500.]
Slowly changing dimension:
A slowly changing dimension is a dimension whose attribute values change slowly over time.
[Diagram: in Dim-Customer, customer 101 (name BSR) changes address over time (DSNR, rjnr, kp); each change is stored as a new row with a new surrogate Customer_Key (1, 2, 3) while the natural key C-ID 101 repeats.]
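A sketch of storing such a change as a Type 2 slowly changing dimension in SQL (the current_flag column is an assumption; the figure only shows the surrogate key Customer_Key and the repeated natural key C-ID):

-- Close the current row for customer 101 (assumed current_flag column)
UPDATE dim_customer
SET    current_flag = 'N'
WHERE  c_id = 101
AND    current_flag = 'Y';

-- Insert a new row with a new surrogate key and the changed address
INSERT INTO dim_customer (customer_key, c_id, c_name, c_address, current_flag)
VALUES (3, 101, 'BSR', 'kp', 'Y');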
Types of OLAP:
1. DOLAP (Desktop OLAP):
An OLAP which queries data from a database built using desktop databases like dBase, FoxPro, Clipper, etc.
XML files, txt files, and Excel files are also desktop data sources.
2. ROLAP (Relational OLAP):
ROLAP is used to query the data from relational sources like SQL Server, Oracle, Sybase, and Teradata.
3. MOLAP (Multidimensional OLAP):
It is used to query the data from multidimensional sources like cubes and DMR.
4. HOLAP (Hybrid OLAP):
It is a combination of ROLAP and MOLAP.
Data mart and types of data mart:
A data mart is a subject-oriented database which supports the business needs of department-specific business managers.
Or:
A data mart is a subset of an enterprise data warehouse. A data mart is also known as a high performance query structure (HPQS).
There are two types of data marts:
1. Dependent data marts
2. Independent data marts
The enterprise is an integration of various departments; an integration of multiple data marts forms the enterprise data warehouse.
Difference between enterprise data warehouse and data mart:
EDW:
1) It is an integration of multiple subjects
2) It stores enterprise-specific business information
3) Designed for top management (CEO, board of directors)
Data mart:
1) It defines a single subject
2) It stores department-specific information
3) Designed for middle management users
[Diagram: an EDW feeding multiple data marts (DM).]
Dependent data mart:
In a top-down approach, data mart development depends on the enterprise data warehouse; hence such data marts are known as dependent data marts.
Independent data mart:
In a bottom-up approach, data mart development is independent of the enterprise data warehouse; hence such data marts are known as independent data marts.
ODS (Operational Data Store):
Similarity with a data warehouse: it is also an integrated database.