DWH Architecture & Lifecycle: Muchake Brian Tel: 0701178573
Muchake Brian
Email: bmuchake@gmail.com
Tel: 0701178573
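[Figure: Data warehouse architecture. Operational DBs and other sources are extracted,
transformed, loaded and refreshed (ETL) into the Data Warehouse, with a Monitor &
Integrator and Metadata alongside it. The warehouse feeds the Data Marts and an OLAP
Server, which serve Analysis, Query/Reports and Data mining. Bottom labels: Operational
system, ETL, Data Mart, Summarized reports.]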
Components of a DWH
Source System
An operational system whose function is to capture the transactions of the business.
Often called a ‘legacy system’ in a mainframe environment.
Source systems may be in the form of data existing in:
1) Production operational systems
2) Archives
3) External data from outside the company, such as external statistics relating to the
company's business, market share data of competitors, and financial indicator data for
performance monitoring.
Source systems provide the input for the data staging area.
Components of a DWH [Cont’d]
Data Staging Area:
Covers a major process made up of three sub-processes: extracting, transforming and
loading (ETL).
Extracting: Reading and understanding the source data, and copying the parts that
are needed to the data staging area for further work.
Transforming: Once the data is extracted into the staging area, transformation takes
place:
1) Cleaning the data by correcting misspellings, resolving domain conflicts (e.g.
Employee and staff, customer and subscriber).
2) Creating surrogate keys for each dimension record in order to avoid a
dependence on legacy-defined keys. The surrogate key generation process also
enforces referential integrity between the dimension tables and the fact
tables.
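The two transformation steps above can be illustrated with a minimal Python sketch. The synonym map, the record layout and the class name SurrogateKeyGenerator are assumptions made for illustration only; they are not part of any particular ETL tool.

```python
from itertools import count

# Hypothetical synonym map used to resolve domain conflicts during cleaning.
DOMAIN_SYNONYMS = {"staff": "employee", "subscriber": "customer"}


def clean_role(raw):
    """Normalize case and whitespace, then map synonyms to one standard term."""
    term = raw.strip().lower()
    return DOMAIN_SYNONYMS.get(term, term)


class SurrogateKeyGenerator:
    """Assigns warehouse-owned integer keys to legacy (natural) keys."""

    def __init__(self):
        self._next = count(1)
        self._by_natural_key = {}

    def key_for(self, natural_key):
        # Reuse the existing surrogate key, or mint a new one.
        if natural_key not in self._by_natural_key:
            self._by_natural_key[natural_key] = next(self._next)
        return self._by_natural_key[natural_key]

    def lookup(self, natural_key):
        # Fact rows may only reference dimension rows that already exist;
        # a KeyError here signals a referential-integrity violation.
        return self._by_natural_key[natural_key]


keys = SurrogateKeyGenerator()
source_rows = [
    {"legacy_id": "CUST-0017", "role": " Subscriber "},
    {"legacy_id": "CUST-0042", "role": "customer"},
]
customer_dim = [
    {
        "customer_sk": keys.key_for(r["legacy_id"]),
        "legacy_id": r["legacy_id"],
        "role": clean_role(r["role"]),
    }
    for r in source_rows
]
# The fact row stores the surrogate key, not the legacy key.
sale_fact = {"customer_sk": keys.lookup("CUST-0042"), "amount": 250.0}
```

Because the fact row is built through lookup(), a fact that refers to an unknown customer fails at load time instead of silently entering the warehouse.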
Components of a DWH [Cont’d]
At the end of transformation, there is a collection of integrated data that is clean,
standardized, and summarized.
Loading: Once transformation is complete, the dimension and fact tables are populated.
There are two kinds of load:
Initial Load (populating all the data warehouse tables for the first time)
Incremental Load (applying ongoing changes as necessary in a periodic manner)
Refreshing: Involves propagating updates from the data sources to the warehouse.
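The difference between an initial and an incremental load can be sketched as follows. This is a simplified illustration: the updated_at field, the integer timestamps and the "watermark" variable are assumptions, and a real warehouse would load into database tables rather than a dictionary.

```python
def initial_load(source_rows):
    """Populate an empty warehouse table with every source row."""
    return {row["id"]: dict(row) for row in source_rows}


def incremental_load(warehouse, source_rows, last_loaded_at):
    """Apply only rows changed since the previous load (insert or update)."""
    changed = [r for r in source_rows if r["updated_at"] > last_loaded_at]
    for row in changed:
        warehouse[row["id"]] = dict(row)
    # The new watermark is the latest change seen so far.
    return max([last_loaded_at] + [r["updated_at"] for r in changed])


source = [
    {"id": 1, "name": "Alice", "updated_at": 1},
    {"id": 2, "name": "Bob", "updated_at": 2},
]
warehouse = initial_load(source)
watermark = 2

# Later, the source changes: one row is updated and one is added.
source[1] = {"id": 2, "name": "Robert", "updated_at": 5}
source.append({"id": 3, "name": "Carol", "updated_at": 6})
watermark = incremental_load(warehouse, source, watermark)
```

Running incremental_load on a schedule is one simple way to refresh the warehouse without reloading unchanged rows.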
Components of a DWH [Cont’d]
Why ETL?
To ensure that data resident in the DWH is:
Relevant and useful to the business users.
Of high quality, so that the information derived from it can be trusted
Building the ETL is the biggest task of building a warehouse; it is complex and time
consuming. In many implementations, it can take more than half of the total warehouse
implementation effort.
Components of a DWH [Cont’d]
Data storage Component (DWH and Data Marts)
A Data Mart is a subset of a DWH's facts and summary data that provides users with
information specific to their requirements.
Scope
A DWH deals with multiple subject areas and is implemented at the enterprise level,
while a data mart is implemented at the departmental level.
Implementation
A data mart takes less time to implement and is easier to build and maintain than a
DWH. It can be used as a “proof of concept” step towards building an enterprise warehouse.
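To make the "subset of facts and summary data" idea concrete, here is a small Python sketch that derives a departmental mart from enterprise-level fact rows. The row layout, the department names and the function build_data_mart are hypothetical.

```python
from collections import defaultdict

# Hypothetical enterprise-level fact rows covering several subject areas.
warehouse_facts = [
    {"dept": "HR", "month": "2024-01", "measure": "payroll", "amount": 1200.0},
    {"dept": "HR", "month": "2024-01", "measure": "payroll", "amount": 800.0},
    {"dept": "Sales", "month": "2024-01", "measure": "revenue", "amount": 5000.0},
    {"dept": "HR", "month": "2024-02", "measure": "payroll", "amount": 2100.0},
]


def build_data_mart(facts, dept):
    """Return one department's facts plus a per-month summary of them."""
    subset = [f for f in facts if f["dept"] == dept]
    summary = defaultdict(float)
    for f in subset:
        summary[f["month"]] += f["amount"]
    return {"facts": subset, "monthly_totals": dict(summary)}


hr_mart = build_data_mart(warehouse_facts, "HR")
```

The mart holds only the HR rows and their monthly totals, which is why it is smaller and quicker to build than the full warehouse.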
Data Warehouses and Data Marts
[Figure: data warehouse and data marts. Surviving labels: Human Resources Dept, Data Mart 3.]
Data Marts Dependent/Independent
Dependent (used in the Top-Down Approach):
Created from warehouse
Replicated (functional subset of warehouse)
Independent (used in the Bottom-Up Approach):
Scaled down, less expensive version of data warehouse
Designed for a department or BU
Org may have multiple data marts (difficult to integrate)
[Figure: two data-flow diagrams. Dependent: Sources 1-3 feed the Enterprise Data
Warehouse, which feeds Data Marts 1-3. Independent: Sources 1-3 feed Data Marts 1-3
directly; a Data Warehouse also appears in this diagram.]
Components of a DWH [Cont’d]
Information Delivery Component
This component lets users subscribe to data warehouse information and have it
delivered to one or more destinations according to a user-specified scheduling
algorithm. Delivery methods include:
1. Ad-hoc Reports
2. Complex Query tools
3. Multidimensional analysis
4. OLAP (Online Analytical Processing)
5. Data Mining
6. EIS (Executive Information Systems)
7. Web Browsers
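As an illustration of multidimensional analysis (items 3 and 4 above), the sketch below sums a measure over any chosen combination of dimensions, with an optional filter. The sales rows and the aggregate function are made up for this example; a real OLAP server does this over far larger data with precomputed structures.

```python
from collections import defaultdict

# Hypothetical sales facts with three dimensions and one measure.
sales = [
    {"region": "East", "product": "Tea", "quarter": "Q1", "units": 10},
    {"region": "East", "product": "Coffee", "quarter": "Q1", "units": 4},
    {"region": "West", "product": "Tea", "quarter": "Q1", "units": 7},
    {"region": "East", "product": "Tea", "quarter": "Q2", "units": 12},
]


def aggregate(facts, dimensions, measure="units", where=None):
    """Sum a measure grouped by the given dimensions, after an optional filter."""
    where = where or {}
    totals = defaultdict(int)
    for row in facts:
        if all(row[d] == v for d, v in where.items()):
            totals[tuple(row[d] for d in dimensions)] += row[measure]
    return dict(totals)


# Summarize over product and quarter, keeping only the region dimension.
by_region = aggregate(sales, ["region"])
# Restrict to one product, then view the result by quarter.
tea_by_quarter = aggregate(sales, ["quarter"], where={"product": "Tea"})
```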
Components of a DWH [Cont’d]
Metadata Component:
Metadata is data about data; here, it describes the data warehouse. It is used for
building, maintaining, managing and using the data warehouse, and is classified into:
Technical Metadata: contains information about warehouse data for use by
warehouse designers and administrators when carrying out warehouse
development and management tasks
Operational (Data about the operational data sources)
Extraction and Transformation (Information about all the data transformations
conducted at the staging area)
Business Metadata: contains information that gives users an easy-to-understand
perspective of the information stored in the data warehouse
Metadata
Metadata is simply defined as data about data: data that is used to represent other data.
In other words, metadata is the summarized data that leads us to the detailed data.
Metadata is a road map to the data warehouse.
Metadata in a data warehouse defines the warehouse objects.
Metadata acts as a directory. This directory helps the decision support system to
locate the contents of a data warehouse.
Metadata Repository
The metadata repository is an integral part of a data warehouse system. It contains the
following metadata:
Business metadata: contains the data ownership information, business definitions,
and changing policies.
Operational metadata: includes currency of data and data lineage. Currency of data
refers to whether the data is active, archived, or purged. Lineage of data means the
history of the data as it was migrated and the transformations applied to it.
Mapping metadata: data for mapping from the operational environment to the data
warehouse. This includes source databases and their contents, data extraction, data
partitioning, cleaning, transformation rules, and data refresh and purging rules.
Algorithm metadata: the algorithms used for summarization. It includes dimension
algorithms and data on granularity, aggregation, summarizing, etc.
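One way to picture a repository entry is as a record that groups the kinds of metadata listed above. The following Python sketch is illustrative only: the field names, the sample table sales_fact and the source name orders_db are assumptions, not a standard repository schema.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """One repository entry describing a warehouse table (illustrative fields)."""
    table: str
    # Business metadata
    owner: str
    business_definition: str
    # Operational metadata
    currency: str  # "active", "archived", or "purged"
    lineage: list = field(default_factory=list)
    # Mapping metadata: warehouse column -> source column
    source_mapping: dict = field(default_factory=dict)


repository = {}


def register(entry):
    """Store an entry in the repository under its table name."""
    repository[entry.table] = entry


register(TableMetadata(
    table="sales_fact",
    owner="Finance",
    business_definition="One row per sold item per transaction",
    currency="active",
    lineage=["extracted from orders_db.order_lines", "amounts rounded to 2 decimals"],
    source_mapping={"amount": "orders_db.order_lines.line_total"},
))


def locate(column):
    """Act as a directory: find which tables hold a column and where it came from."""
    return {
        name: entry.source_mapping[column]
        for name, entry in repository.items()
        if column in entry.source_mapping
    }
```

A call such as locate("amount") shows the directory role of metadata: it tells a user or tool where a piece of warehouse data lives and which source it was mapped from.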
Components of a DWH [Cont’d]
Management and Control Component:
This component coordinates the services and activities within the data warehouse. Its
source of information is the metadata.
Security and priority management
Monitoring updates from the multiple sources
Data quality checks
Managing and updating metadata
Auditing and reporting data warehouse usage and status
Purging data, replicating, sub-setting and distributing data
Backup & recovery and data warehouse storage management
Approaches to Building Data Warehouses [Cont’d]
Top-down (Bill Inmon)
Extract data from the operational systems
Transform, clean, integrate and store the data in the Data Warehouse
Derive the respective Data Marts
Bottom-up (Ralph Kimball)
Build individual Data Marts one by one from the operational systems
Combine the Data Marts into a Data Warehouse
Hybrid
A compromise that weighs the advantages and disadvantages of both the Top-down
and Bottom-up approaches
Approaches to Building Data Warehouses [Cont’d]
Top-down Approach
Advantages:
Enterprise view of data
Inherently architected, not just a union of disparate Data Marts
Single, central storage of data about the content
Centralized rules and control
May see quick results if implemented with iterations
Disadvantages:
Takes longer to build (even with an iterative method)
High exposure to failure
Needs a high level of cross-functional skills
High outlay without proof of concept
Approaches to Building DWH [Cont’d]
Hybrid Approach
A compromise that weighs the advantages and disadvantages of both the Top-down
and Bottom-up approaches.
Accommodates both views.
Benefits to an organization
Enterprise long-term goals can easily be achieved
Fast Data Marts
Proof of concept
Approaches to Building Data Warehouses [Cont’d]
Steps (Hybrid Approach)
1. Plan and define requirements at the overall corporate level
2. Create a surrounding architecture for a complete warehouse
3. Conform and Standardize the data content (Data types, Field lengths, precision,
and semantics)
4. Implement the Data Warehouse as a series of super marts (carefully architected
Data Marts), one at a time
Integrating Heterogeneous Databases
To integrate heterogeneous databases, we have two approaches:
1) Query-driven Approach
2) Update-driven Approach
Query-Driven Approach
This is the traditional approach to integrating heterogeneous databases.
It builds wrappers and integrators on top of multiple heterogeneous databases.
These integrators are also known as mediators.
Process of Query-Driven Approach
When a query is issued at the client side, a metadata dictionary translates the query into
an appropriate form for each of the heterogeneous sites involved.
These queries are then mapped and sent to the local query processors.
The results from heterogeneous sites are integrated into a global answer set.
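The query-driven process can be sketched with two small Python classes. This is a toy model under stated assumptions: the class names Wrapper and Mediator, the field maps (standing in for the metadata dictionary) and the sample sources are invented for illustration.

```python
class Wrapper:
    """Adapts one heterogeneous source to a common query interface."""

    def __init__(self, rows, field_map):
        self.rows = rows            # the source's local data
        self.field_map = field_map  # global field name -> local field name

    def query(self, field, value):
        # Translate the global field name into this source's local name,
        # run the query locally, and return rows in the global vocabulary.
        local = self.field_map[field]
        reverse = {v: k for k, v in self.field_map.items()}
        return [
            {reverse[k]: v for k, v in row.items()}
            for row in self.rows if row[local] == value
        ]


class Mediator:
    """Sends each query to every source and merges the answers."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, field, value):
        answer = []
        for wrapper in self.wrappers:
            answer.extend(wrapper.query(field, value))
        return answer


crm = Wrapper([{"cust_name": "Asha", "town": "Jinja"}],
              {"name": "cust_name", "city": "town"})
billing = Wrapper([{"subscriber": "Okello", "location": "Jinja"},
                   {"subscriber": "Nana", "location": "Gulu"}],
                  {"name": "subscriber", "city": "location"})
mediator = Mediator([crm, billing])
in_jinja = mediator.query("city", "Jinja")
```

Note that every call to mediator.query contacts every source again, which is the root of the cost of this approach for frequent queries.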
Integrating Heterogeneous Databases [Cont’d]
Disadvantages of the Query-Driven Approach
It needs complex integration and filtering processes.
It is inefficient, because every query is translated and run against the sources when
it is issued.
It is very expensive for frequent queries.
It is also very expensive for queries that require aggregations.
Update-Driven Approach
This approach provides high performance.
The data is copied, processed, integrated, annotated, summarized and restructured in
a semantic data store in advance.
Query processing does not require an interface to process data at local sources.
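A minimal sketch of the update-driven idea, assuming two made-up local sources with different field names: integration and summarization happen once in refresh_store, and queries are then answered from the prepared store alone.

```python
from collections import defaultdict

# Hypothetical local sources with differing field names.
source_a = [{"city": "Jinja", "sales": 100}, {"city": "Gulu", "sales": 40}]
source_b = [{"town": "Jinja", "total": 60}]


def refresh_store():
    """Integrate and summarize source data ahead of any query."""
    store = defaultdict(int)
    for row in source_a:
        store[row["city"]] += row["sales"]
    for row in source_b:
        store[row["town"]] += row["total"]
    return dict(store)


# Done in advance, for example on a schedule.
sales_by_city = refresh_store()


def query_sales(city):
    # Answered from the integrated store; the sources are not contacted.
    return sales_by_city.get(city, 0)
```

The trade-off is freshness: the store reflects the sources only as of the last refresh.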
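DWH Lifecycle Overview
[Figure: Kimball Lifecycle diagram. Project Planning leads to Business Requirement
Definition, which feeds three parallel tracks: (1) Technical Architecture Design, then
Product Selection & Installation; (2) Dimensional Modeling, then Physical Design, then
Data Staging Design & Development; (3) End-User Application Specification, then End-User
Application Development. The tracks converge at Deployment, followed by Maintenance and
Growth. Project Management spans the whole lifecycle.]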
Program/Project Planning
Kimball’s view of programs and projects
A project refers to a single iteration of the Kimball Lifecycle, from launch through
deployment.
A program refers to the broader, ongoing coordination of resources, infrastructure,
timelines, and communication across multiple projects.
A program contains multiple projects.
In the real world, programs do not necessarily start before projects, although ideally
they should.
Program/Project Planning [Cont’d]
Project planning
Scope definition (understanding the business requirements)
Task identification
Scheduling
Resource planning
Workload assignment
The end document represents a blueprint of the project
Project management
Enforces the project plan
Activities:
Status monitoring
Issue tracking
Development of a comprehensive communication plan that addresses both the
business and IT units
Business Requirements Definition
Success of the project depends on a solid understanding of the business
requirements!!!
Understanding the key factors driving the business is crucial for successful translation
of the business requirements into design considerations
What Follows the Business Requirements Definition?
Three concurrent tracks follow, focusing on:
1. Technology
2. Data
3. End user (BI) applications
Arrows in the diagram indicate the activity workflow along each of the parallel tracks
Dependencies between the tasks are illustrated by the vertical alignment of the task
boxes.
Technology Track
1. Technical Architecture Design
Overall architectural framework and vision
Considerations:
the business requirements
current technical environment
planned strategic technical directions
Technology Track
2. Product Selection and Installation
Based on the designed technical architecture
Evaluation and selection of products that will deliver the needed capabilities:
Hardware platform
Database management system
Extract-transform-load (ETL) tools
Data access query tools
Reporting tools
Installation of the selected products, components and tools
Testing of the installed products to ensure appropriate end-to-end integration
within the data warehouse environment.
Data Track
1. Design of the dimensional model
2. The physical design of the model
3. Extraction, transformation, and loading (ETL) of source data into the target models
(Data Staging design & development).
Dimensional Design
Detailed data analysis of a single business process is performed to identify the:
Fact tables
Associated dimensions and their attributes
Numeric facts
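The outcome of dimensional design for one business process can be sketched as a small star schema. The example below uses Python's built-in sqlite3 module with an in-memory database; the retail sales process, the table names and the columns are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_sk INTEGER PRIMARY KEY,
        name       TEXT,
        category   TEXT
    );
    CREATE TABLE dim_date (
        date_sk INTEGER PRIMARY KEY,
        day     TEXT,
        month   TEXT
    );
    CREATE TABLE fact_sales (
        product_sk INTEGER REFERENCES dim_product(product_sk),
        date_sk    INTEGER REFERENCES dim_date(date_sk),
        quantity   INTEGER,
        amount     REAL
    );
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "Tea", "Beverage"), (2, "Soap", "Household")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(1, "2024-01-15", "2024-01"), (2, "2024-02-03", "2024-02")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(1, 1, 3, 30.0), (2, 1, 1, 5.0), (1, 2, 2, 20.0)])

# Numeric facts are aggregated; dimension attributes label and group them.
rows = conn.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_sk = f.product_sk
    JOIN dim_date d ON d.date_sk = f.date_sk
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
```

Here fact_sales is the fact table, dim_product and dim_date are its dimensions with their attributes, and quantity and amount are the numeric facts.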
Physical Design
Defining the physical structures
Setting up the database environment
Setting up appropriate security
Preliminary performance tuning strategies, from indexing to partitioning and
aggregations
If appropriate, OLAP databases are also designed during this process.
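Two of the tuning strategies named above, indexing and aggregations, can be illustrated with Python's built-in sqlite3 module. The table and index names are hypothetical, and partitioning is left out because SQLite has no table partitioning feature.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (date_key TEXT, store TEXT, amount REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [("2024-01-15", "Kampala", 30.0),
                  ("2024-01-20", "Kampala", 12.5),
                  ("2024-02-03", "Jinja", 20.0)])

# Indexing: support lookups that filter on the date column.
conn.execute("CREATE INDEX idx_sales_date ON fact_sales(date_key)")

# Aggregation: precompute a monthly summary so reports avoid scanning detail rows.
conn.execute("""
    CREATE TABLE agg_sales_monthly AS
    SELECT substr(date_key, 1, 7) AS month, SUM(amount) AS total
    FROM fact_sales
    GROUP BY substr(date_key, 1, 7)
""")

monthly = conn.execute(
    "SELECT month, total FROM agg_sales_monthly ORDER BY month").fetchall()
index_names = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
```

The aggregate table must be rebuilt or updated whenever the detail table is loaded, which is part of the ETL and maintenance work described elsewhere in these slides.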
Data Staging (ETL) Design and Development
The MOST important stage
About 70% of the risk and effort in a DW project is attributed to this stage
ETL system capabilities:
Extraction
Cleansing and conforming
Delivery and management
ETL processes must be architected long before any data is extracted from the source
Kimball calls the ETL system the data warehouse “back room”
End User (Business Intelligence) Application Track
End user (BI) Application Design
Identify the candidate BI applications and appropriate navigation interfaces to
address the users’ needs and needed capabilities.
Produce BI application specification
End user (BI) Application Development
Configuration of the business metadata and tool infrastructure
Construction and validation of the specified analytic and operational BI
applications and the navigational portal
Deployment
Adequate planning is crucial to make sure that:
The results of the technology, data, and BI application tracks are tested and fit
together properly
Appropriate education and support infrastructure is in place
It is critical that deployment be well coordinated.
Deployment should be deferred if any of the pieces, such as training, documentation, and
validated data, are not ready for production release.
Maintenance
Occurs when the system is in production
Includes:
Technical operational tasks that are necessary to keep the system performing
optimally:
usage monitoring
performance tuning
index maintenance
system backup
Ongoing support, education, and communication with business users
Growth
DW systems tend to expand if they are successful
Growth is considered a sign of success
New requests need to be prioritized
Starting the cycle again
Building upon the foundation that has already been established
Focusing on the new requirements