Lecture DW 021
Data Warehousing
for BS(CS)
Course Material
Course Book
Paulraj Ponniah, Data Warehousing Fundamentals, John Wiley
Assignments
Implementation/Research on important concepts.
To be submitted in groups of 2 students.
Include
1. Modeling and Benchmarking of multiple warehouse schemas
2. Implementation of an efficient OLAP cube generation algorithm
3. Data cleansing and transformation of legacy data
4. Literature Review paper on
View Consistency Mechanisms in Data Warehouse
Index design optimization
Advanced DW Applications
May add a couple more
Lab Work
Course Introduction
What is this course about?
Decision Support Cycle
Planning – Designing – Developing – Optimizing – Utilizing
Course Introduction
Data Marts
Operational Sources (OLTP’s)
Operational computer systems provide information to run day-to-day
operations and answer daily questions, but…
Also called online transaction processing (OLTP) systems
Data is read or manipulated with each transaction
Transactions/queries are simple, and easy to write
Usually for middle management
Examples
Sales systems
Hotel reservation systems
COMSIS
HRM Applications
Etc.
Typical decision queries
Data sets are mounting everywhere, but they are not useful for decision
support
Decision-making requires complex questions answered from integrated data
Enterprise-wide data is desired
Decision makers want to know:
Where to build a new oil warehouse?
Which markets should they strengthen?
Which customer groups are most profitable?
What are the total sales by month/year/quarter for each office?
Is there any relation between promotion campaigns and sales growth?
Information crisis *
Integrated
Must have a single, enterprise-wide view
Data Integrity
Information must be accurate and must conform to business rules
Accessible
Easily accessible with intuitive access paths and responsive for analysis
Credible
Every business factor must have one and only one value
Timely
Information must be available within the stipulated time frame
* Paulraj 2001.
Data-Driven DSS *
OLTP vs. DSS

Trait            | OLTP                          | DSS
User             | Middle management             | Executives, decision-makers
Function         | Day-to-day operations         | Analysis & decision support
DB (modeling)    | E-R based, normalized         | Star-oriented schemas
Data             | Current, isolated             | Archived, derived, summarized
Unit of work     | Transactions                  | Complex queries
Access type      | DML, read                     | Read
Access frequency | Very high                     | Medium to low
Expectations of the new solution
DW meets expectations
Definition of DW
Inmon's definition:
“A DW is a subject-oriented, integrated, non-volatile, time-variant
collection of data in support of decision-making”.
Kelly said
“Separate, available, integrated, time-stamped, subject-oriented,
non-volatile, accessible”
Four properties of DW
Subject-oriented
In operational sources, data is organized by application or
business process
In a DW, the subject is the organization method
Subjects vary with the enterprise
These are the critical factors that affect performance
Example of Manufacturing Company
Sales
Shipment
Inventory etc
Integrated Data
Data comes from several applications
Problems of integration come into play
File layouts, encodings, field names, systems, schemas and data
heterogeneity are the issues
Bank example, variations: naming conventions, attributes for a data item,
account no, account type, size, currency
In addition to internal sources, there are external data sources
Data shared by external companies
Websites
Others
Inconsistencies must be removed
Hence the process of extraction, transformation & loading (ETL)
Time variant
Operational data has current values
Comparative analysis is one of the best techniques for business
performance evaluation
Time is critical factor for comparative analysis
Every data structure in DW contains time element
In order to promote a product, an analyst has to know
both current and historical values
The advantages are
Allows for analysis of the past
Relates information to the present
Enables forecasts for the future
Non-volatile
Data from operational systems are moved into DW after specific
intervals
Data is persistent / not removed, i.e., non-volatile
Individual business transactions do not update the DW
Data in the DW is not deleted
and is not changed by individual transactions
Properties summary
Khurram Shahzad
mks@ciitlahore.edu.pk
Agenda
Architecture of DW
Components
Major components
Source data component
Data staging component
Data storage component
Information delivery component
Metadata component
Management and control component
1. Source Data Components
Source data can be grouped into 4 components
Production data
Comes from operational systems of enterprise
Some segments are selected from it
Narrow scope, e.g. order details
Internal data
Private spreadsheets, documents, customer profiles, etc.
E.g. customer profiles for a specific offering
Special strategies are needed to transform it for the DW (e.g., text documents)
Archived data
Old data is archived
A DW has snapshots of historical data
External data
Executives depend upon external sources
E.g. market data of competitors
Conversion rules must be defined for external data
Architecture of DW
2. Data Staging Components
After data is extracted, it has to be prepared
Data extracted from sources needs to be
changed, converted and made ready in a
suitable format
Three major functions to make data ready
Extract
Transform
Load
The staging area provides a place and a
set of functions to
Clean
Change
Combine
Convert
Architecture of DW
3. Data Storage Components
Separate repository
Data structured for efficient processing
Redundancy is increased
Updated after specific periods
Read-only (not updated by users)
Architecture of DW
4. Information Delivery Component
Authentication issues
Active monitoring services
Performance: the DBA notes selected aggregates
in order to adjust storage
User performance
Aggregate awareness
E.g. mining, OLAP, etc.
DW Design
Designing DW
Background (ER Modeling)
For ER modeling, entities are collected from
the environment
Each entity acts as a table
Success reasons
Normalized after ER, since it removes redundancy
(to handle update/delete anomalies)
But the number of tables is increased
Useful for fast access to small amounts of data
ER Drawbacks for DW / Need for Dimensional
Modeling
ER models are hard to remember, due to the increased number of tables
Complex for queries with multiple tables (table joins)
A conventional RDBMS is optimized for a small number of tables,
whereas a large number of tables might be required in a DW
Ideally no calculated attributes
The DW does not require data updates like an OLTP system,
so there is no need for normalization
OLAP is not the only purpose of a DW; we need a model that
facilitates integration of data, data mining, and historically
consolidated data
An efficient indexing scheme is needed to avoid scanning all data
De-Normalization (in DW)
Add primary key
Direct relationships
Re-introduce redundancy
Dimensional Modeling
Dimensional modeling focuses on subject
orientation, the critical factors of the business
Critical factors are stored in facts
Redundancy is not a problem; it is used to achieve efficiency
A logical design technique for high performance
It is the modeling technique used for DW storage
Dimensional Modeling (cont.)
Two important concepts
Fact
Dimension
Dimensional Modeling (cont.)
Facts are stored in fact table
Dimensions are represented by dimension
tables
Dimensions are the perspectives along which facts can be
analyzed
Each fact is surrounded by dimension tables
Looks like a star, hence called a Star Schema
Example: Star schema for retail sales

FACT: time_key (FK), store_key (FK), clerk_key (FK), product_key (FK),
customer_key (FK), promotion_key (FK), dollars_sold, units_sold, dollars_cost
PRODUCT: product_key (PK), SKU, description, brand, category
TIME: time_key (PK), SQL_date, day_of_week, month
STORE: store_key (PK), store_ID, store_name, address, district, floor_type
CUSTOMER: customer_key (PK), customer_name, purchase_profile, credit_profile, address
CLERK: clerk_key (PK), clerk_id, clerk_name, clerk_grade
PROMOTION: promotion_key (PK), promotion_name, price_type, ad_type
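A minimal SQL sketch of two of these tables follows; the table name fact_sales and all column types are illustrative assumptions, not taken from the slides (the remaining dimension tables follow the PRODUCT pattern):

  -- One dimension table: wide, textual, one surrogate key
  CREATE TABLE product (
    product_key INT PRIMARY KEY,  -- system-generated surrogate key
    SKU         VARCHAR(20),
    description VARCHAR(100),
    brand       VARCHAR(50),
    category    VARCHAR(50)
  );

  -- The fact table: a foreign key per dimension plus the measures
  CREATE TABLE fact_sales (
    time_key      INT,
    store_key     INT,
    clerk_key     INT,
    product_key   INT,
    customer_key  INT,
    promotion_key INT,
    dollars_sold  DECIMAL(12,2),
    units_sold    INT,
    dollars_cost  DECIMAL(12,2),
    -- concatenated key: the collection of dimension keys
    PRIMARY KEY (time_key, store_key, clerk_key, product_key,
                 customer_key, promotion_key)
  );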
Inside Dimensional Modeling
Inside Dimension table
Key attribute of dimension table, for
identification
Large number of columns, a wide table
Non-calculated attributes, textual attributes
Attributes are not directly related
Un-normalized in Star schema
Drill-down and roll-up are two ways
of exploiting dimensions
Can have multiple hierarchies
Relatively small number of records
Inside Dimensional Modeling
Have two types of attributes
Key attributes, for connections
Facts
Inside fact table
Concatenated key
Grain or level of data identified
Large number of records
Limited attributes
Sparse data set
Degenerate dimensions (e.g., order number, used to
compute average products per order)
Fact-less fact table
Star Schema Keys
Primary keys
Identifying attribute in dimension table
Relationship attributes combine to form the P.K
Surrogate keys
Replacement of primary key
System generated
Foreign keys
Primary keys of the dimension tables appear in the fact table as
foreign keys
The primary key of the fact table is the concatenation of these
system-generated keys
Advantages of Star Schema
Easy for users to understand
Optimized for navigation (fewer joins,
faster)
Most suitable for query processing
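To see why navigation is cheap, here is a hedged example of a typical star-join query against the tables sketched above (store is assumed to mirror the STORE dimension of the figure):

  -- Total sales by brand and store district: one join per dimension used
  SELECT p.brand, s.district, SUM(f.dollars_sold) AS total_sales
  FROM fact_sales f
  JOIN product p ON p.product_key = f.product_key
  JOIN store   s ON s.store_key   = f.store_key
  GROUP BY p.brand, s.district;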
Normalization [1]
1st Normal Form [2]
“A relation is in first normal form if and only if
every attribute is single-valued for each tuple”
(Sample relation with columns: STU_ID, STU_NAME, MAJOR, CREDITS, CATEGORY)
1st Normal Form (Cont.)
(The same relation with every attribute single-valued)
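A small hypothetical instance (the data is invented for illustration) shows the repair: a tuple whose MAJOR holds two values violates 1NF; storing one single-valued MAJOR per tuple restores it:

  CREATE TABLE student (
    stu_id   INT,
    stu_name VARCHAR(50),
    major    VARCHAR(20),
    credits  INT,
    category VARCHAR(20)
  );
  -- Not in 1NF: a tuple such as (101, 'Ali', 'CS, Math', 30, 'Junior')
  -- holds two values in MAJOR
  -- In 1NF: repeat the tuple, one single-valued MAJOR each
  INSERT INTO student VALUES (101, 'Ali', 'CS',   30, 'Junior');
  INSERT INTO student VALUES (101, 'Ali', 'Math', 30, 'Junior');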
Another Example (composite key:
SID, Course) [1]
1st Normal Form Anomalies [1]
Update anomaly: Need to update all six rows
for the student with ID=1 if we want to change his
location from Islamabad to Karachi
Delete anomaly: Deleting the information about
a student who has graduated will remove all of
his information from the database
Insert anomaly: For inserting the information
about a student, that student must be
registered in a course
Solution: 2nd Normal Form
Functional dependencies [1]:
SID → campus
campus → degree
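A hedged sketch of the decomposition these dependencies imply (relation and column names are assumed from the anomaly example): the partial dependency SID → campus is removed by splitting the relation on the composite key (SID, Course):

  -- The 1NF relation was registration(SID, course, campus, degree);
  -- SID -> campus depends on only part of the key (SID, course)
  CREATE TABLE registration (
    sid    INT,
    course VARCHAR(20),
    PRIMARY KEY (sid, course)
  );
  CREATE TABLE student_campus (
    sid    INT PRIMARY KEY,
    campus VARCHAR(30),
    degree VARCHAR(30)  -- still transitive via campus -> degree (fixed in 3NF)
  );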
Example in 2nd Normal Form [1]
Anomalies [1]
Solution: 3rd Normal Form
Remove the transitive dependency: campus → degree
Example in 3rd Normal Form [1]
Denormalization [1]
Five techniques to denormalize
relations [1]
Collapsing tables
Pre-joining
Splitting tables (horizontal, vertical)
Adding redundant columns
Derived attributes
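As an illustration of pre-joining, a hedged sketch with assumed order tables; the join is computed once at load time and stored, so queries avoid it at run time:

  -- Pre-joined, denormalized table built from assumed normalized sources
  CREATE TABLE order_flat AS
  SELECT o.order_id,
         o.order_date,
         c.customer_name,   -- redundant copy of a customer attribute
         d.product_id,
         d.quantity
  FROM   orders o
  JOIN   customers     c ON c.customer_id = o.customer_id
  JOIN   order_details d ON d.order_id    = o.order_id;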
Collapsing tables (one-to-one) [1]
Splitting tables [1]
Redundant columns [1]
Updates to Dimension Tables
Updates to Dimension Tables (Cont.)
Proposed solution:
Updates to Dimension Tables (Cont.)
Solution: Add a new column of attribute
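The figures for these slides are not reproduced; assuming they walk through the standard slowly changing dimension techniques (overwrite, add a new row, add a new column), a hedged SQL sketch with an assumed customer dimension:

  -- Type 1: overwrite the old value (history is lost)
  UPDATE customer_dim SET city = 'Karachi' WHERE customer_id = 'C101';

  -- Type 2: expire the current row and add a new row
  -- under a new surrogate key (history is kept)
  UPDATE customer_dim SET current_flag = 'N'
  WHERE customer_id = 'C101' AND current_flag = 'Y';
  INSERT INTO customer_dim (customer_key, customer_id, city, current_flag)
  VALUES (90017, 'C101', 'Karachi', 'Y');

  -- Type 3: add a new column that keeps the previous value
  UPDATE customer_dim SET old_city = city, city = 'Karachi'
  WHERE customer_id = 'C101';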
Rapidly Changing Dimension
Rapidly Changing Dimension (Cont.)
“For example, an important attribute for customers might
be their account status (good, late, very late, in arrears,
suspended), and the history of their account status” [4]
“If this attribute is kept in the customer dimension table
and a type 2 change is made each time a customer's
status changes, an entire row is added only to track this
one attribute” [4]
“The solution is to create a separate account_status
dimension with five members to represent the account
states” [4] and join this new table or dimension to the
fact table.
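A minimal sketch of that separate dimension under the same assumptions:

  CREATE TABLE account_status_dim (
    status_key  INT PRIMARY KEY,
    status_name VARCHAR(20)  -- good, late, very late, in arrears, suspended
  );
  -- The fact table then carries status_key as one more foreign key, so
  -- status history is recorded per fact row instead of per customer row.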
Example
Junk Dimensions
Junk Dimension Example [3]
Junk Dimension Example (Cont.) [3]
The Snowflake Schema
Example 1 of Snowflake Schema
Example 2 of Snowflake Schema
Aggregate Fact Tables
Example
A way of making aggregates
Example:
Making Aggregates
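A hedged illustration of one way to make an aggregate: precompute a summary fact table at a coarser grain (names reuse the star schema sketched earlier; time_dim is assumed):

  -- Aggregate fact table: sales rolled up from day to month per product
  CREATE TABLE sales_month_agg AS
  SELECT t.month,
         f.product_key,
         SUM(f.dollars_sold) AS dollars_sold,
         SUM(f.units_sold)   AS units_sold
  FROM   fact_sales f
  JOIN   time_dim t ON t.time_key = f.time_key
  GROUP  BY t.month, f.product_key;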
Families of Stars
Families of Stars (Cont.)
Transaction tables (day-to-day data) and snapshot tables (data
captured after specific intervals)
Families of Stars (Cont.)
Core and custom tables
Families of Stars (Cont.)
Conformed Dimension: The attributes of a dimension
must have the same meaning for all those fact tables
with which the dimension is connected.
Extract, Transform, Load (ETL)
Data Extraction
Data can be extracted using third-party tools
or in-house programs or scripts
Data extraction issues:
1. Identify sources
2. Method of extraction for each source (manual,
automated)
3. When and how frequently data will be extracted from
each source
4. Time window
5. Sequencing of extraction processes
How data is stored in operational
systems
Current value: Values continue to change as
daily transactions are performed. We need to
monitor these changes to maintain history for
the decision-making process, e.g., bank balance,
customer address, etc.
Periodic status: sometimes the history of
changes is maintained in the source system
Example
Data Extraction Method
Incremental data extraction
Immediate data extraction: involves data extraction in
real time.
Possible options:
1. Capture through transaction logs
2. Capture through triggers/stored procedures (sketched below)
3. Capture via source application
4. Capture on the basis of time and date stamps
5. Capture by comparing files
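A hedged sketch of option 2, trigger-based capture, in MySQL-style syntax (all table and column names are assumptions):

  -- Change table receiving one row per captured address change
  CREATE TABLE customer_changes (
    customer_id INT,
    old_address VARCHAR(100),
    new_address VARCHAR(100),
    changed_at  DATETIME
  );

  CREATE TRIGGER customer_address_capture
  AFTER UPDATE ON customer
  FOR EACH ROW
    INSERT INTO customer_changes
    VALUES (OLD.customer_id, OLD.address, NEW.address, NOW());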
Data Transformation
Data Transformation (Cont.)
Data Loading
Determine when (time) and how (as a whole or in
chunks) to load data
Four modes to load data
1. Load: removes old data if present, then loads the new data
2. Append: the old data is not removed; the new data is
appended to the old data
3. Destructive Merge: if the primary key of the new record
matches the primary key of an old record, the old record
is updated (sketched below)
4. Constructive Merge: if the primary key of the new record
matches the primary key of an old record, the old record is
not updated; the new record is added and marked as the
superseding record
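A hedged sketch of a destructive merge using the standard SQL MERGE statement (staging and warehouse table names are assumptions):

  MERGE INTO customer_dim d
  USING staging_customer s
  ON (d.source_key = s.customer_id)
  WHEN MATCHED THEN        -- destructive: the old record is overwritten
    UPDATE SET d.address = s.address
  WHEN NOT MATCHED THEN
    INSERT (source_key, address) VALUES (s.customer_id, s.address);

  -- A constructive merge would instead always insert the new record
  -- and mark it as the superseding one.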
Data Loading (Cont.)
Data Refresh vs. Data Update
A full refresh reloads the whole data set after deleting the old data;
data updates are used to update only the changing attributes
Data Loading (Cont.)
Loading for dimension tables: You need to define
a mapping between the source system key and the system-
generated (surrogate) key in the data warehouse; otherwise you will
not be able to load/update data correctly
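A minimal sketch of such a mapping table (names are assumptions):

  CREATE TABLE customer_key_map (
    source_system VARCHAR(20),
    source_key    VARCHAR(20),  -- key in the operational source
    surrogate_key INT,          -- system-generated key in the DW
    PRIMARY KEY (source_system, source_key)
  );
  -- During a load, each incoming record's source key is looked up here to
  -- decide whether to insert a new dimension row or update an existing one.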
Data Loading (Cont.)
Updates to dimension table
Data Quality Management
It is important to ensure that the data is correct in order
to make the right decisions
Imagine a user working on an operational system who
enters wrong region codes for customers.
Imagine that the business has never sent
an invoice using these region codes (so the errors go
unnoticed). But what will happen if the data
warehouse uses these codes to make decisions?
You need to put proper time and effort to ensure
data quality
Data Quality
http://www.humaninference.com/master-data-management/data-quality/data-cleansing
Google Refine,
http://code.google.com/p/google-refine/
Wrangler
http://vis.stanford.edu/wrangler/
Text Pipe Pro
http://www.datamystic.com/textpipe.html#.UKjm9eQ3vyo
Information Supply
Naive users
Regular users: daily users but cannot make
queries and reports themselves. They need
query templates and predefined reports
Power users: Technically sound users, who
can make queries, reports, scripts, import
and export data themselves
User classes from another perspective
High-level Executives and Managers: Need
standard pre-processed reports to make strategic
decisions
Technical Analysts: complex analysis, statistical
analysis, drill-down, slice, dice
Business Analysts: comfortable with technology but
cannot make queries, can modify reports to get
information from different angles.
Business-oriented users: predefined GUIs, and
possibly support for some ad-hoc queries
The ways of interaction
Preprocessed reports: routine reports which are
delivered at some specific interval
Predefined queries and templates: users can supply
their own parameters to predefined query templates
and reports with a predefined format
Limited ad-hoc access: few and simple queries
which are developed from scratch
Complex ad-hoc access: complicated queries and
analysis. Can be used as a basis for predefined
reports and queries
Information delivery framework
Online Analytical Processing
(OLAP)
Used for fast and complex analysis on a data warehouse
It is not a database design technique, but a category of
applications
Definition by Dr. E. F. Codd
OLAP?
OLAP (cont.)
What is the solution?
Dimensional Analysis
1. What are cubes?
2. What are hyper-cubes?
Drill-down and Roll-up
Drill through
Slice and Dice
Pivoting
Dimensional Analysis
OLAP supports multi-dimensional analysis
Cube: has three dimensions (X-, Y-, and Z-axis)
The same view can be shown on a spreadsheet
Multidimensional Analysis (Spreadsheet
view) / Hypercube
Drill-down and Roll-up [5]
Drill through
Slicing [5]
Dicing [5]
Pivoting [5]
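The figures for these operations are not reproduced; as a hedged SQL approximation over the star schema assumed earlier, a slice fixes one dimension value and a roll-up aggregates to coarser levels (ROLLUP as in SQL Server/Oracle):

  -- Slice: fix the store district; Roll-up: aggregate sales from
  -- (category, month) up to category totals and a grand total
  SELECT p.category, t.month, SUM(f.dollars_sold) AS sales
  FROM   fact_sales f
  JOIN   product  p ON p.product_key = f.product_key
  JOIN   time_dim t ON t.time_key    = f.time_key
  JOIN   store    s ON s.store_key   = f.store_key
  WHERE  s.district = 'North'              -- the slice
  GROUP  BY ROLLUP (p.category, t.month);  -- the roll-up levels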
A SQL Server view for a cube [6]
OLAP Implementations [1]
Dense Index
Sparse Index
Multilevel Index (B-Tree)
Dense Indexing
For each record, store the key of that record and a
pointer to where the record is actually placed on disk
If the index fits in memory, a lookup requires one I/O operation;
if not, performance will degrade
Sparse Index
An index entry is kept per block of data items
It takes less space, but at the expense of
some efficiency in terms of time
Multilevel Indexing
It uses a little more space but it increases efficiency in
terms of time
It is good for queries with equality conditions or range
conditions
Example of B-Tree
B-Tree limitations
References
[1] Abdullah, A.: “Data warehousing handouts”, Virtual University of Pakistan
[2] Ricardo, C. M.: “Database Systems: Principles, Design and Implementation”, Macmillan Coll Div.
[3] Junk Dimension, http://www.1keydata.com/datawarehousing/junk-dimension.html
[4] Advanced Topics of Dimensional Modeling, https://mis.uhcl.edu/rob/Course/DW/Lectures/Advanced%20Dimensional%20Modeling.pdf
[5] http://en.wikipedia.org/wiki/OLAP_cube
[6] http://www.mssqltips.com/sqlservertutorial/2011/processing-dimensions-and-cube/
[7] http://dev.mysql.com/doc/refman/5.1/en/partitioning-hash.html