Introduction To Data Warehousing and Business Intelligence
Business Intelligence
Slides kindly borrowed from the course
Data Warehousing and Machine Learning
Aalborg University, Denmark
Christian S. Jensen
Torben Bach Pedersen
Christian Thomsen
{csj,tbp,chr}@cs.aau.dk
Course Structure
Business intelligence
Extract knowledge from large amounts of data collected in a
modern enterprise
Data warehousing, machine learning
Purpose
Acquire theoretical background in lectures and literature studies
Obtain practical experience with (industrial) tools in practical
exercises
1
Literature
Overview
2
What is Business Intelligence (BI)?
Combination of technologies
Data Warehousing (DW)
On-Line Analytical Processing (OLAP)
Data Mining (DM)
3
Why is BI Important?
4
Case Study of an Enterprise
Example of a chain (e.g., fashion stores or car dealers)
Each store maintains its own customer records and sales records
Hard to answer questions like: find the total sales of Product X from
stores in Aalborg
The same customer may be viewed as different customers for
different stores; hard to detect duplicate customer information
Imprecise or missing data in the addresses of some customers
Purchase records maintained in the operational system for limited
time (e.g., 6 months); then they are deleted or archived
The same product may have different prices, or different discounts
in different stores
Can you see the problems in using these data for business
analysis?
Heterogeneous sources
Relational DBMS, On-Line Transaction Processing (OLTP)
Unstructured data in files (e.g., MS Word)
Legacy systems
10
5
Data Analysis Problems (cont)
11
Data Warehousing
Solution: new analysis environment (DW) where data are
Subject oriented (versus function oriented)
Integrated (logically and physically)
Time variant (data can always be related to time)
Stable (data not deleted, several versions)
Supporting management decisions (different organization)
6
DW: Purpose and Definition
13
7
Function vs. Subject Orientation
Function-oriented systems: each application (Appl.) has its own
database (DB) and handles its own transactions (Trans.)
Subject-oriented systems: data are integrated into a DW and data
marts (DM) that serve decision-support applications (D-Appl.)
The DW covers all subjects, integrated (e.g., Sales, Costs, Profit)
A data mart covers selected subjects
15
Ways of building the DW and data marts
Top-down: 1. Design of DW, 2. Design of DMs
Bottom-up: 1. Design of DMs, 2. Maybe integration of DMs in DW,
3. Maybe no DW
In-between: 1. Design of DW for DM1, 2. Design of DM2 and
integration with DW, 3. Design of DM3 and integration with DW, 4. ...
16
8
Hard/Infeasible Queries for OLTP
Why not use the existing databases (OLTP) for
business analysis?
Business analysis queries
In the past five years, which product is the most profitable?
On which public holiday do we have the largest sales?
In which week do we have the largest sales?
Do the sales of dairy products increase over time?
Difficult to express these queries in SQL
3rd query: may extract the week value using a function
But the user has to learn many transformation functions
4th query: use a special table to store IDs of all dairy products,
in advance
There can be many different dairy products; there can be many
other product types as well
The need for multidimensional modeling
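As an illustration of the awkwardness, here is a sketch of the third query written directly against a hypothetical OLTP table Sales(sale_id, sale_date, amount); the analyst must know the date-extraction and row-limiting syntax of the particular DBMS, and similar tricks are needed for every new analysis question:

SELECT EXTRACT(WEEK FROM sale_date) AS sales_week,
       SUM(amount) AS total_sales
FROM Sales
GROUP BY EXTRACT(WEEK FROM sale_date)
ORDER BY total_sales DESC
FETCH FIRST 1 ROW ONLY   -- keep only the top week; this syntax varies by DBMS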
17
Multidimensional Modeling
Example: sales of supermarkets
Facts and measures
Each sales record is a fact, and its sales value is a
measure
Dimensions
Group correlated attributes into the same dimension - easier for
analysis tasks
Each sales record is associated with its values of
Product, Store, Time
9
Multidimensional Modeling
How do we model the Time dimension?
Hierarchies with multiple levels
Attributes, e.g., holiday, event
[Example Time dimension table with columns such as: tid, day#, day, week#, month, year, workday]
19
10
OLTP vs. OLAP
21
22
11
On-Line Analytical Processing (OLAP)
102
[Bus architecture matrix - data marts vs. shared dimensions
(Time, Customer, Product, Supplier): the Sales mart uses three of
the dimensions, Costs uses two, and Profit uses all four]
24
12
Extract, Transform, Load (ETL)
25
Performance Optimization
The data warehouse contains GBytes or even TBytes of data!
OLAP users require fast query response time
They don't want to wait 1 hour for the result!
Acceptable: answer within 10 seconds

Sales (1 billion rows):
tid  pid  locid  sales
1    1    1      10
2    1    1      20
3    2    3      40
26
13
Materialization Example
Imagine 1 billion sales rows, 1000 products, 100 locations

CREATE VIEW TotalSales (pid, locid, total) AS
SELECT s.pid, s.locid, SUM(s.sales)
FROM Sales s
GROUP BY s.pid, s.locid

Sales (1 billion rows): tid, pid, locid, sales

Wish to answer the query:
SELECT p.category, SUM(s.sales)
FROM Products p, Sales s
WHERE p.pid = s.pid
GROUP BY p.category

Rewrite the query to use the view TotalSales (pid, locid, total)
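A sketch of the rewritten query (not shown above): it aggregates the view, which holds at most 1000 x 100 = 100,000 rows, instead of scanning 1 billion facts.

SELECT p.category, SUM(t.total)
FROM Products p, TotalSales t
WHERE p.pid = t.pid
GROUP BY p.category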
28
14
Central DW Architecture
29
Federated DW Architecture
Data stored in separate data marts, aimed at special departments
(e.g., Finance mart, Marketing mart, Distribution mart)
Logical DW (i.e., virtual)
Data marts contain detail data
Pros
Performance due to distribution
Cons
More complex
30
15
Tiered Architecture
Central DW is materialized
Data is distributed to data marts in
one or more tiers
Only aggregated data in cube tiers
Data is aggregated/reduced as it
moves through tiers
Pros
Best performance due to
redundancy and distribution
Cons
Most complex
Hard to manage
31
Common DW Issues
Metadata management
Need to understand data = metadata needed
Greater need in OLAP than in OLTP as raw data is used
Need to know about:
Data definitions, dataflow, transformations, versions, usage, security
DW project management
DW projects are large and different from ordinary SW projects
12-36 months and US$ 1+ million per project
Data marts are smaller and safer (bottom up approach)
Reasons for failure
Lack of proper design methodologies
High HW+SW cost
Deployment problems (lack of training)
Organizational change is hard (new processes, data ownership, ...)
Ethical issues (security, privacy, ...)
32
16
Topics not Covered in the Course
33
Summary
34
17
Multidimensional Databases
Overview
1
ER Model vs. Multidimensional Model
Why don't we use the ER model in data warehousing?
2
The multidimensional model
Data is divided into:
Facts
Dimensions
Facts are the important entity: a sale
Facts have measures that can be aggregated: sales price
Dimensions describe facts
A sale has the dimensions Product, Store and Time
Facts live in a multidimensional cube (dice)
Think of an array from programming languages
Goal for dimensional modeling:
Surround facts with as much context (dimensions) as possible
Hint: redundancy may be ok (in well-chosen places)
But you should not try to model all relationships in the data (unlike
E/R and OO modeling!)
Cube Example
[Cube example: sales by Product (Milk, Bread), City (e.g., Aalborg),
and Year (2000, 2001); each cell holds a sales figure, e.g., 56, 67]
6
3
Cubes
A cube may have many dimensions!
More than 3 - the term hypercube is sometimes used
Theoretically no limit for the number of dimensions
Typical cubes have 4-12 dimensions
But only 2-4 dimensions can be viewed at a time
Dimensionality reduced by queries via projection/aggregation
A cube consists of cells
A given combination of dimension values
A cell can be empty (no data for this combination)
A sparse cube has few non-empty cells
A dense cube has many non-empty cells
Cubes become sparser for many/large dimensions
Dimensions
Dimensions are the core of multidimensional databases
Other types of databases do not support dimensions
Dimensions are used for
Selection of data
Grouping of data at the right level of detail
Dimensions consist of dimension values
Product dimension has values "milk", "cream", ...
Time dimension has values 1/1/2001, 2/1/2001, ...
Dimension values may have an ordering
Used for comparing cube data across values
Example: percent sales increase compared with last month
Especially used for Time dimension
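As an illustration of using the Time ordering, here is a sketch (assuming a hypothetical pre-aggregated table MonthlySales(MonthNo, TotalSales)) that computes the percent sales increase compared with the previous month:

SELECT MonthNo,
       TotalSales,
       100.0 * (TotalSales - LAG(TotalSales) OVER (ORDER BY MonthNo))
             / LAG(TotalSales) OVER (ORDER BY MonthNo) AS PctIncrease
FROM MonthlySales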
4
Dimensions
Dimensions have hierarchies with levels
Typically 3-5 levels (of detail)
Dimension values are organized in a tree structure
Product: Product->Type->Category
Store: Store->Area->City->County
Time: Day->Month->Quarter->Year
Dimensions have a bottom level and a top level (ALL)
Levels may have attributes
Simple, non-hierarchical information
Day has Workday as attribute
Dimension Example
Location
[Figure: Location dimension shown as a schema (levels up to the top
level T) and a corresponding instance]
10
5
Facts
Facts represent the subject of the desired analysis
The important things in the business that should be analyzed
A fact is identified via its dimension values
A fact is a non-empty cell
Generally, a fact should
Be attached to exactly one dimension value in each dimension
Only be attached to dimension values in the bottom levels
Some models do not require this
11
Types of Facts
Event fact (transaction)
A fact for every business event (sale)
Fact-less facts
A fact per event (customer contact)
No numerical measures
An event has happened for a given dimension value combination
Snapshot fact
A fact for every dimension combination at given time intervals
Captures current status (inventory)
Cumulative snapshot facts
A fact for every dimension combination at given time intervals
Captures cumulative status up to now (sales in year to date)
Each type of fact answers different questions
Often event facts and both kinds of snapshot facts co-exist
12
6
Granularity
Granularity of facts is important
What does a single fact mean?
Level of detail
Given by combination of bottom levels
Example: total sales per store per day per product
Important for number of facts
Scalability
Often the granularity is a single business transaction
Example: sale
Sometimes the data is aggregated (total sales per store per day
per product)
Might be necessary due to scalability
Generally, transaction detail can be handled
Except perhaps huge clickstreams etc.
13
Measures
Measures represent the fact property that the users want
to study and optimize
Example: total sales price
A measure has two components
Numerical value (e.g., sales price)
Aggregation formula (SUM): used for aggregating/combining a
number of measure values into one
14
7
Types of Measures
Three types of measures
Additive
Can be aggregated over all dimensions
Example: sales price
Often occur in event facts
Semi-additive
Cannot be aggregated over some dimensions - typically time
Example: inventory
Often occur in snapshot facts
Non-additive
Cannot be aggregated over any dimensions
Example: average sales price
Occur in all types of facts
15
Schema Documentation
[Figure: four dimension hierarchies -
Store dimension: Store -> County -> T
Product dimension: Product -> Category -> T
Customer dimension: Customer -> Cust. group -> T
Time dimension: ... -> Month -> Year -> T]
8
Why the schema cannot answer question X
Possible reasons
Certain measures not included in fact table
Granularity of facts too coarse
Particular dimensions not in DW
Descriptive attributes missing from dimensions
Meaning of attributes/measures deviates from the
expectation of data analysts (users)
17
ROLAP
Relational OLAP
Data stored in relational tables
Star (or snowflake) schemas used for modeling
SQL used for querying
Pros
Leverages investments in relational technology
Scalable (billions of facts)
Flexible, designs easier to change
New, performance enhancing techniques adapted from MOLAP
Indices, materialized views
Cons
Storage use (often 3-4 times MOLAP)
Response times

Example fact table:
Product ID  Store ID  Sales
1           3         2
2           1         7
3           2         3
18
9
MOLAP
Multidimensional OLAP
Data stored in special multidimensional data structures
E.g., multidimensional array on hard disk
Pros
Less storage use (foreign keys not stored)
Faster query response times
Cons
Up till now not so good scalability
Less flexible, e.g., cube must be re-computed when design changes
Does not reuse an existing investment (but often bundled with
RDBMS)
Not as open technology

Example MOLAP data cube (array over dimensions d1, d2):
d2 \ d1   1  2  3
1         0  7  0
2         2  0  0
3         0  0  3
19
HOLAP
Hybrid OLAP
Detail data stored in relational tables (ROLAP)
Aggregates stored in multidimensional structures (MOLAP)
Pros
Scalable (as ROLAP)
Fast (as MOLAP)
Cons
High complexity
20
10
Relational Implementation
21
Relational Design
22
11
Star Schema Example
Star schemas
One fact table
De-normalized dimension tables
One column per level/attribute
Relational Implementation
The fact table stores facts
One column for each measure
One column for each dimension (foreign key to dimension table)
Dimension keys make up a composite primary key
A dimension table stores a dimension
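A minimal DDL sketch of such a star schema (not from the slides; table and column names are illustrative for a grocery-style example, and the Time dimension table is named TimeDim to avoid SQL reserved words):

CREATE TABLE Product (
  ProductID INTEGER PRIMARY KEY,   -- surrogate key
  Description VARCHAR(100),
  Brand VARCHAR(50),
  Category VARCHAR(50)             -- higher levels de-normalized into the same table
);
CREATE TABLE Store (
  StoreID INTEGER PRIMARY KEY,
  StoreName VARCHAR(50),
  City VARCHAR(50),
  County VARCHAR(50)
);
CREATE TABLE TimeDim (
  TimeID INTEGER PRIMARY KEY,
  DayDate DATE,
  MonthNo INTEGER,
  YearNo INTEGER,
  Workday CHAR(1)
);
CREATE TABLE Sales (
  ProductID INTEGER REFERENCES Product,
  StoreID INTEGER REFERENCES Store,
  TimeID INTEGER REFERENCES TimeDim,
  SalesPrice DECIMAL(10,2),        -- measure
  PRIMARY KEY (ProductID, StoreID, TimeID)   -- dimension keys form the composite key
);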
24
12
Snowflake Schema Example
25
Question Time
Two alternative hierarchies for the Store dimension:
Store Schema A: Store -> City -> County -> T
Store Schema B: Store -> City -> County -> Country
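For contrast with the star DDL sketched earlier, a snowflake sketch of the Store dimension (illustrative names): each level gets its own table, making the hierarchy explicit through foreign keys.

-- Replaces the de-normalized Store table from the star sketch:
CREATE TABLE County (
  CountyID INTEGER PRIMARY KEY,
  CountyName VARCHAR(50)
);
CREATE TABLE City (
  CityID INTEGER PRIMARY KEY,
  CityName VARCHAR(50),
  CountyID INTEGER REFERENCES County
);
CREATE TABLE Store (
  StoreID INTEGER PRIMARY KEY,
  StoreName VARCHAR(50),
  CityID INTEGER REFERENCES City
);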
26
13
Star vs Snowflake
Star Schemas
+ Simple and easy overview -> ease-of-use
+ Relatively flexible
+ Dimension tables often relatively small
+ Recognized by many RDBMSes -> good performance
- Hierarchies are hidden in the columns
- Dimension tables are de-normalized
Snowflake schemas
+ Hierarchies are made explicit/visible
+ Very flexible
+ Dimension tables use less space
- Harder to use due to many joins
- Worse performance
27
Redundancy in the DW
Only very little or no redundancy in fact tables
The same fact data only stored in one fact table
Redundancy is mostly in dimension tables
Star dimension tables have redundant entries for the higher levels
Redundancy problems?
Inconsistent data - the central load process helps with this
Update time - the DW is optimized for querying, not updates
Space use: dimension tables typically take up less than 5% of DW
So: controlled redundancy is good
Up to a certain limit
28
14
(Relational) OLAP Queries
Two kinds of queries
Navigation queries - examine one dimension
SELECT DISTINCT l FROM d [WHERE p]
Aggregation queries - summarize fact data
SELECT d1.l1, d2.l2, SUM(f.m) FROM d1, d2, f
WHERE f.dk1 = d1.dk1 AND f.dk2 = d2.dk2 [AND p]
GROUP BY d1.l1,d2.l2
Fast, interactive analysis of large amounts of data
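A concrete instance of the aggregation pattern above, using the illustrative star schema sketched earlier (Product, TimeDim, Sales): total sales per product category per year.

SELECT p.Category, t.YearNo, SUM(s.SalesPrice)
FROM Product p, TimeDim t, Sales s
WHERE s.ProductID = p.ProductID
  AND s.TimeID = t.TimeID
GROUP BY p.Category, t.YearNo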
29
OLAP Queries
Starting level: (City, Year, Product)
[Figure: slice/dice on the cube - e.g., restricting to particular
products (Milk, Bread) or cities (Aalborg, Copenhagen)]
15
OLAP Cube in MS Analysis Services Project
drill down
31
32
16
DW Design Steps
33
Time dimension
Explicit time dimension is needed (events, holidays,..)
Product dimension
Many-level hierarchy allows drill-down/roll-up
Many descriptive attributes (often more than 50)
Store dimension
Many descriptive attributes
Promotion dimension
Example of a causal dimension
Used to see if promotions work/are profitable
Ads, price reductions, end-of-aisle displays, coupons
34
17
The Grocery Store Measures
All additive across all dimensions
Dollar_sales
Unit_sales
Dollar_cost
Gross profit (derived)
Computed from sales and cost: sales - cost
Additive
Gross margin (derived)
Computed from gross profit and sales: (sales - cost) / sales
Non-additive across all dimensions
Customer_count
Additive across time, promotion, and store
Non-additive across product. Why?
Semi-additive
35
36
18
Summary
37
19
Advanced MD Modeling and
MD Database Implementation
Overview
1
Changing Dimensions
2
Example
[Star schema: Sales fact (TimeID, StoreID, ProductID, ItemsSold, Amount);
Time dim. (TimeID, Weekday, Week, Month, Quarter, Year, DayNo, Holiday);
Store dim. (StoreID, Address, City, District, Size, SCategory);
Product dim. (ProductID, Description, Brand, PCategory)]
Attribute values in dimensions vary over time
A store changes Size
A product changes Description
Districts are changed
Problems
Dimensions not updated - the DW is not up-to-date
Dimensions updated in a straightforward way - incorrect
information in historical data
6
Example
[Same star schema as above: Sales fact with Time, Store, and Product dimensions]
The store in Aalborg has a size of 250 sq. metres.
On a certain day, customers bought 2000 apples from that store.
3
Solution 1: No Special Handling
Sales fact table:
StoreID  ItemsSold
001      2000

Store dimension table:
StoreID  Size
001      250
Solution 1
Solution 1: Overwrite the old values in the
dimension tables
Consequences
Old facts point to rows in the dimension tables with
incorrect information!
New facts point to rows with correct information
Pros
Easy to implement
Useful if the updated attribute is not significant, or the old
value should be updated for error correction
Cons
Old facts may point to incorrect rows in dimensions
4
Solution 2
Solution 2: Versioning of rows with changing attributes
The key that links dimension and fact table, identifies a version of a
row, not just a row
Surrogate keys make this easier to implement
What if we had used, e.g., the shop's zip code as the key?
Consequences
Larger dimension tables
Pros
Correct information captured in DW
No problems when formulating queries
Cons
Cannot capture the development over time of the subjects the
dimensions describe
e.g., relationship between the old store and the new store not
captured
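A minimal sketch of the versioning idea (assuming the Store dimension from the example above, with StoreID as a surrogate key; all values are illustrative): when the Aalborg store grows from 250 to 400 sq. metres, a new row with a new surrogate key is inserted instead of overwriting the old one.

-- Existing version keeps its key: StoreID = 001, Size = 250
-- New version gets a new surrogate key (illustrative values):
INSERT INTO Store (StoreID, Address, City, District, Size, SCategory)
VALUES (117, 'Nytorv 1', 'Aalborg', 37, 400, 'Large')
-- Facts loaded after the change reference StoreID = 117;
-- old facts still reference StoreID = 001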
10
5
Solution 3
Solution 3: Create two versions of each changing attribute
One attribute contains the current value
The other attribute contains the previous value
Consequences
Two values are attached to each dimension row
Pros
Possible to compare across the change in dimension value (which
is a problem with Solution 2)
Such comparisons are interesting when we need to work
simultaneously with two alternative values
Example: Categorization of stores and products
Cons
Not possible to see when the old value changed to the new
Only possible to capture the two latest values
12
Example - two versions of an attribute (District):

Sales fact table:
StoreID  ItemsSold
001      2000

Store dimension table:
StoreID  DistrictOld  DistrictNew
001      37           73
6
Solution 2A
7
Solution 2B
16
Add validity-period attributes to the dimension: From, To

Sales fact table:
StoreID  TimeID  ItemsSold
001      234     2000

Store dimension table:
StoreID  Size  From  To
001      250   98    99
002      450   00    -
17
8
Example of Using Solution 2B
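The slide's own example is not reproduced here. As a sketch of one way Solution 2B can be queried (assumptions: From and To hold the TimeIDs between which a dimension row is valid, versions of the same store share a StoreID, and an open To marks the current version; the columns are renamed ValidFrom/ValidTo to avoid SQL reserved words):

SELECT f.TimeID, f.ItemsSold, d.Size
FROM Sales f, Store d
WHERE f.StoreID = d.StoreID
  AND f.TimeID >= d.ValidFrom
  AND (f.TimeID <= d.ValidTo OR d.ValidTo IS NULL)   -- version valid at the fact's time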
18
19
9
Solution 4: Dimension Splitting
Customer dimension (original):
CustID, Name, PostalAddress, Gender, DateofBirth, Customerside,
NoKids, MaritalStatus, CreditScore, BuyingStatus, Income, Education

Customer dimension (new) - relatively static attributes:
CustID, Name, PostalAddress, Gender, DateofBirth, Customerside

Demographics dimension - often-changing attributes:
DemographyID, NoKids, MaritalStatus, CreditScoreGroup,
BuyingStatusGroup, IncomeGroup, EducationGroup
20
Solution 4
Solution 4
Make a minidimension with the often-changing (demographic)
attributes
Convert (numeric) attributes with many possible values into
attributes with few discrete or banded values
E.g., Income group: [0,10K), [10K,20K), [20K,30K), [30K,40K), ...
Why? Any Information Loss?
Insert rows for all combinations of values from these new domains
With 6 attributes with 10 possible values each, the dimension gets
10^6 = 1,000,000 rows
If the minidimension is too large, it can be further split into more
minidimensions
Here, synchronous/correlated attributes must be considered (and
placed in the same minidimension)
The same attribute can be repeated in another minidimension
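A sketch of how the minidimension could be populated with all combinations of the banded values (illustrative table names; only two attributes are shown, further attributes would be cross-joined the same way):

INSERT INTO Demographics (DemographyID, IncomeGroup, EducationGroup)
SELECT ROW_NUMBER() OVER (ORDER BY i.IncomeGroup, e.EducationGroup),
       i.IncomeGroup,
       e.EducationGroup
FROM IncomeGroups i
CROSS JOIN EducationGroups e   -- one row per combination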
21
10
Solution 4 (Changing Dimensions)
Pros
DW size (dimension tables) is kept down
Changes in a customer's demographic values do not
result in changes in dimensions
Cons
More dimensions and more keys in the star schema
Navigation of customer attributes is more cumbersome
as these are in more than one dimension
Using value groups gives less detail
The construction of groups is irreversible
22
23
11
Coordinating Data Cubes / Data Marts
24
DW Bus Architecture
What method for DW construction?
Everything at once, top-down DW (monoliths)
Separate, independent marts (stovepipes, data islands)
None of these methods work in practice
Both have different built-in problems
Architecture-guided step-by-step method
Combines the advantages of the first two methods
A data mart can be built much faster than a DW
ETL is always the hardest - minimize risk with a simple mart
But: data marts must be compatible
Otherwise, incomparable views of the enterprise result
Start with single-source data marts
Facts from only one source makes everything easier
25
12
DW Bus Architecture
Data marts built independently by departments
Good (small projects, focus, independence,)
Problems with stovepipes (reuse across marts impossible)
Conformed dimensions and facts/measures
Conformed dimensions
Same structure and content across data marts
Take data from the best source
Dimensions are copied to data marts (not a space problem)
Conformed fact definitions
The same definition across data marts (price excl. sales tax)
Observe units of measurement (also currency, etc.)
Use the same name only if it is exactly the same concept
Facts are not copied between data marts (facts > 95% of data)
This allows several data marts to work together
Combining data from several fact tables is no problem
26
DW Bus Architecture
Dimension content managed by dimension owner
The Customer dimension is made and published in one place
Tools query each data mart separately
Separate queries to each data mart
Results combined by tool or OLAP server
It is hard to make conformed dimensions and facts
Organizational and political challenge, not technical
Get everyone together and get a top manager (CIO) to back the
conformance decision.
No-one must be allowed to escape
Exception: if business areas are totally separate
No common management/control
27
13
Large Scale Cube Design
The design is never finished
The dimensional modeler is always looking for new information to
include in dimensions and facts
A sign of success!
New dimensions and measures introduced gracefully
Existing queries will give same result
Example: Location dimension can be added for old+new facts
Can usually be done if data has sufficiently fine granularity
Data mart granularity
Always as fine as possible (transaction level detail)
Makes the mart insensitive to changes
28
29
14
Matrix Method
DW Bus Architecture Matrix
Two-dimensional matrix
X-axis: dimensions
Y-axis: data marts
Planning Process
Make list of data marts
Make list of dimensions
Mark co-occurrences (which marts have which dimensions)
Time dimension occurs in (almost) all marts
30
Matrix Example
[Matrix: rows = data marts (Sales, Costs, Profit), columns = dimensions;
Sales and Costs each use three dimensions, Profit uses all four]
31
15
Multidimensional database implementation
MS SQL Server
MS Analysis Services
32
Microsoft's RDBMS
Runs on Windows OS only
Nice features built-in
Analysis Services
Integration Services
Reporting Services
Easy to use
Graphical Management Studio and BI Developer
Studio
Watch the demonstration videos from Microsoft to get a
quick introduction
33
16
MS Analysis Services
34
Summary
35
17
Extract, Transform, Load (ETL)
ETL Overview
1
The ETL Process
Phases
Design phase
Modeling, DB design, source selection,
Loading phase
First load/population of the DW
Based on all data in sources
Refreshment phase
Keep the DW up-to-date wrt. source data changes
2
ETL/DW Refreshment
[Figure: ETL/DW refreshment - data passes through a preparation
phase and an integration phase into the DW and on to the data
marts (DM)]
3
Data Staging Area (DSA)
Transit storage for data in the ETL process
Transformations/cleansing done here
No user queries
Sequential operations on large data volumes
Performed by central ETL logic
Easily restarted
No need for locking, logging, etc.
RDBMS or flat files? (DBMS have become better at this)
Finished dimensions copied from DSA to relevant marts
Allows centralized backup/recovery
Backup/recovery facilities needed
Better to do this centrally in DSA than in all data marts
4
High-level diagram
1) Make high-level diagram of source-destination flow
Mainly used for communication purposes
One page only, highlight sources and destinations
Steps: extract, transform, load
[Example flow: sources Raw-Product (spreadsheet) and Raw-Sales
(RDBMS); transformations: check referential integrity (R.I.), add
product type, extract time, aggregate sales per product per day]
Building Dimensions
Static dimension table
DW key assignment: production keys to DW keys using a mapping table
Check one-one and one-many relationships (using sorting)

Key mapping table for the Product dimension:
pid  DW_pid  Time
11   1       100
22   2       100
35   3       200
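A sketch of how such a key-mapping table could be used when loading facts (illustrative names, with the mapping table called ProductKeyMap): source rows carry production keys, which are replaced by DW surrogate keys via a join.

INSERT INTO SalesFact (DW_pid, TimeID, Amount)
SELECT m.DW_pid, s.TimeID, s.Amount
FROM RawSales s
JOIN ProductKeyMap m ON m.pid = s.pid   -- swap production key for DW key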
10
5
Building Fact Tables
11
Extract
12
6
Types of Data Sources
Non-cooperative sources
Snapshot sources - provide only a full copy of the source, e.g., files
Specific sources - each is different, e.g., legacy systems
Logged sources - write a change log, e.g., DB log
Queryable sources - provide a query interface, e.g., RDBMS
Cooperative sources
Replicated sources - publish/subscribe mechanism
Call back sources - call external code (ETL) when changes occur
Internal action sources - only internal actions when changes occur
DB triggers are an example
Extract strategy depends on the source types
13
Extract
14
7
Computing Deltas
Delta = changes since last load
Store sorted total extracts in DSA
Delta can easily be computed from current + last
extract
+ Always possible
+ Handles deletions
- High extraction time
Put update timestamp on all rows (in sources)
Updated by DB trigger
- Source system must be changed, operational overhead
Extract only where timestamp > time of last extract
+ Reduces extract time
- Cannot (alone) handle deletions

Example (last extract time: 300) - only the rows with timestamps
400 and 500 are extracted:
Timestamp  DKK
100        10
200        20
300        15
400        60
500        33
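A minimal SQL sketch of the first strategy, computing the delta from the current and the previous full extract stored in the DSA (table names are illustrative); an updated row shows up in both results, once with its new and once with its old values.

-- New and updated rows:
SELECT * FROM Extract_Current
EXCEPT
SELECT * FROM Extract_Previous

-- Deleted (and old versions of updated) rows:
SELECT * FROM Extract_Previous
EXCEPT
SELECT * FROM Extract_Current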
15
16
8
Transform
17
Common Transformations
18
9
Data Quality
19
Cleansing
20
10
Types of Cleansing
21
Cleansing
Don't use special values (e.g., 0, -1) in your data
They are hard to understand in query/analysis operations
Mark facts with a Data Status dimension
Normal, abnormal, outside bounds, impossible, ...
Facts can be taken in/out of analyses
Uniform treatment of NULLs
Use NULLs only for measure values (estimates instead?)

Data Status dimension:
SID  Status
1    Normal
2    Abnormal
3    Out of bounds

Sales fact table:
Sales  SID
10     1
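A sketch of taking facts in or out of an analysis via the Data Status dimension above (illustrative table names):

SELECT SUM(f.Sales)
FROM SalesFact f, DataStatus d
WHERE f.SID = d.SID
  AND d.Status = 'Normal'   -- exclude abnormal/out-of-bounds facts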
22
11
Improving Data Quality
23
Load
24
12
Load
25
Load
26
13
ETL Tools
Issues
Pipes
Redirect output from one process to input of another process
cat payments.dat | grep 'payment' | sort -r
Files versus streams/pipes
Streams/pipes: no disk overhead, fast throughput
Files: easier restart, often only possibility
Use ETL tool or write ETL code
Code: easy start, co-existence with IT infrastructure, maybe the
only possibility
Tool: better productivity on subsequent projects,
self-documenting
Load frequency
ETL time depends on data volumes
Daily load is much faster than monthly
Applies to all steps in the ETL process
28
14
SQL Server Integration Services
29
30
15
Packages
A package is a collection of
Data flows (Sources -> Transformations -> Destinations)
Connections
Control flow: Tasks, Workflows
Variables
31
A package
Arrows show precedence
constraints
Constraint values:
success (green)
failure (red)
completion (blue)
32
16
Package Control Flow
Containers provide
Structure to packages
Services to tasks
Control flow
Foreach loop container
Repeat tasks by using an enumerator
For loop container
Repeat tasks by testing a condition
Sequence container
Groups tasks and containers into
control flows that are subsets of the
package control flow
Task host container
An abstract container class which is
used implicitly
33
Tasks
A task is a unit of work
Workflow Tasks
Execute Package - execute other SSIS packages, good for structure!
Execute Process - run external application/batch file
SQL Server Tasks
Bulk Insert - fast load of data
Execute SQL - execute any SQL query
Data Preparation Tasks
File System - operations on files
FTP - up/download data
Scripting Tasks
Script - execute .NET code
Maintenance Tasks - DB maintenance
Data Flow Tasks - run data flows from sources through
transformations to destinations (this is where the work is done)
34
17
Data Flow Elements
Sources
Make external data available
All ODBC/OLE DB data sources:
RDBMS, Excel, Text files,
Transformations
Update, summarize, cleanse,
merge
Destinations
Write data to specific store
Input, Output, Error output
36
Transformations
Row Transformations
Character Map - applies string functions to character data
Derived Column - populates columns using expressions
Rowset Transformations (rowset = tabular data)
Aggregate - performs aggregations
Sort - sorts data
Percentage Sampling - creates sample data set by setting %
Split and Join Transformations
Conditional Split - routes data rows to different outputs
Merge - merges two sorted data sets
Lookup Transformation - looks up ref values by exact match
Other Transformations
Export Column - inserts data from a data flow into a file
Import Column - reads data from a file and adds it to a data flow
Slowly Changing Dimension - configures update of a SCD
37
18
A Few Hints on ETL Design
Don't implement all transformations in one step!
Build first step and check that result is as expected
Add second step and execute both, check result (How to check?)
Add third step
Test SQL statements before putting into IS
Do one thing at a time
Copy source data one-by-one to the data staging area (DSA)
Compute deltas
Only if doing incremental load
Handle versions and DW keys
Versions only if handling slowly changing dimensions
Implement complex transformations
Load dimensions
Load facts
38
Summary
Extract
Transformations/cleansing
Load
39
19