
MODULE 3: DATA WAREHOUSING

Introduction

This module explains the necessary concepts related to data warehousing.

Learning Objectives

At the end of this module, the students should be able to:

1. Define Data Warehousing.

2. Identify the types and functions of Data Warehousing.

3. Enumerate the Data Warehousing Delivery Process and Data Warehousing System Process.

4. Identify the Data Warehousing Architecture.

Lesson 1: Data Warehousing Overview

The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a data
warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data
helps analysts make informed decisions in an organization.

Because of the transactions that occur on a daily basis, an operational database undergoes frequent
changes. If a business executive wants to analyze previous feedback on any data, such as a product, a
supplier, or consumer data, the executive will have no data to analyze because the previous data has
been updated as a result of transactions.

A data warehouse provides us with aggregated and consolidated data in a multidimensional format. A
data warehouse provides us with Online Analytical Processing (OLAP) tools in addition to a generalized
and consolidated view of data. These tools enable us to conduct interactive and effective data analysis in
a multidimensional space. Data generalization and data mining occur as a result of this analysis.

To improve interactive knowledge mining at multiple levels of abstraction, data mining functions such as
association, clustering, classification, and prediction can be combined with OLAP operations. As a result,
data warehouses have become increasingly important as a platform for data analysis and online
analytical processing.

Why is a Data Warehouse distinct from operational databases?

For the following reasons, a data warehouse is kept separate from operational databases:

 An operational database is designed for common tasks and workloads such as searching for
specific records, indexing, and so on. In contrast, data warehouse queries are commonly complex and
involve a wide variety of data.

 Multiple transactions can be processed concurrently in operational databases. Concurrency
control and recovery mechanisms are required for operational databases to ensure the
database's robustness and consistency.
 An operational database query allows both read and modify operations, whereas an OLAP query
requires only read access to the stored data.

 An operational database keeps current data. A data warehouse, on the other hand, stores
historical data.

Data Warehouse Features

 Subject Oriented − A data warehouse is subject oriented because it offers information about a
specific subject rather than the organization's ongoing operations. Such subjects include products,
customers, suppliers, sales, revenue, and so on. A data warehouse is not
concerned with ongoing operations; instead, it focuses on data analysis and design for
decision making.

 Integrated − A data warehouse is built by integrating data from heterogeneous sources such as
relational databases, flat files, and so on. This integration improves the effectiveness of
data analysis.

 Time Variant − A data warehouse's data is associated with a specific time period. A data
warehouse's data provides historical information.

 Non-volatile − When new data is added to a non-volatile storage device, the previous data is not
erased. Because a data warehouse is kept separate from an operational database, frequent
changes in the operational database are not reflected in the data warehouse.

Note − Because a data warehouse is physically stored separately from the operational database, it
does not require transaction processing, recovery, or concurrency controls.
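As a small illustration of the non-volatile and time-variant properties, the following Python sketch (with invented product data) contrasts an operational store, which overwrites values, with a warehouse, which only appends time-stamped rows:

```python
from datetime import date

# Hypothetical in-memory stores to contrast the two behaviours.
operational_db = {}   # keyed by product_id: holds only the current price
warehouse = []        # append-only list of (snapshot_date, product_id, price)

def update_price(product_id, price, snapshot_date):
    # Operational database: the old value is overwritten (volatile).
    operational_db[product_id] = price
    # Data warehouse: a new time-stamped row is appended (non-volatile,
    # time-variant), so history is preserved for analysis.
    warehouse.append((snapshot_date, product_id, price))

update_price("P1", 100, date(2023, 1, 1))
update_price("P1", 120, date(2023, 2, 1))

print(operational_db)   # {'P1': 120} -- only current data survives
print(len(warehouse))   # 2 -- both historical prices are retained
```

This is why an executive querying only the operational store cannot analyze previous feedback: the earlier price of P1 no longer exists there, while the warehouse still holds it.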

Data Warehouse Applications

A data warehouse assists business executives in organizing, analyzing, and making decisions
based on their data. A data warehouse is the sole component of an enterprise management's plan-
execute-assess "closed-loop" feedback system. Data warehouses are commonly used in the following
industries:

 Financial services

 Banking services

 Consumer goods

 Retail sectors

 Controlled manufacturing

Types of Data Warehouse

The three types of data warehouse applications discussed below are information processing,
analytical processing, and data mining.

Information Processing − A data warehouse enables the processing of the data stored in it. The data
can be processed by means of querying, basic statistical analysis, and reporting via crosstabs, tables,
charts, or graphs.

Analytical Processing − A data warehouse facilitates the analytical processing of the information it
stores. Basic OLAP operations such as slice-and-dice, drill down, drill up, and pivoting can be used to
analyze the data.

Data Mining − Data mining contributes to knowledge discovery by revealing hidden patterns and
associations, developing analytical models, and performing classification and prediction. Mining results
can be presented using visualization tools.

Operational Database (OLTP) vs. Data Warehouse (OLAP)

1. OLTP involves day-to-day processing; OLAP involves historical processing of information.

2. OLTP systems are used by clerks, DBAs, or database professionals; OLAP systems are used by
knowledge workers such as executives, managers, and analysts.

3. OLTP is used to run the business; OLAP is used to analyze the business.

4. OLTP focuses on data in; OLAP focuses on information out.

5. OLTP is based on the Entity Relationship Model; OLAP is based on the Star Schema, Snowflake
Schema, and Fact Constellation Schema.

6. OLTP is application oriented; OLAP is subject oriented.

7. OLTP contains current data; OLAP contains historical data.

8. OLTP provides primitive and highly detailed data; OLAP provides summarized and consolidated data.

9. OLTP provides a detailed and flat relational view of data; OLAP provides a summarized and
multidimensional view of data.

10. In OLTP the number of users is in thousands; in OLAP it is in hundreds.

11. In OLTP the number of records accessed is in tens; in OLAP it is in millions.

12. An OLTP database ranges from 100 MB to 100 GB; an OLAP database from 100 GB to 100 TB.

13. OLTP systems provide high performance; OLAP systems are highly flexible.

Lesson 2: Data Warehousing Concepts

The process of creating and utilizing a data warehouse is known as data warehousing. A data
warehouse is built by combining data from disparate sources to support analytical reporting, structured
and/or ad hoc queries, and decision making. Data warehousing entails cleaning, integrating, and
consolidating data.

Using Data Warehouse Information

There are decision support technologies that can help you make use of the data in a data
warehouse. These technologies assist executives in making efficient and effective use of the warehouse.
They can gather information in the warehouse, evaluate it, and make choices based on it. The data
gathered in a warehouse can be used in any of the following domains:
 Tuning Production Strategies − Product strategies can be fine-tuned by repositioning products
and managing product portfolios by comparing quarterly or yearly sales.

 Customer Analysis − Customer analysis is performed by evaluating the customer's purchasing
preferences, purchasing time, budget cycles, and so on.

 Operations Analysis − Data warehousing and operations analysis also aid in customer
relationship management and environmental corrections. We can also use the data to analyze
business operations.

Integrating Heterogeneous Databases

We have two approaches for integrating heterogeneous databases –

 Query-driven Approach

 Update-driven Approach

Query-Driven Approach

This is the standard method for trying to integrate disparate databases. This method was used
to build wrappers and integrators on top of multiple heterogeneous databases. These integrators are
also known as mediators.

Process of Query-Driven Approach

 When a query is issued to a client, a metadata dictionary transforms the query into an
appropriate form for each of the involved heterogeneous sites.

 These queries are now mapped and routed to the local query processor.

 The results from various sites are combined to form a global answer set.
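The steps above can be sketched in Python; the site names, dialects, and the translate/run_local helpers are invented stand-ins for a real metadata dictionary and local query processors:

```python
# Hypothetical mediator: translates a global query for each heterogeneous
# site, runs the site-local form, and merges the partial answers.

SITES = {
    "crm": {"dialect": "sql",  "customers": [("C1", "Alice"), ("C2", "Bob")]},
    "erp": {"dialect": "flat", "customers": [("C3", "Carol")]},
}

def translate(query, dialect):
    # A real metadata dictionary would rewrite the query per site schema;
    # here we just tag it with the target dialect.
    return f"{query} /* dialect={dialect} */"

def run_local(site, query):
    # Stand-in for the local query processor: each site returns its own rows.
    return SITES[site]["customers"]

def global_query(query):
    answer = []
    for site, meta in SITES.items():
        local_q = translate(query, meta["dialect"])
        answer.extend(run_local(site, local_q))
    return answer  # the combined global answer set

result = global_query("SELECT id, name FROM customers")
print(len(result))  # 3 rows merged from both sites
```

The per-query translation and merging done here at read time is exactly the overhead that makes the query-driven approach expensive for frequent or aggregating queries.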

Disadvantages

 A query-driven approach necessitates complex integration and filtering processes.

 This method is extremely inefficient.

 It is prohibitively expensive for frequent queries.

 This method is also prohibitively expensive for queries that require aggregations.

Update-Driven Approach

This is an alternative approach to the traditional one. Today's data warehouse systems use an
update-driven approach rather than the traditional approach discussed earlier. In an update-driven
approach, information from multiple heterogeneous sources is integrated in advance and stored in a
warehouse. This information is available for direct querying and analysis.
Advantages

This approach has the following benefits:

 It provides high performance

 The data is copied, processed, integrated, annotated, summarized, and restructured in
advance in a semantic data store.

 To process data from local sources, query processing does not necessitate the use of an
interface.

Data Warehouse Tools and Utilities Functions

The functions of data warehouse tools and utilities are as follows:

 Data Extraction − It entails gathering data from a variety of heterogeneous sources.

 Data Cleaning − Involves finding and correcting the errors in data.

 Data Transformation − Involves converting the data from legacy format to warehouse format.

 Data Loading − Involves sorting, summarizing, consolidating, checking integrity, and building
indices and partitions.

 Refreshing − Involves updating from data sources to warehouse.

Note − Data cleaning and data transformation are important steps in improving the quality of data and
data mining results.
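A minimal Python sketch of these functions, assuming an invented legacy record layout, chains extraction, cleaning, transformation, and loading:

```python
# Sketch of the tool/utility functions named above. The (name, amount)
# legacy layout and the warehouse dict layout are assumptions.

def extract(sources):
    # Data Extraction: gather rows from heterogeneous sources.
    rows = []
    for source in sources:
        rows.extend(source)
    return rows

def clean(rows):
    # Data Cleaning: drop rows with missing amounts, normalise names.
    return [(name.strip().title(), amount)
            for name, amount in rows if amount is not None]

def transform(rows):
    # Data Transformation: legacy (name, amount) tuples -> warehouse dicts.
    return [{"customer": n, "amount": a} for n, a in rows]

def load(records):
    # Data Loading: sort and summarise before building the final store.
    records.sort(key=lambda r: r["customer"])
    total = sum(r["amount"] for r in records)
    return {"rows": records, "total_amount": total}

legacy_a = [(" alice ", 10), ("bob", None)]
legacy_b = [("CAROL", 5)]
warehouse = load(transform(clean(extract([legacy_a, legacy_b]))))
print(warehouse["total_amount"])  # 15 (the row with no amount was cleaned out)
```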

Lesson 3: Data Warehousing Delivery Process

A data warehouse is never static; it evolves in tandem with the growth of the business. As a
business evolves, so do its requirements, and a data warehouse must be designed to keep up. As a
result, flexibility is required in a data warehouse system.

To deliver a data warehouse, ideally, there should be a delivery process. However, data
warehouse projects are frequently plagued by a slew of issues that make it difficult to complete tasks
and deliverables in the strict and orderly manner required by the waterfall method.

Frequently, the requirements are not completely understood. Only after gathering and studying
all of the requirements can architectures, designs, and build components be completed.

Delivery Method

The delivery method is a variation of the joint application development approach, adapted for
data warehouse delivery. To reduce risks, we stage the delivery of the data warehouse.

The approach we will discuss here does not shorten overall delivery times, but ensures that
business benefits are delivered incrementally throughout the development process.

Note − The delivery process is broken into phases to reduce the project and delivery risk.
The following diagram explains the stages in the delivery process –

Image source: www.tutorialspoint.com/dwh/dwh_delivery_process.htm

IT Strategy

Data warehouses are strategic investments that require a business process to generate benefits. IT
Strategy is required to procure and retain funding for the project.

Business Case

The goal of a business case is to estimate the business benefits of implementing a data
warehouse.

These benefits may not be quantifiable, but the projected benefits must be stated clearly. If a
data warehouse lacks a clear business case, the company is likely to face credibility issues at some point
during the delivery process. As a result, in data warehouse projects, we must comprehend the business
case for investment.

Education and Prototyping

Before settling on a solution, organizations experiment with the concept of data analysis and
educate themselves on the importance of having a data warehouse.

Prototyping addresses this issue. It aids in comprehending the feasibility and benefits of a data
warehouse. Prototyping on a small scale can help the educational process as long as

 The prototype addresses a defined technical objective.

 The prototype can be thrown away after the feasibility concept has been shown.

 The activity addresses a small subset of eventual data content of the data warehouse.

 The activity timescale is non-critical.

The following points are to be kept in mind to produce an early release and deliver business benefits.

 Identify the architecture that is capable of evolving.

 Focus on business requirements and technical blueprint phases.

 Limit the scope of the first build phase to the minimum that delivers business benefits.
 Understand the short-term and medium-term requirements of the data warehouse.

Business Requirements

We must ensure that the overall requirements are understood in order to provide high-quality
deliverables. We can design a solution to meet short-term requirements if we understand the business
requirements for both the short and medium term. The short-term solution can then be expanded into a
comprehensive solution.

This stage determines the following aspects:

 The business rules to be applied to the data.

 The logical model for information within the data warehouse.

 The query profiles for the immediate requirement.

 The source systems that provide this data.

Technical Blueprint

This phase must produce an overall architecture that meets the long-term requirements. This
phase also provides the components that must be implemented quickly in order to generate any
business benefits.

The blueprint needs to identify the following.

 The overall system architecture.

 The data retention policy.

 The backup and recovery strategy.

 The server and data mart architecture.

 The capacity plan for hardware and infrastructure.

 The components of database design.

Building the Version

In this stage, the first production deliverable is produced. This production deliverable is the
smallest component of a data warehouse. This smallest component adds business benefit.

History Load

This is the phase where the remainder of the required history is loaded into the data warehouse. In this
phase, we do not add new entities, but additional physical tables would probably be created to store
increased data volumes.

Let us take an example. Suppose the build version phase has delivered a retail sales analysis
data warehouse with 2 months’ worth of history. This information will allow the user to analyze only the
recent trends and address the short-term issues. The user in this case cannot identify annual and
seasonal trends.

To enable such analysis, the last 2 years' sales history could be loaded from the archive. The 40 GB
of data is now extended to 400 GB.

Note − The backup and recovery procedures may become complex, therefore it is recommended to
perform this activity within a separate phase.

Ad hoc Query

In this phase, we configure an ad hoc query tool that is used to operate the data warehouse. These
tools can generate the database queries.

Note − It is recommended not to use these access tools when the database is being substantially
modified.

Automation

In this phase, operational management processes are fully automated. These would include −

 Transforming the data into a form suitable for analysis.

 Monitoring query profiles and determining appropriate aggregations to maintain system
performance.

 Extracting and loading data from different source systems.

 Generating aggregations from predefined definitions within the data warehouse.

 Backing up, restoring, and archiving the data.

Extending Scope

In this phase, the data warehouse is extended to address a new set of business requirements. The scope
can be extended in two ways −

 By loading additional data into the data warehouse.

 By introducing new data marts using the existing information.

Note − This phase should be performed separately, since it involves substantial efforts and complexity.

Requirements Evolution

From the standpoint of the delivery process, the requirements are always changeable; they are
never static. The delivery process must support this and allow the changes to be
reflected in the system.

This problem is solved by designing the data warehouse around the use of data within business
processes rather than the data requirements of existing queries.

The architecture is intended to change and grow to meet the needs of the business; the process
operates as a pseudo-application development process, in which new requirements are constantly fed
into the development activities and partial deliverables are produced. These partial deliverables are fed
back to users and then reworked to ensure that the overall system is constantly updated to meet
business needs.

Lesson 4: Data Warehousing System Process

Process Flow in Data Warehouse

There are four major processes that contribute to a data warehouse −

 Extract and load the data.

 Cleaning and transforming the data.

 Backup and archive the data.

 Managing queries and directing them to the appropriate data sources.

Image source: www.tutorialspoint.com/dwh/dwh_system_processes.htm

1. Extract and Load Process

Data extraction takes data from the source systems. Data load takes the extracted data and loads it into
the data warehouse.

Note − Before loading the data into the data warehouse, the information extracted from the external
sources must be reconstructed.

Controlling the Process

Controlling the process involves determining when to start data extraction and performing
consistency checks on the data. The controlling process ensures that the tools, the logic modules, and
the programs are executed in the correct sequence and at the correct time.

When to Initiate Extract

Data needs to be in a consistent state when it is extracted, i.e., the data warehouse should represent a
single, consistent version of the information to the user.

For example, in a customer profiling data warehouse in the telecommunication sector, it is illogical to merge
the list of customers at 8 pm on Wednesday from a customer database with the customer subscription
events up to 8 pm on Tuesday. This would mean that we are finding the customers for whom there are
no associated subscriptions.
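This consistency rule can be sketched as follows; the snapshot timestamps are hypothetical, and the safe extract point is simply the oldest snapshot across all sources:

```python
from datetime import datetime

# Hypothetical check before initiating an extract: every source must be
# taken as of the same point in time, otherwise the merged view (customers
# vs. subscriptions) would be inconsistent.

source_snapshots = {
    "customer_db":         datetime(2023, 5, 3, 20, 0),  # Wednesday 8 pm
    "subscription_events": datetime(2023, 5, 2, 20, 0),  # Tuesday 8 pm
}

def consistent_extract_point(snapshots):
    # The only safe extract point is the oldest snapshot: data after that
    # moment may be missing from some source.
    return min(snapshots.values())

cutoff = consistent_extract_point(source_snapshots)
print(cutoff)  # extract both sources as of Tuesday 8 pm
```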

Loading the Data

After extracting the data, it is loaded into a temporary data store where it is cleaned up and made
consistent.

Note − Consistency checks are executed only when all the data sources have been loaded into the
temporary data store.

2. Clean and Transform Process

Once the data is extracted and loaded into the temporary data store, it is time to perform cleaning and
transforming. Here is the list of steps involved in cleaning and transforming −

 Clean and transform the loaded data into a structure

 Partition the data

 Aggregation

Clean and Transform the Loaded Data into a Structure

Cleaning and transforming the loaded data help speed up the queries. It can be done by making the data
consistent –

 within itself.

 with other data within the same data source.

 with the data in other source systems.

 with the existing data present in the warehouse.

Transforming involves converting the source data into a structure. Structuring the data increases the
query performance and decreases the operational cost. The data contained in a data warehouse must
be transformed to support performance requirements and control the ongoing operational costs.

Partition the Data

Partitioning optimizes hardware performance and simplifies the management of the data warehouse.
Here we partition each fact table into multiple separate partitions.
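A sketch of month-based partitioning of a hypothetical fact table:

```python
from collections import defaultdict
from datetime import date

# Partitioning a fact table by month so queries and maintenance touch only
# the relevant partition. The (date, product, amount) layout is an assumption.

fact_sales = [
    (date(2023, 1, 5),  "P1", 100),
    (date(2023, 1, 20), "P2", 50),
    (date(2023, 2, 3),  "P1", 70),
]

def partition_by_month(rows):
    partitions = defaultdict(list)
    for sale_date, product, amount in rows:
        partitions[(sale_date.year, sale_date.month)].append(
            (sale_date, product, amount))
    return dict(partitions)

parts = partition_by_month(fact_sales)
print(sorted(parts))          # [(2023, 1), (2023, 2)]
print(len(parts[(2023, 1)]))  # 2 rows fall in the January partition
```

A query restricted to January now scans only that partition, and an old month can be archived or dropped without touching the rest of the table.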

Aggregation

Aggregation is required to speed up common queries. Aggregation relies on the fact that most common
queries will analyze a subset or an aggregation of the detailed data.
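A sketch of precomputing such an aggregation (total sales per product) from invented detail rows:

```python
from collections import defaultdict

# Precompute a common aggregation once, so repeated queries read the small
# summary instead of rescanning the detailed data.

detail = [("P1", 100), ("P2", 50), ("P1", 70)]

def build_aggregate(rows):
    summary = defaultdict(int)
    for product, amount in rows:
        summary[product] += amount
    return dict(summary)

sales_by_product = build_aggregate(detail)
print(sales_by_product["P1"])  # 170 -- answered without touching the detail
```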

3. Backup and Archive the Data

In order to recover the data in the event of data loss, software failure, or hardware failure, it is
necessary to keep regular backups. Archiving involves removing the old data from the system in a
format that allows it to be quickly restored whenever required.

For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years, with
the latest 6 months of data kept online. In such a scenario, there is often a requirement to be able to
do month-on-month comparisons for this year and last year. In this case, we require some data to be
restored from the archive.

4. Query Management Process

This process performs the following functions −


 manages the queries.

 helps speed up the execution time of queries.

 directs the queries to their most effective data sources.

 ensures that all the system sources are used in the most effective way.

 monitors actual query profiles.

The information generated in this process is used by the warehouse management process to determine
which aggregations to generate. This process does not generally operate during the regular load of
information into the data warehouse.
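A sketch of query direction follows; the catalogue entries and row counts are invented, and the manager simply routes each query to the cheapest source able to answer it:

```python
# Hypothetical query manager: an aggregate query is directed to the small
# summary table, a detail query to the large fact table.

catalog = {
    "sales_summary": {"grain": "product",     "rows": 1_000},
    "sales_fact":    {"grain": "transaction", "rows": 50_000_000},
}

def route(query_grain):
    # Collect the sources that can answer a query at this grain.
    candidates = [name for name, meta in catalog.items()
                  if meta["grain"] == query_grain]
    if not candidates:
        return "sales_fact"  # fall back to the detailed fact table
    # Prefer the cheapest (smallest) source that can still answer it.
    return min(candidates, key=lambda name: catalog[name]["rows"])

print(route("product"))      # sales_summary
print(route("transaction"))  # sales_fact
```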

Lesson 5: Data Warehousing Architecture

Business Analysis Framework

The business analyst uses the information in the data warehouse to measure performance and
make critical adjustments in order to win over other business stakeholders in the market.

Having a data warehouse offers the following advantages −

 Since a data warehouse can gather information quickly and efficiently, it can enhance business
productivity.

 A data warehouse provides us a consistent view of customers and items; hence, it helps us
manage customer relationship.

 A data warehouse also helps in bringing down costs by tracking trends and patterns over a long
period in a consistent and reliable manner.

To design an effective and efficient data warehouse, we need to understand and analyze the business
needs and construct a business analysis framework. Each person has different views regarding the
design of a data warehouse. These views are as follows −

 The top-down view − This view allows the selection of relevant information needed for a data
warehouse.

 The data source view − This view presents the information being captured, stored, and
managed by the operational system.

 The data warehouse view − This view includes the fact tables and dimension tables. It
represents the information stored inside the data warehouse.

 The business query view − It is the view of the data from the viewpoint of the end-user.

Three-Tier Data Warehouse Architecture


Generally, a data warehouse adopts a three-tier architecture. Following are the three tiers of the data
warehouse architecture.

 Bottom Tier − The bottom tier of the architecture is the data warehouse database server. It is
the relational database system. We use the back-end tools and utilities to feed data into the
bottom tier. These back-end tools and utilities perform the Extract, Clean, Load, and refresh
functions.

 Middle Tier − In the middle tier, we have the OLAP Server that can be implemented in either of
the following ways.

o By Relational OLAP (ROLAP), which is an extended relational database management
system. The ROLAP maps the operations on multidimensional data to standard
relational operations.

o By Multidimensional OLAP (MOLAP) model, which directly implements the
multidimensional data and operations.

 Top-Tier − This tier is the front-end client layer. This layer holds the query tools and reporting
tools, analysis tools and data mining tools.
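The ROLAP mapping can be illustrated with a small sketch that turns a multidimensional roll-up request into an ordinary SQL GROUP BY; the table and column names are assumptions:

```python
# Sketch of the ROLAP idea: a multidimensional request (measure, dimensions,
# slice condition) is mapped to a standard relational GROUP BY statement.

def rollup_sql(measure, dimensions, table, where=None):
    dims = ", ".join(dimensions)
    sql = f"SELECT {dims}, SUM({measure}) FROM {table}"
    if where:
        sql += f" WHERE {where}"      # the slice condition
    sql += f" GROUP BY {dims}"        # the roll-up over the chosen dimensions
    return sql

print(rollup_sql("amount", ["region", "quarter"], "sales_fact",
                 where="year = 2023"))
# SELECT region, quarter, SUM(amount) FROM sales_fact WHERE year = 2023 GROUP BY region, quarter
```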

The following diagram depicts the three-tier architecture of a data warehouse –

Image source: www.tutorialspoint.com/dwh/dwh_architecture.htm

Data Warehouse Models

From the perspective of data warehouse architecture, we have the following data warehouse models −

 Virtual Warehouse

 Data mart

 Enterprise Warehouse

Virtual Warehouse

The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a
virtual warehouse. Building a virtual warehouse requires excess capacity on operational database
servers.
Data Mart

A data mart contains a subset of organization-wide data. This subset of data is valuable to specific
groups within an organization.

In other words, we can claim that data marts contain data specific to a particular group. For example,
the marketing data mart may contain data related to items, customers, and sales. Data marts are
confined to subjects.

Points to remember about data marts −

 Windows-based or Unix/Linux-based servers are used to implement data marts. They are
implemented on low-cost servers.

 The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather
than months or years.

 The life cycle of a data mart may be complex in the long run if its planning and design are not
organization-wide.

 Data marts are small in size.

 Data marts are customized by department.

 The source of a data mart is a departmentally structured data warehouse.

 Data marts are flexible.

Enterprise Warehouse

 An enterprise warehouse collects all the information and the subjects spanning an entire
organization.

 It provides us enterprise-wide data integration.

 The data is integrated from operational systems and external information providers.

 This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.

Load Manager

This component performs the operations required to extract and load data into the warehouse.

The size and complexity of the load manager varies between specific solutions, from one data
warehouse to another.

Load Manager Architecture

The load manager performs the following functions −

 Extract the data from the source system.

 Fast Load the extracted data into a temporary data store.

 Perform simple transformations into a structure similar to the one in the data warehouse.

Image source: www.tutorialspoint.com/dwh/dwh_architecture.htm


Extract Data from Source

The data is extracted from the operational databases or the external information providers.

A gateway is an application program that is used to extract data. It is supported by the underlying DBMS
and allows a client program to generate SQL to be executed at a server. Open Database Connectivity
(ODBC) and Java Database Connectivity (JDBC) are examples of gateways.
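As a rough illustration of the gateway idea (client-generated SQL executed at a server), the sketch below uses Python's built-in sqlite3 module as a stand-in for an ODBC/JDBC connection; the table and rows are invented:

```python
import sqlite3

# The client program generates SQL; the gateway ships it to the server's
# DBMS for execution. sqlite3 here stands in for a real ODBC/JDBC gateway.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id TEXT, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("C1", "Alice"), ("C2", "Bob")])

# Client-generated SQL, executed at the server, results returned to the client.
rows = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
print(rows)  # [('C1', 'Alice'), ('C2', 'Bob')]
conn.close()
```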

Fast Load

 In order to minimize the total load window, the data needs to be loaded into the warehouse in
the fastest possible time.

 The transformations affect the speed of data processing.

 It is more effective to load the data into a relational database prior to applying transformations
and checks.

 Gateway technology proves unsuitable, since gateways tend not to perform well when large
data volumes are involved.

Simple Transformations

While loading, it may be required to perform simple transformations. After this has been completed, we
are in a position to do the complex checks. Suppose we are loading the EPOS sales transactions; we need
to perform the following checks:

 Strip out all the columns that are not required within the warehouse.

 Convert all the values to required data types.
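These two checks can be sketched as follows, with an invented EPOS row layout:

```python
# Sketch of the two simple transformations named above: strip out columns
# not required within the warehouse, and convert the remaining values to
# their required data types. Column names and rows are invented.

REQUIRED = ("store_id", "amount")          # columns kept in the warehouse
TYPES = {"store_id": str, "amount": float}  # required data types

raw_rows = [
    {"store_id": "S1", "amount": "19.99", "till_firmware": "7.2"},
    {"store_id": "S2", "amount": "5.00",  "till_firmware": "7.1"},
]

def simple_transform(row):
    # Keep only the required columns and coerce each value to its type.
    return {col: TYPES[col](row[col]) for col in REQUIRED}

loaded = [simple_transform(r) for r in raw_rows]
print(loaded[0])  # {'store_id': 'S1', 'amount': 19.99}
```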

Warehouse Manager

A warehouse manager is responsible for the warehouse management process. It consists of third-party
system software, C programs, and shell scripts.

The size and complexity of warehouse managers varies between specific solutions.

Warehouse Manager Architecture


A warehouse manager includes the following −

 The controlling process

 Stored procedures or C with SQL

 Backup/Recovery tool

 SQL Scripts

Image source: www.tutorialspoint.com/dwh/dwh_architecture.htm

Operations Performed by the Warehouse Manager

 A warehouse manager analyzes the data to perform consistency and referential integrity checks.

 Creates indexes, business views, and partition views against the base data.

 Generates new aggregations and updates existing aggregations. Generates normalizations.

 Transforms and merges the source data into the published data warehouse.

 Backs up the data in the data warehouse.

 Archives the data that has reached the end of its captured life.

Note − A warehouse manager also analyzes query profiles to determine whether indexes and
aggregations are appropriate.
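One of those consistency checks, referential integrity between a fact table and a dimension, can be sketched with invented rows:

```python
# Sketch of a referential integrity check a warehouse manager might run:
# every product key in the fact table must exist in the dimension table.

dim_product = {"P1", "P2"}
fact_sales = [("P1", 100), ("P3", 40)]   # P3 has no dimension row

def referential_orphans(facts, dimension_keys):
    # Return the fact keys that have no matching dimension entry.
    return [key for key, _ in facts if key not in dimension_keys]

orphans = referential_orphans(fact_sales, dim_product)
print(orphans)  # ['P3'] -- rows to reject or quarantine before publishing
```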

Query Manager

 The query manager is responsible for directing the queries to the suitable tables.

 By directing the queries to the appropriate tables, the speed of querying and response generation
can be increased.

 The query manager is responsible for scheduling the execution of the queries posed by the user.

Query Manager Architecture

The following diagram shows the architecture of a query manager. It includes the following:

 Query redirection via C tool or RDBMS

 Stored procedures
 Query management tool

 Query scheduling via C tool or RDBMS

 Query scheduling via third-party software

Image source: www.tutorialspoint.com/dwh/dwh_architecture.htm

Detailed Information

Detailed information is not kept online; rather, it is aggregated to the next level of detail and then
archived to tape. The detailed-information part of the data warehouse keeps the detailed information in
the starflake schema. Detailed information is loaded into the data warehouse to supplement the
aggregated data.

The following diagram shows a pictorial impression of where detailed information is stored and how it is
used.

Image source: www.tutorialspoint.com/dwh/dwh_architecture.htm


Note − If detailed information is held offline to minimize disk storage, we should make sure that the data
has been extracted, cleaned up, and transformed into starflake schema before it is archived.

Summary Information

Summary information is a part of the data warehouse that stores predefined aggregations. These
aggregations are generated by the warehouse manager. Summary information must be treated as
transient; it changes on the fly in order to respond to changing query profiles.

The points to note about summary information are as follows −

 Summary information speeds up the performance of common queries.

 It increases the operational cost.

 It needs to be updated whenever new data is loaded into the data warehouse.

 It may not have been backed up, since it can be generated fresh from the detailed
information.
