Unit 1
DATA WAREHOUSING
● Financial services
● Banking services
● Consumer goods
● Retail sectors
● Controlled manufacturing
● Query-driven Approach
● Update-driven Approach
Query-Driven Approach:
This is the traditional approach to integrating heterogeneous databases.
It builds wrappers and integrators on top of multiple heterogeneous
databases. These integrators are also known as mediators.
● Now these queries are mapped and sent to the local query
processor.
Disadvantages
Update-Driven Approach:
This is an alternative to the traditional approach. Today's data
warehouse systems follow the update-driven approach rather than the
traditional approach discussed earlier. In the update-driven approach,
information from multiple heterogeneous sources is integrated in
advance and stored in a warehouse. This information is available for
direct querying and analysis.
Advantages:
The following are the functions of data warehouse tools and utilities-
Business analysts get information from the data warehouse to measure
performance and make critical adjustments in order to win over other
business holders in the market. Having a data warehouse offers the
following advantages-
● The data source view- This view presents the information being
captured, stored, and managed by the operational system.
● The data warehouse view- This view includes the fact tables and
dimension tables. It represents the information stored inside the
data warehouse.
● The business query view- It is the view of the data from the
viewpoint of the end-user.
● Virtual Warehouse
● Data mart
● Enterprise Warehouse
Virtual Warehouse
Data Mart
In other words, we can claim that data marts contain data specific to a
particular group. For example, the marketing data mart may contain data
related to items, customers, and sales. Data marts are confined to
subjects.
Enterprise Warehouse
● An enterprise warehouse collects all the information and the
subjects spanning an entire organization.
● It provides enterprise-wide data integration.
● The data is integrated from operational systems and external
information providers.
● This information can vary from a few gigabytes to hundreds of
gigabytes, terabytes or beyond.
Fast Load
● In order to minimize the total load window, the data needs to be
loaded into the warehouse in the fastest possible time.
● Transformations affect the speed of data processing.
● It is more effective to load the data into a relational database prior
to applying transformations and checks.
● Gateway technology is not suitable, since it tends not to perform
well when large data volumes are involved.
Simple Transformations
Warehouse Manager
Query Manager:
● Query manager is responsible for directing the queries to the
suitable tables.
● By directing the queries to appropriate tables, the speed of
querying and response generation can be increased.
● Query manager is responsible for scheduling the execution of the
queries posed by the user.
Detailed Information:
Summary Information
Database System:
A database system is the traditional way of storing and retrieving
data. The major task of a database system is to perform query processing.
These systems are generally referred to as online transaction processing
(OLTP) systems. They are used for the day-to-day operations of an
organization.
Characteristics of Database
● Offers security and removes redundancy
● Allows multiple views of the data
● Follows ACID compliance (Atomicity, Consistency, Isolation, and
Durability)
● Allows insulation between programs and data
● Supports sharing of data and multiuser transaction processing
● Relational databases support a multi-user environment
Data Warehouse:
A data warehouse is a place where huge amounts of data are stored. It is
meant for users or knowledge workers whose role is data analysis and
decision making. These systems are supposed to organize and present
data in different formats and forms in order to serve the needs of a
specific user for a specific purpose. These systems are referred to as
online analytical processing (OLAP) systems.
● Competitive advantage
The huge returns on investment for those companies that have
successfully implemented a data warehouse are evidence of the
enormous competitive advantage that accompanies this technology.
The competitive advantage is gained by allowing decision-makers
access to data that can reveal previously unavailable, unknown, and
untapped information on, for example, customers, trends, and demands.
Metadata is simply defined as data about data. The data that is used to
represent other data is known as metadata. For example, the index of a
book serves as metadata for the contents of the book. In other words,
we can say that metadata is the summarized data that leads us to
detailed data. In terms of a data warehouse, we can define metadata as
follows.
● Metadata is the road-map to a data warehouse.
● Metadata in a data warehouse defines the warehouse objects.
● Metadata acts as a directory. This directory helps the decision
support system to locate the contents of a data warehouse.
Categories of Metadata:
Role of Metadata:
Metadata Repository:
Data Cubes
A data cube represents data (sometimes called facts) along some
dimensions of interest. For example, in OLAP such dimensions could be
the subsidiaries a company has, the products the company offers, and
time; in this setup, a fact would be a sales event where a particular
product has been sold in a particular subsidiary at a particular time. In
satellite image time series, the dimensions would be latitude and
longitude coordinates and time, and a fact (sometimes called a measure)
would be a pixel at a given place and time as taken by the satellite
(following some processing that is not of concern here). Even though it is
called a cube (and the examples above happen to be 3-dimensional for
brevity), a data cube is generally a multi-dimensional concept that can be
1-dimensional, 2-dimensional, 3-dimensional, or higher-dimensional. In
any case, every dimension divides the data into groups of cells, and each
cell in the cube represents a single measure of interest. Sometimes a cube
holds only a few values, with the rest being empty (i.e. undefined);
sometimes most or all cube coordinates hold a cell value. In the first
case the data are called sparse, in the second case dense, although there
is no hard delineation between the two.
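As a rough illustration of the cube idea above, the following Python sketch stores a sparse three-dimensional cube as a dictionary keyed by coordinates. The dimension names, the sample sales figures, and the helper functions roll_up and density are all assumptions made for this example, not part of any particular OLAP product.

```python
from collections import defaultdict

# Hypothetical sales facts: (subsidiary, product, month) -> units sold.
# Only non-empty cells are stored, so this is a sparse representation.
cube = {
    ("Berlin", "laptop", "2024-01"): 10,
    ("Berlin", "phone", "2024-01"): 4,
    ("Paris", "laptop", "2024-02"): 7,
}

DIMENSIONS = ("subsidiary", "product", "month")


def roll_up(cells, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    idx = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for coords, value in cells.items():
        totals[tuple(coords[i] for i in idx)] += value
    return dict(totals)


def density(cells):
    """Fraction of possible cells (given the observed members) that hold a value."""
    possible = 1
    for i in range(len(DIMENSIONS)):
        possible *= len({coords[i] for coords in cells})
    return len(cells) / possible


by_product = roll_up(cube, ["product"])
cube_density = density(cube)
```

Here by_product collapses the subsidiary and time dimensions, and cube_density gives a simple measure of how sparse the cube is: three filled cells out of the eight that two subsidiaries, two products, and two months allow.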
Applications:
Stars:
The star schema is the fundamental schema among the data mart schemas,
and it is the simplest. This schema is widely used to build data
warehouses and dimensional data marts. It includes one or more fact
tables referencing any number of dimension tables. The star schema is a
special case of the snowflake schema. It is also efficient for handling
basic queries.
Simpler Queries
The join logic of a star schema is quite simple compared with the join
logic needed to fetch data from a highly normalized transactional
schema.
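The simple join logic can be sketched with Python's built-in sqlite3 module. The table names (fact_sales, dim_product, dim_store) and the sample rows are hypothetical and chosen only for illustration. Each dimension table is joined to the fact table directly by its key, with no chains of intermediate tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One fact table that references two dimension tables (a minimal star).
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    units      INTEGER
);
INSERT INTO dim_product VALUES (1, 'laptop'), (2, 'phone');
INSERT INTO dim_store   VALUES (1, 'Berlin'), (2, 'Paris');
INSERT INTO fact_sales  VALUES (1, 1, 10), (2, 1, 4), (1, 2, 7);
""")

# Every dimension is exactly one join away from the fact table.
rows = conn.execute("""
    SELECT p.name, s.city, SUM(f.units)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_store   AS s ON s.store_id   = f.store_id
    GROUP BY p.name, s.city
    ORDER BY p.name, s.city
""").fetchall()
conn.close()
```

In a highly normalized transactional schema, the same question would typically need additional joins through intermediate tables before reaching the descriptive attributes.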
The star schema is widely used by OLAP systems to design OLAP cubes
efficiently. In fact, major OLAP systems offer a ROLAP mode of
operation that can use a star schema as a source without designing a
cube structure.
Snow Flakes
Characteristics of Snowflake:
Advantage:
Disadvantage:
Concept Hierarchy
Extraction:
The first step of the ETL process is extraction. In this step, data is
extracted from various source systems, which can be in formats such as
relational databases, NoSQL, XML, and flat files, into the staging area.
It is important to extract the data into the staging area first and not
directly into the data warehouse, because the extracted data is in
various formats and may also be corrupted. Loading it directly into the
data warehouse could damage the warehouse, and rollback would be much
more difficult. This makes extraction one of the most important steps of
the ETL process.
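A minimal sketch of the extraction step, assuming two hypothetical sources: a flat file in CSV form and a JSON export standing in for a NoSQL source. The records are copied unchanged into an in-memory staging list and tagged with their origin; nothing is written to the warehouse at this stage. All names and sample values are invented for illustration.

```python
import csv
import io
import json

# Hypothetical source payloads standing in for a flat file and a NoSQL export.
CSV_SOURCE = "id,country,amount\n1,U.S.A,10.5\n2,America,\n"
JSON_SOURCE = '[{"id": 3, "country": "United States", "amount": 7}]'


def extract_csv(text):
    """Read delimited text into a list of dicts (all values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_json(text):
    """Read a JSON array of objects into a list of dicts."""
    return json.loads(text)


def extract_to_staging():
    """Collect raw records from every source, tagged with their origin."""
    staging = []
    for source, records in (("csv", extract_csv(CSV_SOURCE)),
                            ("json", extract_json(JSON_SOURCE))):
        for record in records:
            staging.append({"source": source, "raw": record})
    return staging


staging = extract_to_staging()
```

The staged records keep their source-specific quirks (string-typed numbers, an empty amount, inconsistent country names), which is why they are held apart from the warehouse until they have been transformed.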
Transformation:
The second step of the ETL process is transformation. In this step, a set
of rules or functions is applied to the extracted data to convert it into a
single standard format. It may involve the following processes/tasks:
● Filtering: Loading only certain attributes into the data warehouse.
● Cleaning: Filling up the NULL values with some default values,
mapping U.S.A, United States, and America into USA, etc.
● Joining: Joining multiple attributes into one.
● Splitting: Splitting a single attribute into multiple attributes.
● Sorting: Sorting tuples on the basis of some attribute (generally
the key attribute).
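The five tasks listed above can be sketched in a single Python function. The record layout, the COUNTRY_MAP lookup, and the default value of 0.0 for a missing amount are assumptions made for this example; a real pipeline would take such rules from its own requirements.

```python
# Hypothetical raw records as they might sit in a staging area.
raw = [
    {"id": 2, "first": "Ada", "last": "Lovelace", "country": "U.S.A",
     "joined": "2024-03-15", "notes": "vip", "amount": None},
    {"id": 1, "first": "Alan", "last": "Turing", "country": "America",
     "joined": "2023-11-02", "notes": "", "amount": 12.0},
]

COUNTRY_MAP = {"U.S.A": "USA", "United States": "USA", "America": "USA"}
KEEP = ("id", "first", "last", "country", "joined", "amount")


def transform(records):
    out = []
    for rec in records:
        # Filtering: keep only the attributes the warehouse needs.
        row = {key: rec[key] for key in KEEP}
        # Cleaning: default for NULL values, one spelling per country.
        if row["amount"] is None:
            row["amount"] = 0.0
        row["country"] = COUNTRY_MAP.get(row["country"], row["country"])
        # Joining: merge two attributes into one.
        row["name"] = f'{row.pop("first")} {row.pop("last")}'
        # Splitting: break one attribute into several.
        year, month, day = row.pop("joined").split("-")
        row.update(year=int(year), month=int(month), day=int(day))
        out.append(row)
    # Sorting: order tuples by the key attribute.
    return sorted(out, key=lambda r: r["id"])


clean = transform(raw)
```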
Loading:
The third and final step of the ETL process is loading. In this step, the
transformed data is finally loaded into the data warehouse. Sometimes
the warehouse is updated by loading data very frequently, and sometimes
loading is done at longer but regular intervals. The rate and period of
loading depend solely on the requirements and vary from system to
system.
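A minimal sketch of the loading step, using Python's built-in sqlite3 module as a stand-in for the warehouse. The sales table and its columns are hypothetical. The sketch assumes that a later load should overwrite rows with the same key, which is only one of several possible loading policies.

```python
import sqlite3


def load(conn, rows):
    """Insert new rows and overwrite existing rows that share the same key."""
    with conn:  # commits on success, rolls back if an error is raised
        conn.executemany(
            "INSERT OR REPLACE INTO sales (id, country, amount) VALUES (?, ?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, country TEXT, amount REAL)")

load(conn, [(1, "USA", 12.0), (2, "USA", 0.0)])  # initial load
load(conn, [(2, "USA", 5.5), (3, "USA", 7.0)])   # later incremental load

loaded = conn.execute("SELECT id, amount FROM sales ORDER BY id").fetchall()
conn.close()
```

Wrapping each batch in a transaction means a failed load leaves the table as it was before the batch started, whether loads run frequently or at longer intervals.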
Data Marting
A data mart usually draws data from only a few sources compared to a
data warehouse. Data marts are small in size and are more flexible than
a data warehouse.
Hybrid: This type of data mart can take data from data warehouses or
operational systems.
Finance:
With the right data warehousing solution, bankers can manage all
their available resources more effectively. They can better analyze their
consumer data, government regulations, and market trends to facilitate
better decision-making.
Education:
Healthcare:
Insurance:
Retailing: