
2. Data Warehouse and OLAP


Unit-2 Data Warehouse and OLAP

2.1 Data Warehouse


A data warehouse is a type of data management system that is designed to enable and
support business intelligence (BI) activities, especially analytics. A data warehouse
centralizes and consolidates large amounts of data from multiple sources. Data is populated
into the data warehouse (DW) through the processes of extraction, transformation, and
loading (ETL).
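
The extraction, transformation, and loading steps can be sketched roughly as follows. This is a minimal illustration, not a production pipeline: the two source layouts, the sample rows, and the `sales` table are invented for the example, and an in-memory SQLite database stands in for the warehouse.

```python
# Minimal ETL sketch: two hypothetical sources feed one warehouse table.
import sqlite3

# Extract: rows as they might arrive from two source systems (invented data).
crm_rows = [{"customer": "Asha", "amount": "120.50", "date": "2023-01-15"}]
pos_rows = [("RAVI", 80, "15/01/2023")]  # different layout and date format


def transform_crm(row):
    """Convert a CRM record to the common (customer, amount, iso_date) shape."""
    return (row["customer"].title(), float(row["amount"]), row["date"])


def transform_pos(row):
    """Convert a point-of-sale tuple, rewriting DD/MM/YYYY as YYYY-MM-DD."""
    name, amount, date = row
    day, month, year = date.split("/")
    return (name.title(), float(amount), f"{year}-{month}-{day}")


# Load: append the cleaned rows to a warehouse table (in-memory for the demo).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT, amount REAL, sale_date TEXT)")
cleaned = [transform_crm(r) for r in crm_rows] + [transform_pos(r) for r in pos_rows]
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)

loaded = warehouse.execute("SELECT * FROM sales ORDER BY customer").fetchall()
print(loaded)
```

After the load, both sources share one layout and one date format, which is the point of the transformation step.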

Characteristics:

A data warehouse has the following characteristics:

• Subject-Oriented: A data warehouse is subject-oriented: it delivers information about
a specific, well-defined theme rather than about the organization's current operations.
These themes can be sales, distribution, marketing, etc.

A data warehouse does not put its emphasis on current operations. Instead, it focuses on
modeling and analyzing data to support decision-making. It also gives a simple and
concise view of a particular theme by excluding data that is not required to make the
decisions.
• Integrated: A data warehouse combines data from various sources. These may
include cloud services, relational databases, flat files, structured and semi-structured
data, metadata, and master data. The sources are combined in a manner that is consistent,
relatable, and ideally certifiable, giving the business confidence in the data's
quality.

• Time-Variant: Data is maintained over different intervals of time, such as weekly,
monthly, or annually. The time horizon of a data warehouse is much wider than that of
operational systems based on online transaction processing (OLTP). The data held in the
data warehouse is identified with a specific interval of time and delivers information
from a historical perspective. It contains an element of time, explicitly or implicitly.
Another feature of time-variance is that once data is stored in the data warehouse, it
cannot be modified, altered, or updated.

• Non-Volatile: As the name implies, the data held in a data warehouse is permanent.
Data is not erased or deleted when new data is inserted. The warehouse therefore
accumulates a very large quantity of data over time, and this retained data is what the
warehouse's analysis tools work on.

2.2 DBMS vs Data Warehouse


The main difference between a database and a data warehouse is that a database is an
organized collection of related data, stored in a tabular format, while a data
warehouse is a central location that stores consolidated data from multiple databases.
Database                                                 | Data warehouse
---------------------------------------------------------+-----------------------------------------------------------
An organized collection of related data in tabular form  | A central location that stores consolidated data from
                                                         | multiple databases
Contains detailed data                                   | Contains summarized data
Uses OLTP                                                | Uses OLAP
Helps to perform the fundamental operations of business  | Helps to analyze the business
Slower and less accurate                                 | Fast and accurate
Application-oriented                                     | Subject-oriented
Tables and joins are complex because they are normalized | Tables and joins are simple because they are denormalized
2.3 Data Cube:
A data cube is a three-dimensional (3D) or higher-dimensional range of values. In imaging,
it is generally used to explain the time sequence of an image's data, and it is useful for
imaging spectroscopy because a spectrally-resolved image is depicted as a 3-D volume. More
generally, it is a data abstraction for evaluating aggregated data from a variety of
viewpoints.
A data cube makes data easier to interpret. It is especially useful for representing
data together with dimensions as measures of business requirements. Each dimension of a
cube represents a characteristic of the database, for example daily, monthly, or yearly
sales. The data in a data cube makes it possible to analyze almost all the figures for
virtually any or all customers, sales agents, products, and much more. Thus, a data cube
can help to establish trends and analyze performance.
Data cubes are mainly categorized into two categories:
• Multidimensional Data Cube: Most OLAP products are built on a structure in which the
cube is modeled as a multidimensional array. These multidimensional OLAP (MOLAP) products
usually offer better performance than other approaches, mainly because they can index
directly into the structure of the data cube to gather subsets of data. As the number of
dimensions grows, the cube becomes sparser: many cells that represent particular
attribute combinations contain no aggregated data. This in turn increases the storage
requirements, which may reach undesirable levels at times, making the MOLAP solution
untenable for huge data sets with many dimensions. Compression techniques might help;
however, their use can damage the natural indexing of MOLAP.
• Relational OLAP: Relational OLAP (ROLAP) makes use of the relational database model.
The ROLAP data cube is implemented as a set of relational tables (approximately twice as
many as the number of dimensions) rather than as a multidimensional array. Each of these
tables, known as a cuboid, represents a specific view.
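
The idea of a cuboid as one aggregated view can be sketched in plain Python. The fact rows and dimension names below are invented for illustration. The sketch enumerates every subset of three dimensions, which gives 2^3 = 8 candidate cuboids; how many of them a real ROLAP system stores as tables is a design decision that this example does not model.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("time", "item", "location")

# Invented base facts: (time, item, location, units_sold).
facts = [
    ("Q1", "Car", "Delhi", 10),
    ("Q1", "Bus", "Delhi", 4),
    ("Q2", "Car", "Kolkata", 7),
]


def cuboid(group_by):
    """Aggregate units_sold over the chosen subset of dimensions."""
    positions = [DIMENSIONS.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[p] for p in positions)
        totals[key] += row[3]
    return dict(totals)


# One cuboid per subset of dimensions, from the grand total () to the base cuboid.
cuboids = {
    dims: cuboid(dims)
    for size in range(len(DIMENSIONS) + 1)
    for dims in combinations(DIMENSIONS, size)
}

print(len(cuboids))         # 8
print(cuboids[("time",)])   # {('Q1',): 14, ('Q2',): 7}
print(cuboids[()])          # {(): 21}
```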
2.4 Data Warehouse Schemas:
A schema is a logical description of the entire database. In a data warehouse, it includes
the name and description of records, along with all data items and the different
aggregates associated with the data. Just as a database has a schema, a data warehouse
must maintain one as well. Different schemas are used depending on the setup and the
data maintained in the data warehouse.
Types of Data Warehouse Schemas:
a) Star Schema
• Each dimension in a star schema is represented by only one dimension table.
• This dimension table contains the set of attributes.
• The diagram for this schema shows the sales data of a company with respect to
four dimensions, namely time, item, branch, and location.
• There is a fact table at the center. It contains the keys to each of the four
dimensions.
• The fact table also contains the measures, namely dollars sold and units sold.

Note: Each dimension has only one dimension table, and each table holds a set
of attributes. For example, the location dimension table contains the attribute
set {location_key, street, city, province_or_state, country}. This constraint
may cause data redundancy. For example, "Vancouver" and "Victoria" are both
cities in the Canadian province of British Columbia, so the entries for these
cities repeat the same values for the attributes province_or_state and country.
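
A star schema along these lines can be sketched with Python's built-in sqlite3 module. The table names, the reduced attribute lists for time, item, and branch, and the sample rows are assumptions made for the example; only the location attributes and the two measures come from the text.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           province_or_state TEXT, country TEXT);
-- Fact table: one key per dimension plus the two measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim,
    item_key     INTEGER REFERENCES item_dim,
    branch_key   INTEGER REFERENCES branch_dim,
    location_key INTEGER REFERENCES location_dim,
    dollars_sold REAL,
    units_sold   INTEGER
);
INSERT INTO time_dim VALUES (1, 'Q1', 2023);
INSERT INTO item_dim VALUES (1, 'Laptop', 'Acme');
INSERT INTO branch_dim VALUES (1, 'Downtown');
-- province_or_state and country repeat for every city in the same province.
INSERT INTO location_dim VALUES (1, '1 Main St', 'Vancouver', 'British Columbia', 'Canada');
INSERT INTO location_dim VALUES (2, '9 Bay St', 'Victoria', 'British Columbia', 'Canada');
INSERT INTO sales_fact VALUES (1, 1, 1, 1, 2000.0, 2);
INSERT INTO sales_fact VALUES (1, 1, 1, 2, 1000.0, 1);
""")

# A typical star query: join the fact table to one dimension and aggregate.
by_province = db.execute("""
    SELECT l.province_or_state, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN location_dim AS l ON l.location_key = f.location_key
    GROUP BY l.province_or_state
""").fetchall()
print(by_province)
```

Each query needs only one join per dimension, at the cost of the repeated province and country values in location_dim.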
b) Snowflake Schema

• Some dimension tables in the snowflake schema are normalized.
• The normalization splits the data into additional tables.
• Unlike in the star schema, the dimension tables in a snowflake schema are
normalized. For example, the item dimension table of the star schema is
normalized and split into two dimension tables, namely the item and supplier
tables.
• The item dimension table now contains the attributes item_key, item_name,
type, brand, and supplier_key.
• The supplier key is linked to the supplier dimension table. The supplier
dimension table contains the attributes supplier_key and supplier_type.

Note: Normalization in the snowflake schema reduces redundancy, so the schema
is easier to maintain and saves storage space.
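
The item and supplier split can be sketched in the same way, again with sqlite3. The attribute names follow the text; the sample rows are invented. The sketch shows the trade-off: supplier_type is stored once, and reading it back costs one extra join.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item (
    item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT, brand TEXT,
    supplier_key INTEGER REFERENCES supplier
);
INSERT INTO supplier VALUES (1, 'wholesale');
INSERT INTO item VALUES (10, 'Laptop', 'computer', 'Acme', 1);
INSERT INTO item VALUES (11, 'Tablet', 'computer', 'Acme', 1);
""")

# supplier_type lives in one row; items reach it through supplier_key.
rows = db.execute("""
    SELECT i.item_name, s.supplier_type
    FROM item AS i
    JOIN supplier AS s ON s.supplier_key = i.supplier_key
    ORDER BY i.item_key
""").fetchall()
supplier_rows = db.execute("SELECT COUNT(*) FROM supplier").fetchone()[0]
print(rows, supplier_rows)
```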
c) Fact Constellation Schema (Galaxy)

• A fact constellation has multiple fact tables. It is also known as a galaxy
schema.
• The diagram for this schema shows two fact tables, namely sales and shipping.
• The sales fact table is the same as that in the star schema.
• The shipping fact table has five dimensions, namely item_key, time_key,
shipper_key, from_location, and to_location.
• The shipping fact table also contains two measures, namely dollars cost and
units shipped.
• Dimension tables can also be shared between fact tables. For example, the
time and item dimension tables are shared between the sales and shipping fact tables.
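
Sharing a dimension between two fact tables can be sketched as below. The tables are cut down to a few columns, and the column names units_sold and units_shipped as well as the rows are assumptions for the example. Each fact table is aggregated separately and the two results are then lined up through the shared time dimension, which avoids double counting when the fact tables have different numbers of rows per period.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE time_dim      (time_key INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE sales_fact    (time_key INTEGER, item_key INTEGER, units_sold INTEGER);
CREATE TABLE shipping_fact (time_key INTEGER, item_key INTEGER, units_shipped INTEGER);
INSERT INTO time_dim VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO sales_fact VALUES (1, 100, 5), (1, 101, 3), (2, 100, 4);
INSERT INTO shipping_fact VALUES (1, 100, 5), (2, 100, 2), (2, 101, 3);
""")

# Aggregate each fact table on its own, then join both results to time_dim.
report = db.execute("""
    SELECT t.quarter, s.sold, h.shipped
    FROM time_dim AS t
    JOIN (SELECT time_key, SUM(units_sold) AS sold
          FROM sales_fact GROUP BY time_key) AS s ON s.time_key = t.time_key
    JOIN (SELECT time_key, SUM(units_shipped) AS shipped
          FROM shipping_fact GROUP BY time_key) AS h ON h.time_key = t.time_key
    ORDER BY t.time_key
""").fetchall()
print(report)
```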
2.5 OLAP Operations:
OLAP (Online Analytical Processing) is a software technology that allows users to analyze
information from multiple database systems at the same time. It is based on the
multidimensional data model and allows the user to query multidimensional data (e.g.
Delhi -> 2018 -> Sales data). OLAP databases are divided into one or more cubes, and these
cubes are known as hypercubes.
OLAP Operations:
a) Drill Down

In the drill-down operation, less detailed data is converted into more detailed data.
It can be done by:
• Moving down in the concept hierarchy
• Adding a new dimension
In the cube given in the overview section, the drill-down operation is performed by
moving down in the concept hierarchy of the Time dimension (Quarter -> Month).
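
A drill-down from Quarter to Month can be sketched with plain dictionaries. The month-to-quarter mapping and the sales figures are invented; the sketch assumes the month-level detail is still available, since a drill-down cannot recover detail that was never stored.

```python
# Hypothetical month-level sales; each month belongs to one quarter.
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}
monthly_sales = {"Jan": 30, "Feb": 45, "Mar": 25, "Apr": 60}

# Summary view at the Quarter level.
quarterly_sales = {}
for month, units in monthly_sales.items():
    quarter = MONTH_TO_QUARTER[month]
    quarterly_sales[quarter] = quarterly_sales.get(quarter, 0) + units


def drill_down(quarter):
    """Return the month-level cells that make up one quarter's total."""
    return {m: u for m, u in monthly_sales.items() if MONTH_TO_QUARTER[m] == quarter}


print(quarterly_sales)   # {'Q1': 100, 'Q2': 60}
print(drill_down("Q1"))  # {'Jan': 30, 'Feb': 45, 'Mar': 25}
```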
b) Roll up
Roll-up is the opposite of the drill-down operation. It performs aggregation on the OLAP
cube. It can be done by:
• Climbing up in the concept hierarchy
• Reducing the dimensions
In the cube given in the overview section, the roll-up operation is performed by
climbing up in the concept hierarchy of the Location dimension (City -> Country).
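
Both ways of rolling up can be sketched as small aggregation functions. The city-to-country mapping and the cell values are assumptions for the example.

```python
from collections import defaultdict

# Assumed concept hierarchy for the Location dimension: City -> Country.
CITY_TO_COUNTRY = {"Delhi": "India", "Kolkata": "India", "Toronto": "Canada"}

# Invented city-level cells: (city, quarter) -> units sold.
city_sales = {
    ("Delhi", "Q1"): 40,
    ("Kolkata", "Q1"): 25,
    ("Toronto", "Q1"): 30,
    ("Delhi", "Q2"): 50,
}


def roll_up_to_country(cells):
    """Climb the hierarchy: replace each city by its country and sum."""
    totals = defaultdict(int)
    for (city, quarter), units in cells.items():
        totals[(CITY_TO_COUNTRY[city], quarter)] += units
    return dict(totals)


def roll_up_drop_time(cells):
    """Reduce the dimensions: sum over Time so only Location remains."""
    totals = defaultdict(int)
    for (city, _quarter), units in cells.items():
        totals[city] += units
    return dict(totals)


country_sales = roll_up_to_country(city_sales)
city_totals = roll_up_drop_time(city_sales)
print(country_sales)
print(city_totals)
```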

c) Dice
Dice selects a sub-cube from the OLAP cube by selecting on two or more dimensions. In
the cube given in the overview section, a sub-cube is selected using the following
dimensions and criteria:
• Location = “Delhi” or “Kolkata”
• Time = “Q1” or “Q2”
• Item = “Car” or “Bus”
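
The three criteria above amount to a filter on three dimensions at once. A minimal sketch, with invented cells stored as (location, time, item, units) tuples:

```python
# Invented cube cells: (location, time, item, units_sold).
cube = [
    ("Delhi", "Q1", "Car", 12),
    ("Delhi", "Q3", "Car", 9),
    ("Kolkata", "Q2", "Bus", 5),
    ("Mumbai", "Q1", "Car", 20),
    ("Kolkata", "Q1", "Truck", 7),
]


def dice(cells, locations, times, items):
    """Keep only the cells whose value on every listed dimension is allowed."""
    return [
        cell for cell in cells
        if cell[0] in locations and cell[1] in times and cell[2] in items
    ]


sub_cube = dice(cube, {"Delhi", "Kolkata"}, {"Q1", "Q2"}, {"Car", "Bus"})
print(sub_cube)  # [('Delhi', 'Q1', 'Car', 12), ('Kolkata', 'Q2', 'Bus', 5)]
```

The result keeps all three dimensions; only the range of values on each one shrinks.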
d) Slice

Slice selects a single dimension from the OLAP cube, which results in a new sub-cube.
In the cube given in the overview section, slice is performed on the dimension
Time = “Q1”.
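
A slice fixes one dimension to a single value, so the result has one dimension fewer. A minimal sketch with invented cells keyed by (location, time, item):

```python
# Invented cube: (location, time, item) -> units sold.
cube = {
    ("Delhi", "Q1", "Car"): 12,
    ("Delhi", "Q2", "Car"): 9,
    ("Kolkata", "Q1", "Bus"): 5,
}


def slice_on_time(cells, time_value):
    """Fix Time to one value and drop it from the key."""
    return {
        (location, item): units
        for (location, time, item), units in cells.items()
        if time == time_value
    }


q1_slice = slice_on_time(cube, "Q1")
print(q1_slice)  # {('Delhi', 'Car'): 12, ('Kolkata', 'Bus'): 5}
```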

e) Pivot
Pivot is also known as the rotation operation, as it rotates the current view to give a
new view of the representation. In the sub-cube obtained after the slice operation,
performing the pivot operation gives a new view of it.
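
For a two-dimensional view, a pivot can be sketched as swapping the row and column axes. The labels and values below are invented, and real OLAP tools offer richer rotations than this simple transpose.

```python
# Invented 2-D view: rows are locations, columns are items.
locations = ["Delhi", "Kolkata"]
items = ["Car", "Bus", "Truck"]
units = [
    [12, 3, 0],  # Delhi
    [8, 5, 7],   # Kolkata
]


def pivot(row_labels, col_labels, values):
    """Rotate the view: former columns become rows and vice versa."""
    rotated = [list(column) for column in zip(*values)]
    return col_labels, row_labels, rotated


new_rows, new_cols, rotated = pivot(locations, items, units)
print(new_rows)  # ['Car', 'Bus', 'Truck']
print(rotated)   # [[12, 8], [3, 5], [0, 7]]
```

The data is unchanged; only its presentation differs.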
2.6 OLAP Servers:
An Online Analytical Processing (OLAP) server is based on the multidimensional data model.
It allows managers and analysts to gain insight into information through fast,
consistent, and interactive access.
There are different types of OLAP Servers:
a) ROLAP
Relational On-Line Analytical Processing (ROLAP) is primarily used for data stored
in a relational database, where both the base data and dimension tables are stored as
relational tables. ROLAP servers bridge the gap between the relational back-end server
and the client's front-end tools. They store and manage warehouse data using an RDBMS,
and OLAP middleware fills in the gaps.
Benefits:
• It is compatible with data warehouses and OLTP systems.
• It is highly scalable.
• The data size limitation of ROLAP technology is determined by the
underlying RDBMS. As a result, ROLAP itself does not limit the amount of
data that can be stored.
Limitations:
• SQL functionality is constrained.
• It is difficult to keep aggregate tables up to date.
• It requires experienced personnel.
b) MOLAP

MOLAP uses array-based multidimensional storage engines for multidimensional views of
data. With multidimensional data stores, storage utilization may be low if the data set
is sparse. Therefore, many MOLAP servers use two levels of data storage representation
to handle dense and sparse data sets.
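
One way to picture a two-level representation is sketched below. It is an illustration of the idea only, not how any particular MOLAP server works: the function names, the 0.5 density threshold, and the choice to treat empty cells as 0 are all assumptions. Well-filled rows are kept as arrays for direct indexing, and sparse rows keep only their non-empty cells.

```python
def build_store(rows, width, dense_threshold=0.5):
    """Store each row as a dense list or a sparse dict, depending on its fill ratio.

    rows maps a row key to {column_index: value} for its non-empty cells.
    """
    store = {}
    for key, cells in rows.items():
        if len(cells) / width >= dense_threshold:
            dense = [0] * width
            for col, value in cells.items():
                dense[col] = value
            store[key] = ("dense", dense)
        else:
            store[key] = ("sparse", dict(cells))
    return store


def lookup(store, key, col):
    """Read one cell; empty cells are reported as 0 (an assumption of this sketch)."""
    kind, data = store[key]
    return data[col] if kind == "dense" else data.get(col, 0)


# Invented data: item -> {quarter_index: units_sold}, four quarters wide.
rows = {
    "Car": {0: 5, 1: 7, 2: 6, 3: 9},  # sold every quarter
    "Truck": {2: 1},                  # sold once
}
store = build_store(rows, width=4)
print(store["Car"][0], store["Truck"][0])
print(lookup(store, "Car", 3), lookup(store, "Truck", 0))
```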

Benefits:
• It is very easy to use.
• It is suitable for slicing and dicing operations.
• Information retrieval is fast.
• It is capable of performing complex calculations.
Limitations:
• It is difficult to change the dimensions without re-aggregating.
• DBMS facility is weak.
• Since all calculations are performed when the cube is built, a large amount of
data cannot be stored in the cube itself.
c) HOLAP

Hybrid OLAP (HOLAP) is a combination of both ROLAP and MOLAP. It offers the higher
scalability of ROLAP and the faster computation of MOLAP. HOLAP servers can store
large volumes of detailed information, while the aggregations are stored separately in
a MOLAP store.

Benefits:
• HOLAP combines the benefits of MOLAP and ROLAP.
• It provides quick access at all aggregation levels.
Limitations:
• HOLAP architecture is extremely complex.
• There is a greater likelihood of overlap, particularly in their functionalities.
2.7 Data warehouse architecture:
• A data warehouse architecture is a method of defining the overall architecture of
data communication, processing, and presentation that exists for end-client computing
within the enterprise. Each data warehouse is different, but all are characterized by
standard vital components.

• Production applications such as payroll, accounts payable, product purchasing, and
inventory control are designed for online transaction processing (OLTP). Such
applications gather detailed data from day-to-day operations.
• Data warehouse applications are designed to support the user's ad-hoc data
requirements, an activity recently dubbed online analytical processing (OLAP).
These include applications such as forecasting, profiling, summary reporting, and
trend analysis.

• Production databases are updated continuously, either by hand or via OLTP
applications. In contrast, a warehouse database is updated from operational systems
periodically, usually during off-hours. As OLTP data accumulates in production
databases, it is regularly extracted, filtered, and then loaded into a dedicated
warehouse server that is accessible to users. As the warehouse is populated, it must
be restructured: tables are de-normalized, data is cleansed of errors and redundancies,
and new fields and keys are added to reflect the user's needs for sorting, combining,
and summarizing data.
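
The restructuring described in this step can be sketched as one small function. The order and customer records, the field names, and the rule for spotting bad rows are invented for the example; a real load job would apply the organization's own cleansing rules.

```python
# Invented operational extracts: normalized customers and orders.
customers = {
    1: {"name": "Asha", "city": "Delhi"},
    2: {"name": "Ravi", "city": "Kolkata"},
}
orders = [
    {"order_id": 100, "customer_id": 1, "amount": 250.0, "date": "2023-01-15"},
    {"order_id": 100, "customer_id": 1, "amount": 250.0, "date": "2023-01-15"},  # duplicate
    {"order_id": 101, "customer_id": 2, "amount": None, "date": "2023-02-03"},   # bad amount
    {"order_id": 102, "customer_id": 2, "amount": 90.0, "date": "2023-02-10"},
]


def prepare_for_warehouse(orders, customers):
    """De-normalize orders, drop errors and duplicates, and add a month field."""
    seen, rows = set(), []
    for order in orders:
        if order["amount"] is None or order["order_id"] in seen:
            continue  # cleanse: skip erroneous and redundant records
        seen.add(order["order_id"])
        customer = customers[order["customer_id"]]
        rows.append({
            "order_id": order["order_id"],
            "customer_name": customer["name"],  # copied in: de-normalized
            "city": customer["city"],
            "amount": order["amount"],
            "month": order["date"][:7],         # new field for summarizing by month
        })
    return rows


warehouse_rows = prepare_for_warehouse(orders, customers)
for row in warehouse_rows:
    print(row)
```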

• Metadata is a set of data that defines and gives information about other data. It
summarizes basic information about data, which can make finding and working with
particular instances of data easier. For example, author, date created, date modified,
and file size are examples of very basic document metadata.

• Data warehouses and their architectures vary depending upon the elements of an
organization's situation.
