
Unit 2: OLAP


2.1 Data Warehouse Basic Concepts


A Data Warehouse consists of data from multiple heterogeneous data sources and is
used for analytical reporting and decision making. A Data Warehouse is a central place
where data from different data sources and applications is stored.
The term Data Warehouse was coined by Bill Inmon in 1990. A Data Warehouse is
always kept separate from an Operational Database.
The data in a DW system is loaded from operational transaction systems like −

 Sales
 Marketing
 HR
 SCM, etc.
The data may pass through an operational data store or other transformations before it is
loaded into the DW system for information processing.
A Data Warehouse is used for reporting and analysis of information and stores both
historical and current data. The data in a DW system is used for analytical reporting,
which is later used by business analysts, sales managers, or knowledge workers for
decision-making.
Data flows from multiple heterogeneous data sources into the Data Warehouse.
Common data sources for a data warehouse include −

 Operational databases
 SAP and non-SAP Applications
 Flat Files (xls, csv, txt files)
Data in the data warehouse is accessed by BI (Business Intelligence) users for analytical
reporting, data mining, and analysis. Business users, sales managers, and analysts use it
for decision making and to define future strategy.

2.2 Data Warehouse Modeling: Data Cube and OLAP


Data Cube
A data cube enables data to be modeled and viewed in multiple dimensions. It is
defined by dimensions and facts. In other words, dimensions are the perspectives or
entities with respect to which an organization wants to keep records.
For instance, AllElectronics can create a sales data warehouse to keep records of the
store's sales with respect to the dimensions time, item, branch, and location. These
dimensions allow the store to keep track of things like monthly sales of items and the
branches and locations at which the items were sold.
Each dimension can have a table associated with it, called a dimension table, which
further describes the dimension. For instance, a dimension table for item can include
the attributes item_name, brand, and type. Dimension tables can be specified by users
or experts, or automatically generated and adjusted based on data distributions.
A multidimensional data model is typically organized around a central theme, such as
sales. This theme is represented by a fact table. Facts are numeric measures.
Examples of facts for a sales data warehouse include dollars_sold (sales amount in
dollars), units_sold (number of units sold), and amount_budgeted. The fact table
contains the names of the facts (or measures) and keys to each of the related
dimension tables.
A data cube is generated from a subset of attributes in the database. Specific attributes
are chosen to be measure attributes, i.e., the attributes whose values are of interest.
Other attributes are selected as dimensions or functional attributes. The measure
attributes are aggregated according to the dimensions.
Data cube techniques are interesting methods with several applications. Data cubes
can be sparse because not every cell in every dimension has corresponding data in
the database. Furthermore, if a query involves constants at levels lower than those
provided in a data cube, it is not obvious how to make the best use of the precomputed
results stored in the cube.
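
To make the fact/dimension vocabulary concrete, the following is a minimal sketch in
Python, assuming pandas and an invented sales fact table; the dimension and measure
names are illustrative, not taken from any specific system.

import pandas as pd

# Hypothetical fact table: each row records one sale with its dimension values
# (time, item, branch, location) and the measures dollars_sold and units_sold.
sales = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2"],
    "item":     ["Mobile", "Modem", "Mobile", "Modem"],
    "branch":   ["B1", "B1", "B2", "B2"],
    "location": ["Toronto", "Vancouver", "Toronto", "Vancouver"],
    "dollars_sold": [1200.0, 300.0, 1500.0, 250.0],
    "units_sold":   [4, 2, 5, 1],
})

# One cuboid of the data cube: aggregate the measures by (time, item).
by_time_item = (sales
                .groupby(["time", "item"], as_index=False)[["dollars_sold", "units_sold"]]
                .sum())
print(by_time_item)

Each distinct set of group-by dimensions corresponds to one cuboid of the cube; grouping
by all four dimensions gives the base cuboid, and aggregating over everything gives the
apex cuboid.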

OLAP
Online Analytical Processing (OLAP) servers are based on the multidimensional data
model. They allow managers and analysts to gain insight into information through fast,
consistent, and interactive access to it. This section covers the types of OLAP servers,
OLAP operations, and the differences between OLAP, statistical databases, and OLTP.

Types of OLAP Servers


We have four types of OLAP servers −

 Relational OLAP (ROLAP)


 Multidimensional OLAP (MOLAP)
 Hybrid OLAP (HOLAP)
 Specialized SQL Servers

Relational OLAP
ROLAP servers are placed between a relational back-end server and client front-end
tools. To store and manage warehouse data, ROLAP uses a relational or extended-
relational DBMS.
ROLAP includes the following −

 Implementation of aggregation navigation logic.


 Optimization for each DBMS back end.
 Additional tools and services.

Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views
of data. With multidimensional data stores, storage utilization may be low if the data
set is sparse. Therefore, many MOLAP servers use two levels of data storage
representation to handle dense and sparse data sets.

Hybrid OLAP
Hybrid OLAP is a combination of ROLAP and MOLAP. It offers the higher scalability of
ROLAP and the faster computation of MOLAP. HOLAP servers allow large volumes of
detailed data to be stored, while the aggregations are kept separately in a MOLAP
store.

OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will discuss OLAP
operations in multidimensional data.
Here is the list of OLAP operations −

 Roll-up
 Drill-down
 Slice and dice
 Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −

 By climbing up a concept hierarchy for a dimension


 By dimension reduction
The following diagram illustrates how roll-up works.
 Roll-up is performed by climbing up a concept hierarchy for the dimension
location.
 Initially the concept hierarchy was "street < city < province < country".
 On rolling up, the data is aggregated by ascending the location hierarchy from the
level of city to the level of country.
 The data is grouped into countries rather than cities.
 When roll-up is performed by dimension reduction, one or more dimensions are
removed from the data cube. (A minimal roll-up sketch follows this list.)
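
The following is a minimal sketch of roll-up by climbing a concept hierarchy, assuming
pandas and an invented city-to-country mapping; the data and mapping are illustrative
only.

import pandas as pd

# Hypothetical detailed data at the city level of the location hierarchy.
sales = pd.DataFrame({
    "city": ["Toronto", "Vancouver", "Chicago", "New York"],
    "dollars_sold": [1200.0, 800.0, 950.0, 1700.0],
})

# One step of the concept hierarchy: city -> country (illustrative mapping).
city_to_country = {"Toronto": "Canada", "Vancouver": "Canada",
                   "Chicago": "USA", "New York": "USA"}

# Roll-up: climb the hierarchy from city to country and re-aggregate the measure.
rolled_up = (sales
             .assign(country=sales["city"].map(city_to_country))
             .groupby("country", as_index=False)["dollars_sold"]
             .sum())
print(rolled_up)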
Drill-down
Drill-down is the reverse operation of roll-up. It can be performed in either of the
following ways −

 By stepping down a concept hierarchy for a dimension


 By introducing a new dimension.
The following diagram illustrates how drill-down works −

 Drill-down is performed by stepping down a concept hierarchy for the dimension
time.
 Initially the concept hierarchy was "day < month < quarter < year".
 On drilling down, the time dimension is descended from the level of quarter to the
level of month.
 When drill-down is performed, one or more dimensions are added to the data
cube.
 It navigates from less detailed data to more detailed data.
Slice
The slice operation selects one particular dimension from a given cube and provides a
new sub-cube. Consider the following diagram that shows how slice works.

 Here slice is performed for the dimension "time" using the criterion time = "Q1".
 It forms a new sub-cube by selecting along that single dimension.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Consider the following diagram that shows the dice operation.
The dice operation on the cube is based on the following selection criteria, which
involve three dimensions.

 (location = "Toronto" or "Vancouver")
 (time = "Q1" or "Q2")
 (item = "Mobile" or "Modem")
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to
provide an alternative presentation of data. Consider the following diagram that shows
the pivot operation.
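
The slice, dice, and pivot operations can be illustrated with a small sketch, again
assuming pandas and invented cube data; the dimension values mirror the criteria above.

import pandas as pd

# Hypothetical cube data with dimensions time, item, location and measure dollars_sold.
cube = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":     ["Mobile", "Modem", "Mobile", "Modem", "Mobile", "Modem"],
    "location": ["Toronto", "Toronto", "Vancouver", "Vancouver", "Chicago", "Chicago"],
    "dollars_sold": [600.0, 150.0, 700.0, 100.0, 400.0, 120.0],
})

# Slice: select along one dimension, time = "Q1".
slice_q1 = cube[cube["time"] == "Q1"]

# Dice: select along several dimensions at once.
dice = cube[cube["location"].isin(["Toronto", "Vancouver"])
            & cube["time"].isin(["Q1", "Q2"])
            & cube["item"].isin(["Mobile", "Modem"])]

# Pivot (rotate): change which dimensions appear on the rows and columns.
pivoted = dice.pivot_table(index="item", columns="location",
                           values="dollars_sold", aggfunc="sum")
print(slice_q1, dice, pivoted, sep="\n\n")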
2.3 Data Warehouse Design and Usage
A data warehouse can be built using three approaches −
 A top-down approach
 A bottom-up approach
 A combination of both approaches
The top-down approach starts with overall design and planning. It is useful in cases
where the technology is mature and well known, and where the business problems that
must be solved are clear and well understood.
The bottom-up approach starts with experiments and prototypes. This is useful in the
early stages of business modeling and technology development. It enables an
organization to move forward at considerably less expense and to evaluate the benefits
of the technology before making significant commitments.
In the combined approach, an organization can exploit the planned and strategic nature
of the top-down approach while retaining the rapid implementation and opportunistic
application of the bottom-up approach.
In general, the warehouse design process consists of the following steps −
 Choose a business process to model, e.g., orders, invoices, shipments,
inventory, account administration, sales, or the general ledger. If the business
process is organizational and involves multiple complex object collections, a data
warehouse model should be followed. If the process is departmental and focuses
on the analysis of one kind of business process, a data mart model should be
chosen.
 Choose the grain of the business process. The grain is the fundamental,
atomic level of data to be represented in the fact table for this process, e.g.,
individual transactions, individual daily snapshots, and so on.
 Choose the dimensions that will apply to each fact table record. Typical
dimensions are time, item, customer, supplier, warehouse, transaction type, and
status.
 Choose the measures that will populate each fact table record. Typical
measures are numeric additive quantities like dollars_sold and units_sold. (A
minimal schema sketch follows this list.)
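
As a minimal sketch of these design choices, the following assumes pandas and an
invented sales star schema (a transaction-grain fact table plus time and item dimension
tables); all table and column names are illustrative.

import pandas as pd

# Illustrative dimension tables for a sales process.
dim_time = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"], "year": [2023, 2023]})
dim_item = pd.DataFrame({"item_key": [10, 11], "item_name": ["Mobile", "Modem"],
                         "brand": ["A", "B"], "type": ["phone", "network"]})

# Fact table at transaction grain: foreign keys to the dimensions plus additive measures.
fact_sales = pd.DataFrame({
    "time_key": [1, 1, 2],
    "item_key": [10, 11, 10],
    "dollars_sold": [600.0, 150.0, 700.0],
    "units_sold": [2, 1, 3],
})

# Answering an analytical question means joining facts to dimensions, then aggregating.
report = (fact_sales
          .merge(dim_time, on="time_key")
          .merge(dim_item, on="item_key")
          .groupby(["quarter", "item_name"], as_index=False)[["dollars_sold", "units_sold"]]
          .sum())
print(report)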
Once a data warehouse is designed and constructed, its initial deployment includes
installation, rollout planning, training, and orientation. Platform upgrades and
maintenance must also be considered.
Data warehouse administration includes data refreshment, data source
synchronization, planning for disaster recovery, managing access control and security,
managing data growth, managing database performance, and data warehouse
enhancement and extension.
Scope management includes controlling the number and range of queries, dimensions,
and reports; limiting the size of the data warehouse; or limiting the schedule, budget,
or resources.
Various kinds of data warehouse design tools are available. Data warehouse
development tools provide functions to define and edit metadata repository contents
(such as schemas, scripts, or rules), answer queries, output reports, and ship metadata
to and from relational database system catalogs.
Planning and analysis tools study the impact of schema changes and refresh
performance when changing refresh rates or time windows.

2.4 Data Warehouse Implementation


Data warehouses contain huge volumes of data. OLAP servers demand that decision
support queries be answered in the order of seconds. Thus, it is essential for data
warehouse systems to provide highly efficient cube computation techniques, access
methods, and query processing techniques.

Efficient Computation of Data Cubes


At the core of multidimensional data analysis is the efficient computation of
aggregations across many sets of dimensions. In SQL terms, these aggregations are
referred to as group-by’s. Each group-by can be represented by a cuboid, where the
set of group-by’s forms a lattice of cuboids defining a data cube.
There are three choices for data cube materialization given a base cuboid −
 No materialization − Do not precompute any of the "nonbase" cuboids. This
leads to computing expensive multidimensional aggregates on the fly, which can
be extremely slow.
 Full materialization − Precompute all of the cuboids. The resulting lattice of
computed cuboids is referred to as the full cube. This choice typically requires
huge amounts of memory space to store all of the precomputed cuboids.
 Partial materialization − Selectively compute a proper subset of the whole set
of possible cuboids. Alternatively, compute a subset of the cube that includes
only those cells satisfying some user-specified criterion, such as requiring that
the tuple count of each cell be above some threshold. (A minimal sketch follows
this list.)
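
The following sketch contrasts full and partial materialization, assuming pandas and an
invented three-dimensional base cuboid. Note that the "iceberg" filter here is applied
after computing the full cube, purely to illustrate the condition; it is not the efficient
pruned computation discussed in Section 2.6.

import itertools
import pandas as pd

# Hypothetical base data with three dimensions and one measure.
base = pd.DataFrame({
    "time": ["Q1", "Q1", "Q2"], "item": ["Mobile", "Modem", "Mobile"],
    "location": ["Toronto", "Toronto", "Vancouver"], "units_sold": [2, 1, 3],
})
dims = ["time", "item", "location"]

# Full materialization: compute every cuboid in the lattice (all subsets of dimensions).
full_cube = {}
for k in range(len(dims) + 1):
    for group in itertools.combinations(dims, k):
        if group:
            full_cube[group] = base.groupby(list(group), as_index=False)["units_sold"].sum()
        else:
            full_cube[group] = base[["units_sold"]].sum().to_frame().T  # apex cuboid

# Partial (iceberg-style) materialization: keep only cells meeting a minimum support.
min_units = 3  # assumed user-specified threshold
iceberg = {g: df[df["units_sold"] >= min_units] for g, df in full_cube.items()}
print(iceberg[("time",)])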

Indexing OLAP Data


To support efficient data access, some data warehouse systems provide index
structures and materialized views (using cuboids). The bitmap indexing approach is
popular in OLAP products because it allows quick searching in data cubes. The bitmap
index is an alternative representation of the record ID (RID) list.
In the bitmap index for a given attribute, there is a distinct bit vector, Bv, for each value
v in the domain of the attribute. If the domain of a given attribute includes n values, then
n bits are required for each entry in the bitmap index (i.e., there are n bit vectors). If the
attribute has the value v for a given row in the data table, then the bit defining that value
is set to 1 in the corresponding row of the bitmap index. All other bits for that row are
set to 0.
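
A minimal sketch of such a bitmap index, assuming an invented attribute column; the
values and table are illustrative.

# Attribute values per record, in RID order (hypothetical item attribute).
rows = ["Mobile", "Modem", "Mobile", "TV"]
domain = sorted(set(rows))          # distinct values of the attribute

# One bit vector Bv per value v: bit i is 1 if row i has value v, else 0.
bitmap_index = {v: [1 if r == v else 0 for r in rows] for v in domain}
print(bitmap_index)
# {'Mobile': [1, 0, 1, 0], 'Modem': [0, 1, 0, 0], 'TV': [0, 0, 0, 1]}

# Selections and their combinations reduce to fast bitwise operations on the vectors.
mobile_or_modem = [a | b for a, b in zip(bitmap_index["Mobile"], bitmap_index["Modem"])]
print(mobile_or_modem)   # [1, 1, 1, 0]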

Efficient Processing of OLAP Queries


The goal of materializing cuboids and constructing OLAP index structures is to speed
up query processing in data cubes. Given materialized views, query processing
proceeds as follows −
 Determine which operations should be performed on the available
cuboids − This involves transforming any selection, projection, roll-up (group-
by), and drill-down operations specified in the query into corresponding SQL
and/or OLAP operations. For instance, slicing and dicing a data cube may
correspond to selection and/or projection operations on a materialized cuboid.
 Determine to which materialized cuboid(s) the relevant operations should
be applied − This involves identifying all of the materialized cuboids that could
potentially be used to answer the query, pruning that set using knowledge of
"dominance" relationships among the cuboids, estimating the cost of using each
remaining materialized cuboid, and selecting the cuboid with the least cost. (A
minimal cuboid-selection sketch follows this list.)
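
The cuboid-selection step can be sketched as follows, assuming invented cuboid
descriptions and using estimated row counts as a crude cost measure.

# Materialized cuboids described by their dimension sets and estimated sizes (rows).
materialized = {
    frozenset({"time", "item", "location"}): 120_000,
    frozenset({"time", "item"}): 8_000,
    frozenset({"item", "location"}): 15_000,
}

def choose_cuboid(query_dims, cuboids):
    """Return the cheapest cuboid that 'dominates' the query, i.e., whose
    dimension set is a superset of the query's group-by dimensions."""
    candidates = [(size, dims) for dims, size in cuboids.items()
                  if dims >= frozenset(query_dims)]
    if not candidates:
        return None  # fall back to the base cuboid / fact table
    return min(candidates, key=lambda c: c[0])[1]

# Smallest materialized cuboid that can answer a group-by on item alone.
print(choose_cuboid({"item"}, materialized))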
2.5 Data Generalization by Attribute-Oriented Induction
AOI stands for Attribute-Oriented Induction. The attribute-oriented induction approach
to concept description was first proposed in 1989, a few years before the introduction of
the data cube approach. The data cube approach is essentially based on materialized
views of the data, which typically have been pre-computed in a data warehouse.
The data cube approach, in general, performs off-line aggregation before an OLAP or
data mining query is submitted for processing. The attribute-oriented induction
approach, in contrast, is essentially a query-oriented, generalization-based, on-line data
analysis technique.
The general idea of attribute-oriented induction is to first collect the task-relevant data
using a database query and then perform generalization based on the examination of
the number of distinct values of each attribute in the relevant collection of data.
The generalization is performed by attribute removal or attribute generalization.
Aggregation is performed by merging identical generalized tuples and accumulating
their respective counts. This reduces the size of the generalized data set. The resulting
generalized relation can be mapped into different forms for presentation to the user,
such as charts or rules.
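
The following is a minimal sketch of this idea, assuming pandas, an invented
task-relevant relation, an invented generalization threshold, and illustrative concept
hierarchies (city to country, age to age group).

import pandas as pd

data = pd.DataFrame({
    "name": ["Ann", "Bob", "Cui", "Dev"],   # many distinct values, no higher-level concept
    "city": ["Toronto", "Vancouver", "Chicago", "Boston"],
    "age":  [21, 23, 34, 37],
})
threshold = 3  # assumed maximum distinct values allowed per generalized attribute

city_to_country = {"Toronto": "Canada", "Vancouver": "Canada",
                   "Chicago": "USA", "Boston": "USA"}

def age_to_group(a):
    return "20-29" if a < 30 else "30-39"

gen = data.copy()
# Attribute removal: drop attributes with many distinct values and no concept hierarchy.
gen = gen.drop(columns=["name"])
# Attribute generalization: climb a concept hierarchy until the threshold is met.
if gen["city"].nunique() > threshold:
    gen["city"] = gen["city"].map(city_to_country)
if gen["age"].nunique() > threshold:
    gen["age"] = gen["age"].map(age_to_group)
# Aggregation: merge identical generalized tuples and accumulate their counts.
prime_relation = gen.groupby(["city", "age"]).size().reset_index(name="count")
print(prime_relation)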
The process of attribute-oriented induction is as follows −
 First, data focusing should be performed before attribute-oriented induction. This
step corresponds to the specification of the task-relevant data (i.e., data for
analysis). The data are collected based on the information provided in the data
mining query.
 Because a data mining query is usually relevant to only a portion of the database,
selecting the relevant set of data not only makes mining more efficient, but also
yields more meaningful results than mining the entire database.
 Specifying the set of relevant attributes (i.e., attributes for mining, as indicated in
DMQL with the in relevance to clause) may be difficult for the user. A user may
select only a few attributes that he or she considers important, while missing
others that could also play a role in the description.
 For example, suppose that the dimension birth_place is defined by the attributes
city, province_or_state, and country. To allow generalization on the birth_place
dimension, the other attributes defining this dimension should also be included.
 In other words, having the system automatically include province_or_state and
country as relevant attributes allows city to be generalized to these higher
conceptual levels during the induction process.
 At the other extreme, suppose that the user may have introduced too many
attributes by specifying all of the possible attributes with the clause “in relevance
to *”. In this case, all of the attributes in the relation specified by the from clause
would be included in the analysis.
 Some attributes are unlikely to contribute to an interesting representation. A
correlation-based or entropy-based analysis method can be used to perform
attribute relevance analysis and filter out statistically irrelevant or weakly relevant
attributes from the descriptive mining process.
2.6 Data Cube Computation
The multi-way array aggregation method computes a full data cube by using a
multidimensional array as its basic data structure: (1) partition the array into chunks,
and (2) compute aggregates by visiting (i.e., accessing the values at) the cube cells.
The advantage is that queries run on the resulting cube are very fast.
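A minimal sketch of the idea, assuming NumPy and a small invented dense cube; the
chunk sizes and dimensions are illustrative. Each chunk is visited once while several
lower-dimensional aggregates are updated simultaneously.

import numpy as np

# Dense cube with dimensions (time, item, location), holding an illustrative measure.
cube = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
chunk = (1, 3, 2)   # assumed chunk sizes along each dimension

# Aggregates for the three 2-D cuboids, accumulated while scanning the chunks.
agg_time_item = np.zeros((2, 3))      # sum over location
agg_time_loc  = np.zeros((2, 4))      # sum over item
agg_item_loc  = np.zeros((3, 4))      # sum over time

for t0 in range(0, 2, chunk[0]):
    for i0 in range(0, 3, chunk[1]):
        for l0 in range(0, 4, chunk[2]):
            block = cube[t0:t0+chunk[0], i0:i0+chunk[1], l0:l0+chunk[2]]
            # Each chunk is read once; all three aggregates are updated from it.
            agg_time_item[t0:t0+chunk[0], i0:i0+chunk[1]] += block.sum(axis=2)
            agg_time_loc[t0:t0+chunk[0], l0:l0+chunk[2]]  += block.sum(axis=1)
            agg_item_loc[i0:i0+chunk[1], l0:l0+chunk[2]]  += block.sum(axis=0)

print(agg_time_item)   # equals cube.sum(axis=2)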

The following are general optimization techniques for the efficient computation of data
cubes −
Sorting, hashing, and grouping − Sorting, hashing, and grouping operations should be
applied to the dimension attributes to reorder and cluster related tuples. In cube
computation, aggregation is performed on the tuples that share the same set of
dimension values. Therefore, it is important to exploit sorting, hashing, and grouping
operations to access and group such data together to facilitate the computation of such
aggregates.
For example, to compute total sales by branch, day, and item, it can be more efficient
to sort tuples or cells by branch, then by day, and then group them according to the
item name. Efficient implementations of such operations in large data sets have been
extensively studied in the database research community.
Such implementations can be extended to data cube computation. This technique can
also be further extended to perform shared-sorts (i.e., sharing sorting costs across
multiple cuboids when sort-based methods are used) or shared-partitions (i.e., sharing
the partitioning cost across multiple cuboids when hash-based algorithms are used).
Simultaneous aggregation and caching of intermediate results − In cube
computation, it is efficient to compute higher-level aggregates from previously
computed lower-level aggregates, rather than from the base fact table. Moreover,
simultaneous aggregation from cached intermediate computation results may lead to a
reduction in expensive disk input/output (I/O) operations.
To compute sales by branch, for instance, we can use the intermediate results derived
from the computation of a lower-level cuboid such as sales by branch and day. This
technique can be further extended to perform amortized scans (i.e., computing as many
cuboids as possible at the same time to amortize disk reads).
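A minimal sketch of reusing a cached lower-level cuboid, assuming pandas and an
invented sales-by-branch-and-day intermediate result.

import pandas as pd

# Hypothetical lower-level cuboid (sales by branch and day), cached from an earlier step.
sales_by_branch_day = pd.DataFrame({
    "branch": ["B1", "B1", "B2", "B2"],
    "day":    ["2023-01-01", "2023-01-02", "2023-01-01", "2023-01-02"],
    "dollars_sold": [500.0, 300.0, 450.0, 700.0],
})

# Higher-level cuboid (sales by branch) computed from the cached intermediate result,
# rather than by rescanning the (much larger) base fact table.
sales_by_branch = (sales_by_branch_day
                   .groupby("branch", as_index=False)["dollars_sold"]
                   .sum())
print(sales_by_branch)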
Aggregation from the smallest child when there exist multiple child cuboids −
When there exist multiple child cuboids, it is usually more efficient to compute the
desired parent (i.e., more generalized) cuboid from the smallest previously computed
child cuboid.
The Apriori pruning method can be explored to compute iceberg cubes
efficiently − The Apriori property, in the context of data cubes, is stated as follows: if a
given cell does not satisfy minimum support, then no descendant of the cell (i.e., a
more specialized cell) will satisfy minimum support either. This property can be used to
substantially reduce the computation of iceberg cubes.
The specification of an iceberg cube includes an iceberg condition, which is a constraint
on the cells to be materialized. A common iceberg condition is that the cells must satisfy
a minimum support threshold, such as a minimum count or sum. In this case, the Apriori
property can be used to prune away the exploration of a cell's descendants. (A minimal
pruning sketch follows.)
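
The following is a minimal sketch of a coarse, cuboid-level version of this pruning,
assuming pandas, an invented fact table, and a minimum-count iceberg condition: a
cuboid is computed only if every lower-dimensional ancestor cuboid produced at least
one qualifying cell.

from itertools import combinations
import pandas as pd

facts = pd.DataFrame({
    "item":     ["Mobile", "Mobile", "Modem", "Mobile", "Modem"],
    "location": ["Toronto", "Toronto", "Toronto", "Vancouver", "Vancouver"],
    "time":     ["Q1", "Q1", "Q2", "Q1", "Q2"],
})
min_count = 2           # assumed iceberg condition: count >= 2
dims = ["item", "location", "time"]

iceberg_cells = {}
for k in range(1, len(dims) + 1):
    for group in combinations(dims, k):
        # Apriori-style pruning: if some (k-1)-dimensional ancestor cuboid has no
        # qualifying cells, no cell of this cuboid can satisfy the threshold either.
        parents = list(combinations(group, k - 1)) if k > 1 else []
        if parents and not all(len(iceberg_cells[p]) > 0 for p in parents):
            iceberg_cells[group] = pd.DataFrame()
            continue
        counts = facts.groupby(list(group)).size().reset_index(name="count")
        iceberg_cells[group] = counts[counts["count"] >= min_count]

print(iceberg_cells[("item", "location")])   # only the qualifying (Mobile, Toronto) cell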
