Data Warehouse Unit-2
A data warehouse is a collection of data marts representing historical data from different
operations in the company. It collects data from multiple heterogeneous sources such as
relational databases and flat or text files, and typically stores 5 to 10 years of data, often in
huge volumes. This data is stored in a structure optimized for querying and data analysis.
Listed below are some of the major differences between data warehouses and databases:
● A database is mostly utilized and built for recording data. A data warehouse, in
contrast, is useful for data analysis. The data warehouse is used for large analytical
queries, whereas databases are often geared for read-write operations when it comes to
single-point transactions.
● The database is basically a collection of application-oriented data. The data
warehouse, in contrast, is organized around subjects. While databases are often
confined to single applications and target just a single process at a time, data
warehouses store data from any number of applications and can cover as many
processes as needed.
● Another distinction between data warehouses and databases is that the latter is a
real-time data supplier, while the former acts as a source of historical data that may
be conveniently accessed for decision-making and analysis.
| Operational Systems | Data Warehousing Systems |
| --- | --- |
| Operational systems are designed to support high-volume transaction processing. | Data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP). |
| Operational systems are usually concerned with current data. | Data warehousing systems are usually concerned with historical data. |
| Data within operational systems is mainly updated regularly according to need. | Non-volatile: new data may be added regularly, but once added it is rarely changed. |
| Operational systems are designed for real-time business dealings and processes. | Data warehousing systems are designed for analysis of business measures by subject area, categories, and attributes. |
| Operational systems are optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table. | Data warehousing systems are optimized for bulk loads and large, complex, unpredictable queries that access many rows per table. |
| Operational systems are widely process-oriented. | Data warehousing systems are widely subject-oriented. |
| Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data. | Data warehousing systems are usually optimized to perform fast retrievals of relatively high volumes of data. |
| Relational databases are created for On-Line Transaction Processing (OLTP). | Data warehouses are designed for On-Line Analytical Processing (OLAP). |
Integrated: Data is gathered into the data warehouse from a variety of sources and
merged into a coherent whole.
Time-variant: All data in the data warehouse is identified with a particular time period.
Non-volatile: Data is stable in a data warehouse. More data is added but data is never removed.
It can be used for decision support, to manage and control the business, and by managers
and end-users to understand the business and make judgments.
It is a database designed for analytical tasks. Its content is periodically updated. It contains
current and historical data to provide a historical perspective of information.
The data warehouse architecture is based on a database management system server.
The data entered into the data warehouse is transformed into an integrated structure and format.
The transformation process involves conversion, summarization, filtering, and condensation.
The data warehouse must be capable of holding and managing large volumes of data, as well as
different data structures, over time.
1. Source layer: A data warehouse system uses heterogeneous sources of data. That data
is stored initially in corporate relational databases or legacy databases, or it may come
from an information system outside the corporate walls.
2. Data staging: The data stored in the sources should be extracted, cleansed to remove
inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one
standard schema. The so-called Extraction, Transformation, and Loading (ETL) tools
can combine heterogeneous schemata, and extract, transform, cleanse, validate, filter, and
load source data into a data warehouse.
3. Data transformation tools: They perform conversions, summarization, key changes, and
structural changes. The data transformation is required so that the data can be used by decision
support tools. The transformation produces programs and control statements, and moves the
data into the data warehouse from multiple operational systems. The functionalities of these
tools are listed below:
○ Removing unwanted data from operational databases
○ Converting to common data names and attributes
○ Calculating summaries and derived data
○ Establishing defaults for missing data
○ Accommodating source data definition changes
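As an illustration, here is a minimal Python sketch of a per-record transformation covering three of the functionalities above; all source and target field names (cust_nm, rgn, etc.) are hypothetical.

```python
# A minimal sketch of the transformation functionalities listed above.
# All source and target field names (cust_nm, rgn, etc.) are hypothetical.

def transform(record: dict) -> dict:
    """Convert one operational record into the warehouse format."""
    # Converting to common data names and attributes
    row = {
        "customer_name": record.get("cust_nm"),
        "region": record.get("rgn"),
        "quantity": record.get("qty", 0),        # default for missing data
        "unit_price": record.get("price", 0.0),  # default for missing data
    }
    # Establishing defaults for missing data
    if row["region"] is None:
        row["region"] = "UNKNOWN"
    # Calculating summaries and derived data
    row["total_amount"] = row["quantity"] * row["unit_price"]
    return row

# One row extracted from an operational system (no region recorded)
print(transform({"cust_nm": "Rohan", "qty": 2, "price": 150.0}))
```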
4. Metadata
It is data about data. It is used for maintaining, managing, and using the data warehouse. It is
classified into two types:
Technical Metadata: It contains information about warehouse data used by the warehouse
designer and administrator to carry out development and management tasks. It includes:
○ Information about data stores
○ Transformation descriptions, i.e., the mapping methods from operational databases to warehouse data
○ Warehouse object and data structure definitions for target data
○ The rules used to perform clean-up and data enhancement
○ Data mapping operations
○ Access authorization, backup history, archive history, information delivery history, data acquisition history, data access, etc.
Business Metadata: It contains information that gives users a business-oriented view of the data
stored in the data warehouse. It includes:
○ Subject areas and information object types, including queries, reports, images, video, and audio clips
○ Internet home pages
○ Information related to the information delivery system
○ Data warehouse operational information such as ownership, audit trails, etc.
Metadata helps users to understand the content and find the data. Metadata is stored in a separate
data store known as the information directory or metadata repository, which helps to
integrate, maintain, and view the contents of the data warehouse.
The information directory is the gateway to the data warehouse environment. It should:
○ Support easy distribution and replication of its content for high performance and availability
○ Be searchable by business-oriented keywords
○ Act as a launch platform for end users to access data and analysis tools
○ Support the sharing of information
○ Support end-user monitoring of the status of the data warehouse environment
5. Data marts
A data mart is an inexpensive alternative to the data warehouse and is based on a single subject
area. A data mart is used in the following situations:
6. Access tools
Their purpose is to provide information to business users for decision making. There are five categories:
○ Data query and reporting tools
○ Application development tools
○ Executive information system (EIS) tools
○ OLAP tools
○ Data mining tools
Query and reporting tools: used to generate queries and reports. There are two types of reporting
tools: production reporting tools, used to generate regular operational reports, and desktop
report writers, which are inexpensive desktop tools designed for end users.
Managed query tools: used to generate SQL queries. They use a meta layer of software between
users and databases which offers point-and-click creation of SQL statements.
Application development tools: a graphical data access environment which integrates
OLAP tools with the data warehouse and can be used to access all database systems.
OLAP tools: used to analyze the data in multidimensional and complex views.
Data mining tools: used to discover knowledge from the data warehouse data.
7. Data warehouse administration and management
These functions include:
○ Security and priority management
○ Monitoring updates from multiple sources
○ Data quality checks
○ Managing and updating metadata
○ Auditing and reporting data warehouse usage and status
○ Purging data
○ Replicating, subsetting, and distributing data
○ Backup and recovery
○ Data warehouse storage management, which includes capacity planning, hierarchical storage management, purging of aged data, etc.
What is ETL?
The mechanism of extracting information from source systems and bringing it into the data
warehouse is commonly called ETL, which stands for Extraction, Transformation and
Loading.
The ETL process requires active inputs from various stakeholders, including developers,
analysts, testers, and top executives, and is technically challenging.
To maintain its value as a tool for decision-makers, a data warehouse needs to change
with business changes. ETL is a recurring activity (daily, weekly, or monthly) of a data warehouse
system and needs to be agile, automated, and well documented.
How ETL Works?
Extraction
○ Extraction is the operation of extracting information from a source system for further use
in a data warehouse environment. This is the first stage of the ETL process.
○ Extraction process is often one of the most time-consuming tasks in the ETL.
○ The source systems might be complicated and poorly documented, and thus determining
which data needs to be extracted can be difficult.
○ The data has to be extracted several times in a periodic manner to supply all changed data
to the warehouse and keep it up-to-date.
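To make the idea of periodic, incremental extraction concrete, here is a minimal sketch using SQLite; the orders table, its last_modified column, and the watermark value are all hypothetical.

```python
import sqlite3

# A watermark (the timestamp of the previous run) lets each periodic run
# extract only the rows that changed since then.
def extract_changed_rows(conn: sqlite3.Connection, watermark: str):
    cur = conn.execute(
        "SELECT order_id, amount, last_modified FROM orders "
        "WHERE last_modified > ?", (watermark,))
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, amount REAL, last_modified TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 99.0, '2024-01-01'), (2, 45.0, '2024-03-01')")

# Only order 2 changed after the last extraction on 2024-02-01
print(extract_changed_rows(conn, "2024-02-01"))  # [(2, 45.0, '2024-03-01')]
```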
Cleansing
The cleansing stage is crucial in a data warehouse technique because it is supposed to improve
data quality. The primary data cleansing features found in ETL tools are rectification and
homogenization. They use specific dictionaries to rectify typing mistakes and to recognize
synonyms, as well as rule-based cleansing to enforce domain-specific rules and define
appropriate associations between values.
If an enterprise wishes to contact its users or its suppliers, a complete, accurate and up-to-date
list of contact addresses, email addresses and telephone numbers must be available.
If a client or supplier calls, the staff responding should be able to quickly find the person in the
enterprise database, but this requires that the caller's name or his/her company name is listed in
the database.
If a user appears in the database with two or more slightly different names or different account
numbers, it becomes difficult to update the customer's information.
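A minimal Python sketch of dictionary-based rectification and homogenization as described above; the dictionaries and city names are illustrative, not taken from any real tool.

```python
# Rectification corrects typing mistakes; homogenization maps synonyms
# to one standard form. Both dictionaries here are illustrative.
TYPO_DICT = {"Mumbay": "Mumbai", "Dehli": "Delhi"}
SYNONYM_DICT = {"Bombay": "Mumbai", "Calcutta": "Kolkata"}

def cleanse_city(raw: str) -> str:
    city = raw.strip().title()
    city = TYPO_DICT.get(city, city)     # rectification
    return SYNONYM_DICT.get(city, city)  # homogenization

# Three spellings of the same city collapse to a single value, so the
# customer no longer appears under several slightly different records.
print([cleanse_city(c) for c in ["mumbay", "Bombay ", "Mumbai"]])
# ['Mumbai', 'Mumbai', 'Mumbai']
```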
Transformation
Transformation is the core of the reconciliation phase. It converts records from their operational
source format into a particular data warehouse format. If we implement a three-layer
architecture, this phase outputs our reconciled data layer.
○ Loose texts may hide valuable information. For example, "XYZ PVT Ltd" does not
explicitly show that this is a limited partnership company.
○ Different formats can be used for individual data items. For example, a date can be saved
as a string or as three integers.
Following are the main transformation processes aimed at populating the reconciled data layer:
○ Conversion and normalization that operate on both storage formats and units of measure
to make data uniform.
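For instance, a minimal sketch of such a conversion, normalizing a date that arrives either as a "DD/MM/YYYY" string or as three integers (both source formats are assumptions for illustration):

```python
from datetime import date

# The same order date may arrive as a "DD/MM/YYYY" string from one source
# and as three integers (day, month, year) from another; both are
# normalized to a single warehouse format (an ISO date string).
def normalize_date(value) -> str:
    if isinstance(value, str):
        day, month, year = map(int, value.split("/"))
    else:
        day, month, year = value
    return date(year, month, day).isoformat()

print(normalize_date("25/12/2023"))    # 2023-12-25
print(normalize_date((25, 12, 2023)))  # 2023-12-25
```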
Cleansing and Transformation processes are often closely linked in ETL tools.
Loading
The load is the process of writing the data into the target database. During the load step, it is
necessary to ensure that the load is performed correctly and with as few resources as possible.
Two basic loading modes exist:
1. Refresh: The data warehouse data is completely rewritten, replacing the older data.
Refresh is usually used in combination with static extraction to populate the data
warehouse initially.
2. Update: Only those changes applied to the source information are added to the data
warehouse. An update is typically carried out without deleting or modifying pre-existing
data. This method is used in combination with incremental extraction to update data
warehouses regularly.
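A minimal sketch of the update mode as an upsert in SQLite; the dw_sales table and its columns are hypothetical, and the ON CONFLICT syntax requires SQLite 3.24 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw_sales (order_id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO dw_sales VALUES (1, 99.0)")

# Update mode: changed source rows are merged in without deleting
# pre-existing data; matching rows are updated in place.
changed_rows = [(1, 120.0), (2, 45.0)]
conn.executemany(
    "INSERT INTO dw_sales (order_id, amount) VALUES (?, ?) "
    "ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount",
    changed_rows)

print(conn.execute("SELECT * FROM dw_sales ORDER BY order_id").fetchall())
# [(1, 120.0), (2, 45.0)]
```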
Selection of an appropriate ETL tool is an important decision that has to be made when
implementing an ODS or data warehousing application. The ETL tools are required to
provide coordinated access to multiple data sources so that relevant data may be extracted from
them. An ETL tool would generally contain tools for data cleansing, re-organization,
transformation, aggregation, calculation, and automatic loading of information into the target
database.
An ETL tool should provide a simple user interface that allows data cleansing and data
transformation rules to be specified using a point-and-click approach. When all mappings and
transformations have been defined, the ETL tool should automatically generate the data
extract/transformation/load programs, which typically run in batch mode.
8. It gives a useful data administration tool to manage corporate information assets with the
data dictionary.
Data Mart
The fundamental use of a data mart is for Business Intelligence (BI) applications. BI is used to
gather, store, access, and analyze records. Smaller businesses can use a data mart to utilize the
data they have accumulated, since it is less expensive than implementing a data warehouse.
○ Ease of creation
○ Potential clients are more clearly defined than in a comprehensive data warehouse
A dependent data mart is a logical subset or a physical subset of a higher-level data warehouse.
According to this technique, the data marts are treated as the subsets of a data warehouse. In this
technique, a data warehouse is created first, from which various data marts can then be
created. These data marts are dependent on the data warehouse and extract the essential records
from it. In this technique, since the data warehouse creates the data marts, there is no need
for data mart integration. It is also known as a top-down approach.
Other than these two categories, one more type exists, called "hybrid data marts."
A hybrid data mart allows us to combine input from sources other than a data warehouse. This could
be helpful in many situations, especially when ad hoc integrations are needed, such as after a new
group or product is added to the organization.
| Data Warehouse | Data Mart |
| --- | --- |
| It may hold multiple subject areas. | It holds only one subject area. For example, Finance or Sales. |
| It holds very detailed information. | It may hold more summarized data. |
| In data warehousing, the fact constellation schema is used. | In a data mart, the star schema and snowflake schema are used. |
Dimensional modeling
Dimensional modeling represents data with a cube operation, making the logical data
representation more suitable for OLAP data management. The concept of dimensional modeling was
developed by Ralph Kimball and consists of "fact" and "dimension" tables.
In dimensional modeling, the transaction record is divided into either "facts," which are
frequently numerical transaction data, or "dimensions," which are the reference information that
gives context to the facts. For example, a sale transaction can be broken down into facts such as the
number of products ordered and the price paid for the products, and into dimensions such as
order date, user name, product number, order ship-to and bill-to locations, and the salesperson
responsible for receiving the order.
Dimensional modeling has two goals:
1. To produce a database architecture that is easy for end clients to understand and write
queries against.
2. To maximize the efficiency of queries. It achieves these goals by minimizing the number
of tables and the relationships between them.
Advantages of Dimensional Modeling
Following are the benefits of dimensional modeling:
Dimensional modeling promotes data quality: The star schema enables warehouse
administrators to enforce referential integrity checks on the data warehouse. Since the fact
table's key is a concatenation of the keys of its associated dimensions, a fact record
is only loaded if the corresponding dimension records are duly defined and also exist in
the database.
By enforcing foreign key constraints as a form of referential integrity check, data warehouse
DBAs add a line of defense against corrupted warehouse data.
Performance optimization is possible through aggregates: As the size of the data warehouse
increases, performance optimization develops into a pressing concern. Customers who have to
wait for hours to get a response to a query will quickly become discouraged with the warehouse.
Aggregates are one of the easiest methods by which query performance can be optimized.
Disadvantages of Dimensional Modeling
1. To maintain the integrity of facts and dimensions, loading the data warehouse with
records from various operational systems is complicated.
2. It is difficult to modify the data warehouse operation if the organization adopting the
dimensional technique changes the method in which it does business.
Fact
It is a collection of associated data items, consisting of measures and context data. It typically
represents business items or business transactions.
Dimensions
It is a collection of data which describes one business dimension. Dimensions decide the
contextual background for the facts, and they are the framework over which OLAP is performed.
Measure
It is a numeric attribute of a fact, representing the performance or behavior of the business
relative to the dimensions.
Considering the relational context, there are two basic models which are used in dimensional
modeling:
○ Star Model
○ Snowflake Model
The star model is the underlying structure for a dimensional model. It has one broad central table
(the fact table) and a set of smaller tables (the dimensions) arranged in a radial design around the
central table. The snowflake model is the result of decomposing one or more of the
dimensions.
Fact Table
Fact tables are used to store facts or measures of the business. Facts are the numeric data elements
that are of interest to the company.
The fact table includes the numerical values of what we measure. For example, a fact value of 20
might mean that 20 widgets have been sold.
Each fact table includes the keys to associated dimension tables. These are known as foreign
keys in the fact table.
Compared to dimension tables, fact tables have a large number of rows.
Dimension Table
Dimension tables establish the context of the facts. Dimensional tables store fields that describe
the facts.
The dimension tables include descriptive data about the numerical values in the fact table. That
is, they contain the attributes of the facts. For example, the dimension tables for a marketing
analysis function might include attributes such as time, marketing region, and product type.
Since the record in a dimension table is denormalized, it usually has a large number of columns.
The dimension tables include significantly fewer rows of information than the fact table.
The attributes in a dimension table are used as row and column headings in a document or query
results display.
Example: A store summary in a fact table can be viewed by city and state, an item summary can be
viewed by brand, color, etc., and customer information can be viewed by name and address.
Fact Table

| Time ID | Product ID | Customer ID | Units Sold |
| --- | --- | --- | --- |
| 4 | 17 | 2 | 1 |
| 8 | 21 | 3 | 2 |
| 8 | 4 | 1 | 1 |
In this example, the Customer ID column in the fact table is the foreign key that joins with the
dimension table. By following the links, we can see that row 2 of the fact table records the fact
that customer 3, Gaurav, bought two items on day 8.
Dimension Table (Customer)

| Customer ID | Name | Gender | Income | Education | Region |
| --- | --- | --- | --- | --- | --- |
| 1 | Rohan | Male | 2 | 3 | 4 |
| 2 | Sandeep | Male | 3 | 5 | 1 |
| 3 | Gaurav | Male | 1 | 7 | 3 |
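A minimal sketch of the join described above, using SQLite; only the columns needed for the example are modeled, and the table names (fact_sales, dim_customer) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (time_id INT, product_id INT, customer_id INT, units_sold INT);
INSERT INTO fact_sales VALUES (4, 17, 2, 1), (8, 21, 3, 2), (8, 4, 1, 1);
CREATE TABLE dim_customer (customer_id INT PRIMARY KEY, name TEXT, gender TEXT);
INSERT INTO dim_customer VALUES (1, 'Rohan', 'Male'), (2, 'Sandeep', 'Male'), (3, 'Gaurav', 'Male');
""")

# The customer_id foreign key links each fact row to its dimension row.
rows = conn.execute("""
    SELECT f.time_id, c.name, f.units_sold
    FROM fact_sales f
    JOIN dim_customer c ON f.customer_id = c.customer_id
    WHERE c.name = 'Gaurav'
""").fetchall()
print(rows)  # [(8, 'Gaurav', 2)] -> Gaurav bought two items on day 8
```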
Hierarchy
A hierarchy is a directed tree whose nodes are dimensional attributes and whose arcs model
many-to-one associations between dimensional attributes. It contains a dimension, positioned at
the tree's root, and all of the dimensional attributes that describe it.
Multidimensional model
A multidimensional model views data in the form of a data-cube. A data cube enables data to be
modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
The dimensions are the perspectives or entities concerning which an organization keeps records.
For example, a shop may create a sales data warehouse to keep records of the store's sales for the
dimensions time, item, and location. These dimensions allow the store to keep track of things, for
example, monthly sales of items and the locations at which the items were sold. Each dimension
has a table related to it, called a dimension table, which describes the dimension further. For
example, a dimension table for an item may contain the attributes item_name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales. This
theme is represented by a fact table. Facts are numerical measures. The fact table contains the
names of the facts or measures of the related dimensional tables.
Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in
the table. In this 2D representation, the sales for Delhi are shown for the time dimension
(organized in quarters) and the item dimension (classified according to the types of items sold).
The fact or measure displayed is rupees_sold (in thousands).
Now, suppose we want to view the sales data with a third dimension. For example, suppose the data
according to time and item, as well as location, is considered for the cities Chennai, Kolkata,
Mumbai, and Delhi. These 3D data are shown in the table. The 3D data of the table are
represented as a series of 2D tables.
Conceptually, the same data may also be represented in the form of a 3D data cube, as
shown in fig:
Data that is grouped or combined in multidimensional matrices is called a data cube. The data
cube method has a few alternative names or variants, such as "multidimensional
databases," "materialized views," and "OLAP (On-Line Analytical Processing)."
The general idea of this approach is to materialize certain expensive computations that are
frequently inquired.
For example, a relation with the schema sales (part, supplier, customer, and sale-price) can be
materialized into a set of eight views as shown in fig, where psc indicates a view consisting of
aggregate function value (such as total-sales) computed by grouping three attributes part,
supplier, and customer, p indicates a view composed of the corresponding aggregate function
values calculated by grouping part alone, etc.
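A minimal sketch that enumerates these eight views by generating one aggregate query per subset of {part, supplier, customer}; the sales table and sale_price column follow the schema named above, and the generated SQL is only printed, not executed.

```python
from itertools import combinations

# Each subset of {part, supplier, customer} defines one aggregate view;
# there are 2^3 = 8 of them, matching psc, ps, pc, sc, p, s, c and the
# grand total in the text.
dims = ("part", "supplier", "customer")
for r in range(len(dims), -1, -1):
    for group in combinations(dims, r):
        select = ", ".join(group + ("SUM(sale_price) AS total_sales",))
        sql = f"SELECT {select} FROM sales"
        if group:
            sql += " GROUP BY " + ", ".join(group)
        print(sql)
```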
A data cube is created from a subset of attributes in the database. Specific attributes are chosen to
be measure attributes, i.e., the attributes whose values are of interest. Other attributes are
selected as dimensions or functional attributes. The measure attributes are aggregated according
to the dimensions.
For example, XYZ may create a sales data warehouse to keep records of the store's sales for the
dimensions time, item, branch, and location. These dimensions enable the store to keep track of
things like monthly sales of items, and the branches and locations at which the items were sold.
Each dimension may have a table associated with it, known as a dimension table, which describes
the dimension. For example, a dimension table for items may contain the attributes item_name,
brand, and type.
Data cube method is an interesting technique with many applications. Data cubes could be sparse
in many cases because not every cell in each dimension may have corresponding data in the
database.
If a query contains constants at even lower levels than those provided in a data cube, it is not
clear how to make the best use of the precomputed results stored in the data cube.
The model views data in the form of a data cube. OLAP tools are based on the multidimensional
data model. Data cubes usually model n-dimensional data.
A data cube enables data to be modeled and viewed in multiple dimensions. A multidimensional
data model is organized around a central theme, like sales and transactions. A fact table
represents this theme. Facts are numerical measures. Thus, the fact table contains measures (such
as Rs_sold) and keys to each of the related dimensional tables.
A data cube is defined by its dimensions and facts. Facts are generally quantities, which are used for
analyzing the relationships between dimensions.
Example: In the 2-D representation, we will look at the All Electronics sales data for items
sold per quarter in the city of Vancouver. The measure displayed is dollars_sold (in thousands).
3-Dimensional Cuboids
Suppose we would like to view the sales data with a third dimension. For example, suppose
we would like to view the data according to time and item, as well as location, for the cities
Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars_sold (in
thousands). These 3-D data are shown in the table. The 3-D data of the table are represented as a
series of 2-D tables.
Conceptually, we may represent the same data in the form of 3-D data cubes, as shown in fig:
Let us suppose that we would like to view our sales data with an additional fourth dimension,
such as a supplier.
In data warehousing, the data cubes are n-dimensional. The cuboid which holds the lowest level
of summarization is called a base cuboid.
For example, the 4-D cuboid in the figure is the base cuboid for the given time, item, location,
and supplier dimensions.
The figure shows a 4-D data cube representation of sales data, according to the dimensions time,
item, location, and supplier. The measure displayed is dollars_sold (in thousands).
The topmost 0-D cuboid, which holds the highest level of summarization, is known as the apex
cuboid. In this example, this is the total sales, or dollars sold, summarized over all four
dimensions.
The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids creating a 4-D data
cube for the dimensions time, item, location, and supplier. Each cuboid represents a different
degree of summarization.
Schemas of the Dimensional Model
There are three schemas used to design a dimensional model:
1. Star schema
2. Snowflake schema
3. Fact constellation
Star Schema
A star schema is the elementary form of a dimensional model, in which data are organized into
facts and dimensions. A fact is an event that is counted or measured, such as a sale or log in. A
dimension includes reference data about the fact, such as date, item, or customer.
A star schema is a relational schema whose design represents a
multidimensional data model. The star schema is the simplest data warehouse schema. It is known
as a star schema because the entity-relationship diagram of this schema resembles a star, with
points diverging from a central table. The center of the schema consists of a large fact table, and
the points of the star are the dimension tables.
Fact Tables
A fact table is a table in a star schema which contains facts and is connected to dimensions. A fact
table has two types of columns: those that contain facts and those that are foreign keys to the
dimension tables. The primary key of the fact table is generally a composite key that is made up of
all of its foreign keys.
A fact table might contain either detail-level facts or facts that have been aggregated (fact tables
that contain aggregated facts are often instead called summary tables). A fact table generally
contains facts at the same level of aggregation.
Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that categorize data.
If a dimension has no hierarchies and levels, it is called a flat dimension or list. The
primary keys of each of the dimension tables are part of the composite primary key of the fact
table. Dimensional attributes help to describe the dimensional value. They are generally
descriptive, textual values. Dimension tables are usually smaller in size than fact tables.
Fact tables store data about sales, while dimension tables store data about the geographic regions
(markets, cities), clients, products, times, and channels.
○ It provides a flexible design that can be changed easily or added to throughout the
development cycle, and as the database grows.
○ It provides a parallel in design to how end-users typically think of and use the data.
Because a star schema database has a small number of tables and clear join paths, queries run faster
than they do against OLTP systems. Small single-table queries, frequently of a dimension table,
are almost instantaneous. Large join queries that contain multiple tables take only seconds or
minutes to run.
In a star schema database design, the dimensions are connected only through the central fact table.
When two dimension tables are used in a query, only one join path, intersecting the fact table,
exists between those two tables. This design feature enforces accurate and consistent query
results.
Example: Suppose a star schema is composed of a fact table, SALES, and several dimension
tables connected to it for time, branch, item, and geographic locations.
The TIME table has columns for day, month, quarter, and year. The ITEM table has
columns for item_key, item_name, brand, type, and supplier_type. The BRANCH table has
columns for branch_key, branch_name, and branch_type. The LOCATION table has columns of
geographic data, including street, city, state, and country.
In this scenario, the SALES table contains only four columns with IDs from the dimension
tables, TIME, ITEM, BRANCH, and LOCATION, instead of four columns for time data, four
columns for ITEM data, three columns for BRANCH data, and four columns for LOCATION
data. Thus, the size of the fact table is significantly reduced. When we need to change an item,
we need only make a single change in the dimension table, instead of making many changes in
the fact table.
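A minimal sketch of this star schema in SQLite, plus one typical query that joins through the central fact table; the exact key names and the measure columns (rupees_sold, units_sold) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INT, month INT, quarter INT, year INT);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, state TEXT, country TEXT);
CREATE TABLE sales (
    time_key     INT REFERENCES time_dim,
    item_key     INT REFERENCES item,
    branch_key   INT REFERENCES branch,
    location_key INT REFERENCES location,
    rupees_sold  REAL,
    units_sold   INT
);
""")

# A typical star-schema query: every join path passes through the
# central SALES fact table.
query = """
SELECT l.city, t.quarter, SUM(s.rupees_sold) AS total_sales
FROM sales s
JOIN time_dim t ON s.time_key = t.time_key
JOIN location l ON s.location_key = l.location_key
GROUP BY l.city, t.quarter
"""
print(conn.execute(query).fetchall())  # [] until fact rows are loaded
```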
Snowflake Schema
A snowflake schema is a variant of the star schema. "A schema is known as a snowflake if one
or more dimension tables do not connect directly to the fact table but must join through other
dimension tables."
The snowflake schema is an expansion of the star schema where each point of the star explodes
into more points. It is called snowflake schema because the diagram of snowflake schema
resembles a snowflake. Snowflaking is a method of normalizing the dimension tables in a STAR
schemas. When we normalize all the dimension tables entirely, the resultant structure resembles
a snowflake with the fact table in the middle.
Snowflaking is used to improve the performance of specific queries. The schema is diagrammed
with each fact surrounded by its associated dimensions, and those dimensions are related to other
dimensions, branching out into a snowflake pattern.
The snowflake schema consists of one fact table which is linked to many dimension tables,
which can be linked to other dimension tables through a many-to-one relationship. Tables in a
snowflake schema are generally normalized to the third normal form. Each dimension table
represents exactly one level in a hierarchy.
The following diagram shows a snowflake schema with two dimensions, each having three
levels. A snowflake schema can have any number of dimensions, and each dimension can have
any number of levels.
Example: Figure shows a snowflake schema with a Sales fact table, with Store, Location, Time,
Product, Line, and Family dimension tables. The Market dimension has two dimension tables
with Store as the primary dimension table, and Location as the outrigger dimension table. The
product dimension has three dimension tables with Product as the primary dimension table, and
the Line and Family table are the outrigger dimension tables.
A star schema stores all attributes for a dimension in one denormalized table. This requires more
disk space than a more normalized snowflake schema. Snowflaking normalizes the dimension by
moving attributes with low cardinality into separate dimension tables that relate to the core
dimension table by using foreign keys. Snowflaking for the sole purpose of minimizing disk
space is not recommended, because it can adversely impact query performance.
In a snowflake schema, tables are normalized to remove redundancy: dimension tables
are broken down into multiple dimension tables.
The figure shows a simple STAR schema for sales in a manufacturing company. The sales fact table
includes quantity, price, and other relevant metrics. SALESREP, CUSTOMER, PRODUCT, and
TIME are the dimension tables.
The STAR schema for sales, as shown above, contains only five tables, whereas the normalized
(snowflake) version extends to eleven tables. We will notice that in the snowflake schema, the
attributes with low cardinality in each original dimension table are removed to form separate tables.
These new tables are connected back to the original dimension table through artificial keys.
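A minimal DDL sketch of this idea, contrasting a denormalized star dimension with its snowflaked version; the product/brand tables and all column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star version: one denormalized PRODUCT dimension; the low-cardinality
-- brand attributes repeat on every product row.
CREATE TABLE product_star (
    product_key   INTEGER PRIMARY KEY,
    product_name  TEXT,
    brand_name    TEXT,
    brand_country TEXT
);

-- Snowflaked version: the brand attributes move to a separate table,
-- connected back to PRODUCT through an artificial key.
CREATE TABLE brand (
    brand_key     INTEGER PRIMARY KEY,
    brand_name    TEXT,
    brand_country TEXT
);
CREATE TABLE product_snow (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    brand_key    INT REFERENCES brand(brand_key)
);
""")
print("star and snowflake dimension tables created")
```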
A snowflake schema is designed for flexible querying across more complex dimensions and
relationships. It is suitable for many-to-many and one-to-many relationships between dimension
levels.
The fact constellation schema describes a logical structure of a data warehouse or data mart. A fact
constellation schema can be designed with a collection of de-normalized fact, shared, and
conformed dimension tables.
The primary disadvantage of the fact constellation schema is that it is a more challenging design
because many variants for specific kinds of aggregation must be considered and selected.
Information processing deals with querying, statistical analysis, and reporting via tables, charts, or
graphs. Nowadays, information processing of a data warehouse is done by constructing low-cost,
web-based accessing tools typically integrated with web browsers.
Analytical Processing
It supports various online analytical processing such as drill-down, roll-up, and pivoting. The
historical data is being processed in both summarized and detailed format.
OLAP is implemented on data warehouses or data marts. The primary objective of OLAP is to
support the ad-hoc querying needed to support DSS. The multidimensional view of data is
fundamental to the OLAP application. OLAP is an operational view, not a data structure or
schema. The complex nature of OLAP applications requires a multidimensional view of the data.
Data Mining
It helps in the analysis of hidden patterns and associations, constructing analytical models,
performing classification and prediction, and presenting the mining results using visualization
tools.
Data mining is the technique of discovering essential new correlations, patterns, and trends by
sifting through large amounts of data stored in repositories, using pattern recognition
technologies as well as statistical and mathematical techniques.
It is the process of selection, exploration, and modeling of huge quantities of data to
determine regularities or relations that are at first unknown, in order to obtain precise and useful
results for the owner of the database.
OLAP implements the multidimensional analysis of business information and supports the
capability for complex estimations, trend analysis, and sophisticated data modeling. It is rapidly
becoming the essential foundation for intelligent solutions including Business Performance
Management, Planning, Budgeting, Forecasting, Financial Reporting, Analysis,
Simulation Models, Knowledge Discovery, and Data Warehouse Reporting. OLAP enables
end clients to perform ad hoc analysis of data in multiple dimensions, providing the insight
and understanding they require for better decision making.
○ Budgeting
○ Activity-based costing
○ Customer analysis
Production
○ Production planning
○ Defect analysis
OLAP cubes have two main purposes. The first is to provide business users with a data model
more intuitive to them than a tabular model. This model is called a Dimensional Model.
The second purpose is to enable fast query response that is usually difficult to achieve using
tabular models.
3) Accessibility: It provides access only to the data that is actually required to perform the
particular analysis, present a single, coherent, and consistent view to the clients. The OLAP
system must map its own logical schema to the heterogeneous physical data stores and perform
any necessary transformations. The OLAP operations should be sitting between data sources
(e.g., data warehouses) and an OLAP front-end.
4) Consistent Reporting Performance: To make sure that the users do not feel any significant
degradation in documenting performance as the number of dimensions or the size of the database
increases. That is, the performance of OLAP should not suffer as the number of dimensions is
increased. Users must observe consistent run time, response time, or machine utilization every
time a given query is run.
5) Client/Server Architecture: Make the server component of OLAP tools sufficiently
intelligent that various clients can be attached with a minimum of effort and integration
programming. The server should be capable of mapping and consolidating data between
dissimilar databases.
7) Dynamic Sparse Matrix Handling: Adapt the physical schema to the specific analytical
model being created and loaded so that sparse matrix handling is optimized. When encountering a
sparse matrix, the system must be able to dynamically deduce the distribution of the information
and adjust the storage and access to obtain and maintain a consistent level of performance.
8) Multiuser Support: OLAP tools must provide concurrent data access, data integrity, and
access security.
10) Intuitive Data Manipulation: Data manipulation fundamental to the consolidation path,
such as reorientation (pivoting), drill-down and roll-up, and other manipulations, should be
accomplished naturally and precisely via point-and-click and drag-and-drop methods on the cells
of the analytical model. It avoids the use of a menu or multiple trips to a user interface.
11) Flexible Reporting: It gives business clients the ability to organize columns,
rows, and cells in a manner that facilitates simple manipulation, analysis, and synthesis of data.
12) Unlimited Dimensions and Aggregation Levels: The number of data dimensions should be
unlimited. Each of these common dimensions must allow a practically unlimited number of
customer-defined aggregation levels within any given consolidation path.
Roll-Up
The roll-up operation (also known as the drill-up or aggregation operation) performs aggregation
on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension
reduction. Roll-up is like zooming out on the data cube. The figure shows the result of roll-up
operations performed on the dimension location. The hierarchy for the location is defined as the
order street < city < province or state < country. The roll-up operation aggregates the data by
ascending the location hierarchy from the level of the city to the level of the country.
When a roll-up is performed by dimension reduction, one or more dimensions are removed
from the cube. For example, consider a sales data cube having two dimensions, location and
time. Roll-up may be performed by removing the time dimension, resulting in an aggregation
of the total sales by location, rather than by location and by time.
Example
Consider the following cube illustrating the temperature of certain days recorded weekly:

| Temperature | 64 | 65 | 68 | 69 | 70 | 71 | 72 | 75 | 80 | 81 | 83 | 85 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Week1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Week2 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 2 | 0 | 1 | 0 | 0 |
Consider that we want to set up levels (hot (80-85), mild (70-75), cool (64-69)) in temperature
from the above cube.
To do this, we have to group the columns and add up the values according to the concept
hierarchy. This operation is known as a roll-up. The result is:

| | cool | mild | hot |
| --- | --- | --- | --- |
| Week1 | 2 | 1 | 1 |
| Week2 | 1 | 3 | 1 |
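The same grouping can be checked with a few lines of Python; this sketch simply sums each week's counts within the cool/mild/hot ranges defined above.

```python
# Roll-up of the weekly temperature cube: occurrence counts are grouped
# into the cool/mild/hot levels of the concept hierarchy and summed.
temps = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
week1 = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0]
week2 = [0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 0, 0]

def level(t):
    return "cool" if t <= 69 else ("mild" if t <= 75 else "hot")

def roll_up(counts):
    totals = {"cool": 0, "mild": 0, "hot": 0}
    for t, c in zip(temps, counts):
        totals[level(t)] += c
    return totals

print("Week1", roll_up(week1))  # {'cool': 2, 'mild': 1, 'hot': 1}
print("Week2", roll_up(week2))  # {'cool': 1, 'mild': 3, 'hot': 1}
```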
Drill-Down
The drill-down operation (also called roll-down) is the reverse of roll-up. Drill-down
is like zooming in on the data cube. It navigates from less detailed data to more detailed data.
Drill-down can be performed by either stepping down a concept hierarchy for a dimension or
adding additional dimensions.
Figure shows a drill-down operation performed on the dimension time by stepping down a
concept hierarchy which is defined as day, month, quarter, and year. Drill-down appears by
descending the time hierarchy from the level of the quarter to a more detailed level of the month.
Because a drill-down adds more details to the given data, it can also be performed by adding a
new dimension to a cube. For example, a drill-down on the central cubes of the figure can occur
by introducing an additional dimension, such as a customer group.
Example
| Day | cool | mild | hot |
| --- | --- | --- | --- |
| Day 1 | 0 | 0 | 0 |
| Day 2 | 0 | 0 | 0 |
| Day 3 | 0 | 0 | 1 |
| Day 4 | 0 | 1 | 0 |
| Day 5 | 1 | 0 | 0 |
| Day 6 | 0 | 0 | 0 |
| Day 7 | 1 | 0 | 0 |
| Day 8 | 0 | 0 | 0 |
| Day 9 | 1 | 0 | 0 |
| Day 10 | 0 | 1 | 0 |
| Day 11 | 0 | 1 | 0 |
| Day 12 | 0 | 1 | 0 |
| Day 13 | 0 | 0 | 1 |
| Day 14 | 0 | 0 | 0 |
Slice
A slice is a subset of the cube corresponding to a single value for one or more members of a
dimension. For example, a slice operation is executed when the customer wants a selection on
one dimension of a three-dimensional cube, resulting in a two-dimensional slice. So, the slice
operation performs a selection on one dimension of the given cube, thus resulting in a subcube.
| Day | cool |
| --- | --- |
| Day 1 | 0 |
| Day 2 | 0 |
| Day 3 | 0 |
| Day 4 | 0 |
| Day 5 | 1 |
| Day 6 | 1 |
| Day 7 | 1 |
| Day 8 | 1 |
| Day 9 | 1 |
| Day 11 | 0 |
| Day 12 | 0 |
| Day 13 | 0 |
| Day 14 | 0 |
Dice
The dice operation defines a subcube by performing a selection on two or more dimensions.
For example, applying the selection (time = day 3 OR time = day 4) AND (temperature = cool
OR temperature = hot) to the original cube, we get the following subcube (still two-dimensional):
| Day | cool | hot |
| --- | --- | --- |
| Day 3 | 0 | 1 |
| Day 4 | 0 | 0 |
The dice operation on the cube based on the following selection criteria involves three
dimensions:
○ (location = "Toronto" or "Vancouver")
Pivot
The pivot operation is also called rotation. Pivot is a visualization operation which rotates the
data axes in view in order to provide an alternative presentation of the data. It may involve swapping
the rows and columns, or moving one of the row dimensions into the column dimensions.
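A minimal sketch of a pivot with pandas; the sales figures and city/quarter values are made up for illustration.

```python
import pandas as pd

# Pivot (rotation): the same sales data viewed with cities as rows and
# quarters as columns, then with the axes swapped. Figures are made up.
sales = pd.DataFrame({
    "city":        ["Delhi", "Delhi", "Mumbai", "Mumbai"],
    "quarter":     ["Q1", "Q2", "Q1", "Q2"],
    "rupees_sold": [605, 825, 680, 952],
})

by_city = sales.pivot_table(index="city", columns="quarter",
                            values="rupees_sold", aggfunc="sum")
print(by_city)    # cities as rows, quarters as columns
print(by_city.T)  # rotated: quarters as rows, cities as columns
```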
Other OLAP operations may include ranking the top-N or bottom-N elements in lists, as well as
calculating moving averages, growth rates, interest, internal rates of return, depreciation,
currency conversions, and statistical tasks.
OLAP offers analytical modeling capabilities, including a calculation engine for determining
ratios, variances, etc., and for computing measures across multiple dimensions. It can generate
summarizations, aggregations, and hierarchies at each granularity level and at every dimension
intersection. OLAP also provides functional models for forecasting, trend analysis, and statistical
analysis. In this context, the OLAP engine is a powerful data analysis tool.
Types of OLAP
There are three main types of OLAP servers, as follows:
ROLAP stands for Relational OLAP, an application based on relational DBMSs.
MOLAP stands for Multidimensional OLAP, an application based on multidimensional DBMSs.
HOLAP stands for Hybrid OLAP, an application using both relational and multidimensional
techniques.
ROLAP servers use a relational or extended-relational DBMS to store and manage warehouse data,
and OLAP middleware to provide missing pieces.
ROLAP servers contain optimization for each DBMS back end, implementation of aggregation
navigation logic, and additional tools and services.
ROLAP systems work primarily from the data that resides in a relational database, where the
base data and dimension tables are stored as relational tables. This model permits the
multidimensional analysis of data.
This technique relies on manipulating the data stored in the relational database to give the
presence of traditional OLAP's slicing and dicing functionality. In essence, each method of
slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
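For instance, a sketch of how a ROLAP engine might translate slicing and dicing into SQL; the sales_fact table and its columns are illustrative only.

```python
# How a ROLAP engine might translate slicing and dicing into SQL.
# The sales_fact table and its columns are illustrative only.
base = "SELECT item, SUM(rupees_sold) FROM sales_fact"

# Slice: fix one dimension to a single value.
slice_sql = base + " WHERE city = 'Delhi' GROUP BY item"

# Dice: restrict two or more dimensions at once.
dice_sql = (base
            + " WHERE city IN ('Delhi', 'Mumbai') AND quarter = 'Q1'"
            + " GROUP BY item")

print(slice_sql)
print(dice_sql)
```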
ROLAP Architecture
○ Database server.
○ ROLAP server.
○ Front-end tool.
Relational OLAP (ROLAP) is the latest and fastest-growing OLAP technology segment in the
market. This method allows multiple multidimensional views of two-dimensional relational
tables to be created, avoiding structuring records around the desired view.
Some products in this segment have supported strong SQL engines to handle the complexity of
multidimensional analysis. This includes creating multiple SQL statements to handle user
requests, being 'RDBMS-aware', and being capable of generating SQL statements based
on the optimizer of the DBMS engine.
Advantages
Can handle large amounts of information: The data size limitation of ROLAP technology
depends on the data size of the underlying RDBMS. So, ROLAP itself does not restrict the
amount of data.
Can leverage the features of the RDBMS: The RDBMS already comes with a lot of features.
So ROLAP technologies, which work on top of the RDBMS, can leverage these functionalities.
Disadvantages
Performance can be slow: Because each ROLAP report is a SQL query (or multiple SQL queries)
in the relational database, the query time can be prolonged if the underlying data size is large.
One of the significant distinctions of MOLAP from ROLAP is that the data are summarized
and stored in an optimized format in a multidimensional cube, instead of in a relational
database. In the MOLAP model, data are structured into proprietary formats according to clients'
reporting requirements, with the calculations pre-generated on the cubes.
MOLAP Architecture
○ Database server.
○ MOLAP server.
○ Front-end tool.
A MOLAP structure primarily reads precompiled data. It has limited
capabilities to dynamically create aggregations or to evaluate results which have not been
pre-calculated and stored.
Applications requiring iterative and comprehensive time-series analysis of trends are well suited
for MOLAP technology (e.g., financial analysis and budgeting).
Examples include Arbor Software's Essbase, Oracle's Express Server, Pilot Software's Lightship
Server, Sinper's TM1, Planning Sciences' Gentium, and Kenan Technology's Multiway.
Some of the problems faced by clients are related to maintaining support for multiple subject
areas in an RDBMS. Some vendors can solve these problems by providing access from
MOLAP tools to detailed data in an RDBMS.
This can be very useful for organizations with performance-sensitive multidimensional analysis
requirements and that have built or are in the process of building a data warehouse architecture
that contains multiple subject areas.
An example would be the creation of sales data measured by several dimensions (e.g., product
and sales region) to be stored and maintained in a persistent structure. This structure would be
provided to reduce the application overhead of performing calculations and building aggregation
during initialization. These structures can be automatically refreshed at predetermined intervals
established by an administrator.
Advantages
Excellent Performance: A MOLAP cube is built for fast information retrieval, and is optimal
for slicing and dicing operations.
Can perform complex calculations: All calculations have been pre-generated when the cube is
created. Hence, complex calculations are not only possible, but they return quickly.
Disadvantages
Limited in the amount of information it can handle: Because all calculations are performed
when the cube is built, it is not possible to contain a large amount of data in the cube itself.
Requires additional investment: Cube technology is generally proprietary and does not already
exist in the organization. Therefore, to adopt MOLAP technology, chances are other investments
in human and capital resources are needed.
Advantages of HOLAP
1. HOLAP provides the benefits of both MOLAP and ROLAP.
3. HOLAP balances the disk space requirement, as it only stores the aggregate information
on the OLAP server and the detail record remains in the relational database. So no
duplicate copy of the detail record is maintained.
Disadvantages of HOLAP
1. HOLAP architecture is very complicated because it supports both MOLAP and ROLAP
servers.
Other Types
There are also less popular types of OLAP upon which one could stumble every so
often. We have listed some of the less popular types existing in the OLAP industry.
WOLAP pertains to OLAP application which is accessible via the web browser. Unlike
traditional client/server OLAP applications, WOLAP is considered to have a three-tiered
architecture which consists of three components: a client, a middleware, and a database server.
DOLAP permits a user to download a section of the data from the database or source, and work
with that dataset locally, or on their desktop.
Mobile OLAP enables users to access and work on OLAP data and applications remotely
through the use of their mobile devices.
SOLAP includes the capabilities of both Geographic Information Systems (GIS) and OLAP into
a single user interface. It facilitates the management of both spatial and non-spatial data.
Difference between ROLAP, MOLAP, and HOLAP
| ROLAP | MOLAP | HOLAP |
| --- | --- | --- |
| ROLAP stands for Relational Online Analytical Processing. | MOLAP stands for Multidimensional Online Analytical Processing. | HOLAP stands for Hybrid Online Analytical Processing. |
| The ROLAP storage mode causes the aggregations of the partition to be stored in indexed views in the relational database that was specified in the partition's data source. | The MOLAP storage mode causes the aggregations of the partition and a copy of its source information to be saved in a multidimensional structure in Analysis Services when the partition is processed. | The HOLAP storage mode combines attributes of both MOLAP and ROLAP. Like MOLAP, HOLAP causes the aggregations of the partition to be stored in a multidimensional structure in an SQL Server Analysis Services instance. |
| ROLAP does not cause a copy of the source information to be stored in the Analysis Services data folders. Instead, when the outcome cannot be derived from the query cache, the indexed views in the data source are accessed to answer queries. | The MOLAP structure is highly optimized to maximize query performance. The storage area can be on the computer where the partition is defined or on another computer running Analysis Services. Because a copy of the source information resides in the multidimensional structure, queries can be resolved without accessing the partition's source data. | HOLAP does not cause a copy of the source information to be stored. For queries that access only summary data in the aggregations of a partition, HOLAP is the equivalent of MOLAP. |
| Query response is frequently slower with ROLAP storage than with the MOLAP or HOLAP storage modes. Processing time is also frequently slower with ROLAP. | Query response times can be reduced substantially by using aggregations. The data in the partition's MOLAP structure is only as current as the most recent processing of the partition. | Queries that access source data (for example, a drill-down to an atomic cube cell for which there is no aggregation information) must retrieve data from the relational database and will not be as fast as they would be if the source information were stored in the MOLAP architecture. |
Semi-additive
Semi-additive measures can be aggregated across some dimensions, but not all dimensions. For
example, measures such as head counts and inventory are considered semi-additive.
Non-additive
Non-additive measures are measures that cannot be aggregated across any of the dimensions.
These measures cannot be logically aggregated between records or fact rows. Non-additive
measures are usually the result of ratios or other mathematical calculations. The only calculation
that can be made for such a measure is to get a count of the number of rows of such measures.
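To make the distinction concrete, here is a small pandas sketch (the inventory and margin numbers are made up): summing on-hand inventory across products within a day is valid, summing it across days is not (an average is typical), and a ratio column is never summed.

```python
import pandas as pd

# on_hand inventory is semi-additive: summing across products within one
# day is valid, but summing across days is not (an average is typical).
# margin_pct is non-additive: a ratio is never summed across rows.
inv = pd.DataFrame({
    "day":        ["Mon", "Mon", "Tue", "Tue"],
    "product":    ["A", "B", "A", "B"],
    "on_hand":    [10, 5, 8, 7],
    "margin_pct": [0.20, 0.10, 0.25, 0.15],
})

print(inv.groupby("day")["on_hand"].sum())       # valid: stock per day
print(inv.groupby("product")["on_hand"].mean())  # across time: average
print(inv["margin_pct"].mean())                  # ratios: averaged, not summed
```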