OLAP
Objectives
OLTP
OLTP Applications, benefits
OLTP benchmarks
Data partitioning in OLTP
Comparison between OLTP and OLAP
Multi-Dimensional Data Model: Data Cube
OLAP types and operations
Data modeling: Star and Snowflake schema
Denormalization
OLTP
Online transaction processing, or OLTP,
is a class of information systems that
facilitate and manage transaction-oriented
applications, typically for data entry and
retrieval transaction processing.
OLTP has also been used to refer to
processing in which the system responds
immediately to user requests.
OLTP
Online transaction processing (OLTP) involves
gathering input information, processing the
information and updating existing information to
reflect the gathered and processed information.
Most organizations use a database management
system to support OLTP
OLTP is carried out in a client-server system.
Online transaction processing is concerned with
concurrency and atomicity.
OLTP applications
Online transaction processing applications are
high-throughput, insert- or update-intensive
database applications.
An automated teller machine (ATM) for a bank is
an example of a commercial transaction processing
application.
These applications are used concurrently by
hundreds of users. The key goals of OLTP
applications are availability, speed, concurrency
and recoverability.
Online banking is completely based on online
transaction processing systems.
RDBMS used for OLTP
Database Systems have been used traditionally
for OLTP
◦ clerical data processing tasks
◦ detailed, up to date data
◦ structured repetitive tasks
◦ read/update a few records
◦ isolation, recovery and integrity are critical
The data warehouse and the OLTP data base are
both relational databases. However, the
objectives of both these databases are different.
Online transaction processing systems (Advantages)
TPC–C benchmark
The term transaction is often applied to a wide variety of
business and computer functions
A transaction could refer to a set of operations including
disk read/writes, operating system calls, or some form of
data transfer from one subsystem to another
TPC-C is a mixture of read-only and update-intensive
transactions that simulate the activities found in complex
OLTP application environments.
A typical transaction, as defined by the TPC, would
include updating a database system for such things
as inventory control (goods), airline reservations
(services), or banking (money).
TPC–C benchmark
In these environments, a number of customers
or service representatives input and manage
their transactions via a terminal or desktop
computer connected to a database.
Typically, the TPC produces benchmarks that
measure transaction processing (TP) and
database (DB) performance in terms of how
many transactions a given system and database
can perform per unit of time, e.g., transactions
per second or transactions per minute.
TPC-C Benchmark Example
The workload consists of five OLTP transaction types:
New Order – enter a new order from a customer. (45%)
Payment – update the customer balance to reflect a payment. (43%)
Delivery – deliver orders. (4%) The Delivery business transaction
consists of processing a batch of 10 new (not yet delivered) orders.
Order Status – retrieve the status of a customer's most recent order. (4%)
Stock-Level – monitor warehouse inventory. (4%) The Stock-Level business
transaction determines the number of recently sold items that have a
stock level below a specified threshold.
Data partitioning in OLTP
Scalability is the property of a system to accommodate
changes in transaction volume without affecting
performance.
Partitioning is a common technique used for scaling
databases, particularly for scaling updates, by
distributing the partitions across a cluster of nodes
and routing the writes to their respective partitions.
Data partitioning is also the process of logically
and/or physically partitioning data into segments that
are more easily maintained or accessed.
Different partitioning strategies
Vertical partitioning
Horizontal partitioning
◦ Range partition
◦ Hash partition
◦ List partition
Vertical Partitioning
Resumes
SSN     Name  Address   Resume  Picture
234234  Mary  Houston   Clob1…  Blob1…
345345  Sue   Seattle   Clob2…  Blob2…
345343  Joan  Seattle   Clob3…  Blob3…
234234  Ann   Portland  Clob4…  Blob4…

is split vertically into:

T1: SSN, Name, Address    T2: SSN, Resume    T3: SSN, Picture
234234  Mary  Houston     234234  Clob1…     234234  Blob1…
345345  Sue   Seattle     345345  Clob2…     345345  Blob2…
...
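As a sketch in SQL, such a split can be done with CREATE TABLE ... AS SELECT statements (the table names T1–T3 follow the figure; everything else is illustrative):

-- split the wide Resumes table into three narrower tables
CREATE TABLE T1 AS SELECT SSN, Name, Address FROM Resumes;
CREATE TABLE T2 AS SELECT SSN, Resume  FROM Resumes;
CREATE TABLE T3 AS SELECT SSN, Picture FROM Resumes;

-- the original rows can be reassembled by joining on the key
SELECT T1.SSN, Name, Address, Resume, Picture
FROM T1
JOIN T2 ON T1.SSN = T2.SSN
JOIN T3 ON T1.SSN = T3.SSN;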
Horizontal Partitioning
Customers
SSN     Name   City      Country
234234  Mary   Houston   USA
345345  Sue    Seattle   USA
345343  Joan   Seattle   USA
234234  Ann    Portland  USA
--      Frank  Calgary   Canada
--      Jean   Montreal  Canada

is split horizontally into:

CustomersInHouston
SSN     Name   City      Country
234234  Mary   Houston   USA

CustomersInSeattle
SSN     Name   City      Country
345345  Sue    Seattle   USA
345343  Joan   Seattle   USA

CustomersInCanada
SSN     Name   City      Country
--      Frank  Calgary   Canada
--      Jean   Montreal  Canada
Types of Horizontal Partitioning
Range partitioning
Range partitioning maps data to partitions based on ranges of values of the partitioning key that you establish for each partition. It is the most common type of partitioning and is often used with dates. For a table with a date column as the partitioning key, the January-2005 partition would contain rows with partitioning key values from 01-Jan-2005 to 31-Jan-2005.
List partitioning
List partitioning enables you to explicitly
control how rows map to partitions by specifying
a list of discrete values for the partitioning key in
the description for each partition.
E.g. a warehouse table containing sales summary
data by product, state, and month/year could be
partitioned into geographic regions.
The advantage of list partitioning is that you
can group and organize unordered and
unrelated sets of data in a natural way.
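A minimal Oracle-style sketch, assuming a state column as the partitioning key (all names are illustrative):

CREATE TABLE sales_summary (
  product VARCHAR2(30),
  state   VARCHAR2(20),
  amount  NUMBER(10,2)
)
PARTITION BY LIST (state) (
  -- each partition names the discrete key values it holds
  PARTITION region_east VALUES ('New York', 'Virginia'),
  PARTITION region_west VALUES ('California', 'Oregon')
);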
Hash partitioning
Hash partitioning maps data to partitions based on a
hashing algorithm that Oracle applies to the
partitioning key that you identify. The hashing
algorithm evenly distributes rows among partitions,
giving partitions approximately the same size.
Hash partitioning is the ideal method for distributing
data evenly across devices. Hash partitioning is also
an easy-to-use alternative to range partitioning,
especially when the data to be partitioned is not
historical or has no obvious partitioning key.
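A minimal Oracle-style sketch (illustrative names; Oracle names the partitions itself):

CREATE TABLE accounts (
  account_id NUMBER,
  owner_name VARCHAR2(50)
)
-- the hash of the key spreads rows evenly across 4 partitions
PARTITION BY HASH (account_id)
PARTITIONS 4;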
Online Analytical Processing(OLAP)
OLAP is a category of software tools that provides analysis
of data stored in a database.
OLAP is a category of applications and technologies for
collecting, managing, processing, and presenting
multidimensional data for analysis and management
purposes.
OLAP tools allow the user to query, browse, and summarize
information in a very efficient, interactive, and dynamic way.
[Figure: a data warehouse cube with Product, Region, and Time dimensions]
Online analytical processing (OLAP)
Multidimensional data analysis
◦ 3-D graphics, Pivot Tables, Crosstabs, etc.
◦ Compatible with Spreadsheets & Statistical packages
◦ Advanced Data Presentation Functions
Advanced Database Support
◦ Access to many kinds of DBMS’s, flat files, and
internal and external data sources
◦ Support for Very Large Databases
◦ Advanced data navigation
Easy-to-use end-user interfaces
Support Client/Server architecture
Online Analytical Processing (OLAP)
A widely adopted definition for OLAP used today in five key words is:
Fast Analysis of Shared Multidimensional Information (FASMI).
Fast refers to the speed that an OLAP system is able to deliver most
responses to the end user.
Analysis refers to the ability of an OLAP system to manage any business
logic and statistical analysis relevant for the application and user. In
addition, the system must allow users to define new ad hoc calculations as
part of the analysis and report without having to program them.
Shared refers to the ability of an OLAP system to implement all security
requirements necessary for confidentiality, and concurrent update locking
at an appropriate level when multiple write access is required.
Multidimensional refers to the requirement that an OLAP system provide a
multidimensional view of the data, including support for hierarchies and
multiple hierarchies.
Information refers to all of the data and derived data needed, wherever the
data resides and however much of the data is relevant for the application.
Online Analytical Processing (OLAP)
Implemented in a multi-user client/server
mode
Offers consistently rapid response to
queries, regardless of database size and
complexity
OLAP helps user to synthesize enterprise
information and analyze historical data
Operational v/s Information System
2. An OLAP query reads a HUGE amount of data and generates the required result.
The query is also very complex. Thus special primitives have to be provided to
support this kind of data access.
3. OLAP systems access historical data, not current volatile data, while OLTP
systems access current, up-to-date data and do not need historical data.
Multi-Dimensional Data Model: Data Cube
Multidimensional data model views data in the form
of a data cube
A data cube allows data to be modeled and viewed in
multiple dimensions
Dimensions are entities with respect to which an
organization wants to keep records, such as time,
item, branch, and location
◦ A dimension table gives further descriptions about a
dimension, e.g., time (day, week, month, year)
◦ A fact table contains measures and keys to each of the
related dimension tables, e.g., dollars sold
Multi-Dimensional Data Model: Data Cube
OLAP Operations
OLAP provides a user-friendly environment for
interactive data analysis.
A number of OLAP data cube operations exist to
materialize different views of data, allowing interactive
querying and analysis of the data.
The most popular end user operations on dimensional
data are:
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
Drill Up (Roll up)
Roll-up performs aggregation on a data cube in any of the
following ways:
By climbing up a concept hierarchy for a dimension
By dimension reduction
[Figure: roll-up on a cube with Product and Time dimensions]
Drill Up (Roll up)
Roll-up is performed by climbing up a concept
hierarchy for the dimension location.
Initially the concept hierarchy was "street < city <
province < country".
On rolling up, the data is aggregated by ascending
the location hierarchy from the level of city to the
level of country; the data is grouped into countries
rather than cities.
When roll-up is performed by dimension reduction, one
or more dimensions are removed from the data cube.
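In relational terms, this roll-up corresponds to grouping at a coarser level of the hierarchy; a sketch against an assumed sales table:

-- before roll-up: aggregated by city
SELECT city, SUM(amount) AS total FROM sales GROUP BY city;

-- after roll-up: ascend the location hierarchy to country
SELECT country, SUM(amount) AS total FROM sales GROUP BY country;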
Drill Up (Roll up)
Drill Down (roll down)
Drill-down is the reverse operation of roll-up. It is performed by either
of the following ways:
By stepping down a concept hierarchy for a dimension
By introducing a new dimension.
[Figure: drill-down on a cube with Product and Time dimensions]
Drill Down (roll down)
Drill-down is performed by stepping down a
concept hierarchy for the dimension time.
Initially the concept hierarchy was "day < month <
quarter < year".
On drilling down, the time dimension is descended
from the level of quarter to the level of month.
When drill-down is performed, one or more
dimensions are added to the data cube.
It navigates from less detailed data to more
detailed data.
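The same idea as a SQL sketch (table and column names assumed):

-- before drill-down: aggregated by quarter
SELECT quarter, SUM(amount) AS total FROM sales GROUP BY quarter;

-- after drill-down: descend the time hierarchy to month
SELECT quarter, month, SUM(amount) AS total FROM sales GROUP BY quarter, month;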
Drill Down (roll down)
The result of a drill-down operation
performed on the central cube is
obtained by stepping down a
concept hierarchy for time:
descending the time hierarchy from
the level of week to the more
detailed level of day. New
dimensions can also be added to
the cube, because drill-down adds
more detail to the given data.
Slice
The slice operation selects a single dimension of the given cube and
focuses on a portion of the cube, forming a new sub-cube.
[Figure: slicing the cube on Product = Toaster leaves a Region × Time plane]
Slice
Slice performs a selection on one
dimension of the given cube,
resulting in a subcube. For
example, in the cube example
above, making the selection
temperature = cool yields the
subcube restricted to the cool
temperature values.
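In SQL terms, a slice is a single-dimension predicate; a sketch against an assumed sales table:

-- slice: fix one dimension (temperature = 'cool'), keep the others
SELECT item, time_period, SUM(amount) AS total
FROM sales
WHERE temperature = 'cool'
GROUP BY item, time_period;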
Dice
The dice operation creates a sub-cube by
focusing on two or more dimensions: it selects
two or more dimensions from a given cube and
provides a new sub-cube.
For example, a dice operation based on the
following selection criteria involves three
dimensions:
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item = "Mobile" or "Modem")
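The same criteria written as a SQL predicate (a sketch; table and column names assumed):

-- dice: restrict three dimensions at once
SELECT location, time_period, item, SUM(amount) AS total
FROM sales
WHERE (location = 'Toronto' OR location = 'Vancouver')
  AND (time_period = 'Q1' OR time_period = 'Q2')
  AND (item = 'Mobile' OR item = 'Modem')
GROUP BY location, time_period, item;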
Dice
The dice operation defines
a subcube by performing
a selection on two or more
dimensions. For example,
applying the selection
(time = day 3 OR time =
day 4) AND (temperature
= cool OR temperature =
hot) to the original cube
yields a subcube that is
still two-dimensional.
Pivot (rotate)
Pivot reorients the cube for visualization, e.g., presenting a 3-D cube as a
series of 2-D planes.
Pivoting, or rotation, changes the perspective in presenting the data to the user.
[Figure: pivoting exchanges the orientation of the Product, Region, and Time axes]
Pivot (rotate)
Pivot, otherwise known as rotate, changes
the dimensional orientation of the cube,
i.e., rotates the data axes to view the data
from different perspectives.
Pivot groups data with different
dimensions.
Pivot (rotate)
OLAP Operations
[Figure: a data cube (Product, Region, Time) feeds a reporting tool, which produces a report for presentation]
Data Warehouse Schema
Data Warehouse environment usually transforms
the relational data model into some special
architectures.
Each Schema has a Fact table that stores all the
facts about the subject/measure.
Each fact is associated with multiple dimension
keys that are linked to Dimension Tables.
The most commonly used Data Warehouse
Schemas are:
Data Warehouse Schema
Star Schema
◦ Single Fact table with n –Dimension tables
linked to it.
Snowflake Schema
◦ Single Fact table with n-Dimension tables
organized as a hierarchy.
Fact Constellation Schema
◦ Multiple fact tables sharing dimension tables.
Star Schema
A fact table in the middle connected to a set of
dimension tables
A single, large and central fact table and one
table for each dimension.
Every fact points to one tuple in each of the
dimensions and has additional attributes.
Usually the fact tables in a star schema are in
third normal form (3NF), whereas dimension
tables are denormalized.
The star schema is the simplest architecture; it is
the most commonly used nowadays and is
recommended by Oracle.
Star Schema
Fact Tables
A fact table typically has two types of columns:
foreign keys to dimension tables, and measures,
which contain numeric facts.
Dimension Tables
A dimension is a structure usually composed of one
or more hierarchies that categorizes data. The
primary keys of each of the dimension tables are
part of the composite primary key of the fact table.
Dimension tables are generally smaller in size than
the fact table.
Star Schema Example
Store Dimension Fact Table Time Dimension
Store Key Store Key Period Key
Store Name Product Key Year
City Period Key
Quarter
Units
State Month
Price
Region
Product Key
Product Desc
Product Dimension
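A sketch of this schema in Oracle-flavored DDL (the column types are assumptions), showing how the dimension keys form the fact table's composite primary key:

CREATE TABLE store_dim (
  store_key  NUMBER PRIMARY KEY,
  store_name VARCHAR2(50),
  city       VARCHAR2(30),
  state      VARCHAR2(30),
  region     VARCHAR2(30)
);

CREATE TABLE time_dim (
  period_key NUMBER PRIMARY KEY,
  year       NUMBER,
  quarter    NUMBER,
  month      NUMBER
);

CREATE TABLE product_dim (
  product_key  NUMBER PRIMARY KEY,
  product_desc VARCHAR2(100)
);

-- the fact table references every dimension and combines their keys
CREATE TABLE sales_fact (
  store_key   NUMBER REFERENCES store_dim (store_key),
  product_key NUMBER REFERENCES product_dim (product_key),
  period_key  NUMBER REFERENCES time_dim (period_key),
  units       NUMBER,
  price       NUMBER(10,2),
  PRIMARY KEY (store_key, product_key, period_key)
);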
Fact Constellation Schema
[Figure: fact constellation schema; the shared Store dimension has Store Key, Store Name, City, State, Region]
Fact Constellation Example
Fact Constellation schema
Concept hierarchy
A concept hierarchy defines a sequence of
mappings from a set of low-level concepts to
higher-level, more general concepts.
A Concept Hierarchy example
[Figure: concept hierarchies rooted at "all"]
Case Study
XYZ Foods & Beverages is a new company
which produces dairy, bread, and meat products,
with its production unit located at Baroda.
Their products are sold in the North, North West,
and Western regions of India.
They have sales units at Mumbai, Pune,
Ahmedabad, Delhi, and Baroda.
The President of the company wants sales
information.
Sales Information
[Table: units of each product (e.g., Cheese, Swiss Rolls) sold over time]
Sales Information
Report: The number of items sold in each City for each
product with time
[Table: units of each product (e.g., Wheat Bread, Cheese, Swiss Rolls) sold in each city (e.g., Pune) over time]
Sales Information
Report: The number of items sold and income in each region for
each product with time.
Fact Table
City     Product      Month     Units  Rupees
Mumbai   Cheese       January       3    7.95
Mumbai   Swiss Rolls  January       4    7.32
Pune     Cheese       January       3    7.95
Pune     Swiss Rolls  January       4    7.32
Mumbai   Cheese       February     16   42.40
Sales Data Warehouse Model
Product Category Dimension Table
Product_Category_Id  Product_Category
1                    Milk
2                    Bread
3                    Cookies
Sales Data Warehouse Model
Region Dimension Table
[Figure: Sales Fact table linked to the Time, Product, Product Category, and Region dimensions]
Sales Data Warehouse Model: Snowflake Schema
Sales Data Warehouse Model
SELECT doctor, SUM(charge) FROM fee WHERE year = 2004 GROUP BY doctor;
Assignment 2
Design a data warehouse for a regional
weather bureau. The weather bureau has
about 1,000 probes, which are scattered
throughout various land and ocean locations
in the region to collect basic weather data,
including air pressure, temperature, and
precipitation at each hour. All data are sent
to the central station, which has collected
such data for over 10 years.
Assignment 2 solution
Since the weather bureau has about 1,000 probes scattered throughout various
land and ocean locations, we need to construct a spatial data warehouse so that a
user can view weather patterns on a map by month, by region, and by different
combinations of temperature and precipitation, and can dynamically drill down
or roll up along any dimension to explore desired patterns.
Assignment 3
Suppose that a data warehouse for Big University consists of the
following four dimensions: student, course, semester, and
instructor, and two measures count and avg grade. When at the
lowest conceptual level (e.g., for a given student, course, semester,
and instructor combination), the avg grade measure stores the
actual course grade of the student. At higher conceptual levels, avg
grade stores the average grade for the given student.
Draw a snowflake schema diagram for the data warehouse.
What specific OLAP operations should one perform in order to list
the average grade of CS courses for each Big University student?
To obtain the same list, write an SQL query assuming the data are
stored in a relational database with the schema big_university
(student, course, department, semester, instructor, grade).
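For the SQL part, a sketch of one possible query, assuming that CS courses are identified by the department column (an assumption about the data):

-- average grade per student over CS courses only
SELECT student, AVG(grade) AS avg_grade
FROM big_university
WHERE department = 'CS'
GROUP BY student;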
OLAP Server
In order to offer consistent, rapid response to queries
(regardless of database size and complexity), OLAP
needs to be implemented in a multi-user client/server
mode.
An OLAP server is a high-capacity, multi-user data
manipulation engine specifically designed to support
and operate on multidimensional data structures.
The server design and data structures are optimized for
rapid ad-hoc information retrieval in any orientation.
Types of OLAP Servers:
◦ MOLAP server
◦ ROLAP server
◦ HOLAP server
Multidimensional OLAP (MOLAP)
In MOLAP, data is stored in a multidimensional
cube and not in the relational database
It uses specialized data structures to organize,
navigate and analyze data
It uses array technology and efficient storage
techniques that minimize the disk space
requirements.
MOLAP differs significantly in that (in some
software) it requires the pre-computation and
storage of information in the cube — the operation
known as processing.
Multidimensional OLAP (MOLAP)
Advantages:
Excellent performance: MOLAP cubes are built for fast data
retrieval and are optimal for slicing and dicing operations.
It uses array technology and efficient storage techniques that
minimize the disk space requirements.
Can perform complex calculations: All calculations have been
pre-generated when the cube is created. Hence, complex
calculations are not only doable, but they return quickly.
MOLAP example:
Analysis and budgeting in a financial department
Sales analysis
Multidimensional OLAP (MOLAP)
Disadvantages:
Only a limited amount of data can be efficiently stored and
analyzed: because all calculations are performed when the cube
is built, it is not possible to include a large amount of data in the
cube itself.
Underlying data structures are limited in their ability to support
multiple subject areas and provide access to detailed data.
Storage, Navigation and analysis of data are limited because
the data is designed according to previously determined
requirements. Data may need to be physically reorganized to
optimally support new requirements.
Requires additional investment: cube technologies are often
proprietary and may not already exist in the organization.
Therefore, to adopt MOLAP technology, chances are additional
investments in human and capital resources are needed.
Relational OLAP (ROLAP)
ROLAP is a form of online analytical processing
that performs dynamic multidimensional analysis
of data stored in a relational database rather than
in a multidimensional database.
It is the fastest-growing type of OLAP tool.
It does not require the pre-computation and storage
of information.
ROLAP servers stand between the relational back-end
server and the client front-end tools.
Relational OLAP (ROLAP)
Advantages:
Can handle large amounts of data: ROLAP itself places no
limitation on data amount.
Can leverage functionalities inherent in the relational database:
often, the relational database already comes with a host of
functionalities. ROLAP technologies, since they sit on top of the
relational database, can therefore leverage these functionalities.
ROLAP technology tends to have greater scalability than
MOLAP technology
ROLAP Examples:
◦ Telecommunication startup: call data records (CDRs)
◦ ECommerce Site
◦ Credit Card Company
Relational OLAP (ROLAP)
Disadvantages:
Performance can be slow: Because each ROLAP
report is essentially a SQL query (or multiple SQL
queries) in the relational database, the query time can
be long if the underlying data size is large.
Limited by SQL functionalities: It is difficult to
perform complex calculations using SQL
Requires development of middleware to facilitate the
building of multidimensional applications, that is,
software that converts the two-dimensional relational
model into a multidimensional structure.
Hybrid OLAP (HOLAP)
Combines ROLAP and MOLAP technology.
Allows storing large volumes of detailed data in an RDBMS
and aggregated data in an MDBMS.
Users access the data via MOLAP tools.
Best of both worlds: the greater data capacity of ROLAP
with the superior processing capability of MOLAP.
◦ Benefits from the greater scalability of ROLAP
◦ Benefits from the faster computation of MOLAP
It stores data in both a relational database (RDB) and a
multidimensional database (MDD) and uses whichever
is best suited to the type of processing desired.
Hybrid OLAP (HOLAP)
Denormalization
A normalized design will often store
different but related pieces of information in
separate logical tables (called relations).
If these relations are stored physically as
separate disk files, completing a database
query that draws information from several
relations (a join operation) can be slow.
If many relations are joined, it may be
prohibitively slow
Denormalization
The solution is to denormalize tables.
Data from one table is included in another table in
order to eliminate the second table from queries,
which reduces the number of JOINs in a query and
thus improves performance.
It's important to point out that you don't need
denormalization if there are no performance
issues in the application.
Before going with it, consider other options, like
query optimization and proper indexing.
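A minimal sketch of the idea, using hypothetical order_line and product tables:

-- normalized: a join is needed to show the product name
SELECT o.order_id, p.product_name
FROM order_line o
JOIN product p ON o.product_id = p.product_id;

-- denormalized: product_name is copied into order_line, so no join
SELECT order_id, product_name FROM order_line;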
Denormalization Example 1
Example 2 normalized model
Example 2 normalized model
The user_account table stores data about users who login into our
application
The client table contains some basic data about our clients.
The product table lists products offered to our clients.
The task table contains all the tasks we have created. Each task is a set of
related actions towards clients. Each task has its related calls, meetings,
and lists of offered and sold products.
The call and meeting tables store data about all calls and meetings and
relates them with tasks and users.
The dictionaries task_outcome, meeting_outcome and call_outcome
contain all possible options for the final state of a task, meeting or call.
The product_offered stores a list of all products that were offered to
clients on certain tasks while product_sold contains a list of all the
products that client actually bought.
The supply_order table stores data about all orders we’ve placed and the
products_on_order table lists products and their quantity for specific
orders.
The writeoff table is a list of products that were written off.
Denormalized model
Denormalized model: product
The only change in the product table is the addition of the
units_in_stock attribute. In a normalized model we could
compute this value as units ordered – units sold – units
offered – units written off. We would repeat the calculation
each time a client asks for that product, which would be
extremely time consuming. Instead, we compute the
value up front; when a customer asks us, we have it
ready. Of course, this simplifies the select query a lot. On
the other hand, the units_in_stock attribute must be
adjusted after every insert, update, or delete in the
products_on_order, writeoff, product_offered and
product_sold tables.
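A sketch of that adjustment using the formula above (column names such as quantity and product_id are assumptions about the schema):

-- recompute the stock figure for one product after a change
UPDATE product p
SET units_in_stock =
    (SELECT COALESCE(SUM(quantity), 0) FROM products_on_order o WHERE o.product_id = p.id)
  - (SELECT COALESCE(SUM(quantity), 0) FROM product_sold s WHERE s.product_id = p.id)
  - (SELECT COALESCE(SUM(quantity), 0) FROM product_offered f WHERE f.product_id = p.id)
  - (SELECT COALESCE(SUM(quantity), 0) FROM writeoff w WHERE w.product_id = p.id)
WHERE p.id = :changed_product_id;  -- placeholder for the affected product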
Denormalized model : task
In the modified task table, we find two new
attributes: client_name and user_first_last_name.
Both of them store values as of when the task was
created, because both of these values can change
over time. We'll also keep a foreign key that relates
them to the original client and user ID.
There are more values that we would like to store,
like client address, VAT ID, etc.
Denormalized model
The denormalized product_offered table has two new
attributes, price_per_unit and price. The price_per_unit
attribute is stored because we need to store the actual price
when the product was offered. The normalized model would
only show its current state, so when the product price changes
our ‘history’ prices would also change. Our change doesn’t just
make the database run faster: it also makes it work better. The
price attribute is the computed value units_sold *
price_per_unit. I added it here to avoid making that calculation
each time we want to take a look at a list of offered products.
It’s a small cost, but it improves performance.
The changes made on the product_sold table are very similar.
The table structure is the same, but it stores a list of sold items.
Denormalized model
Denormalized model
The statistics_per_year table is completely new to our model. We
should look at it as a denormalized table because all its data can be
computed from the other tables. The idea behind this table is to
store the number of tasks, successful tasks, meetings and calls
related to any given client. It also stores the total amount charged per
year. After inserting, updating, or deleting anything in the
task, meeting, call and product_sold tables, we should recalculate
this table’s data for that client and corresponding year. We can
expect that we’ll mostly have changes only for the current year.
Reports for previous years shouldn’t need to change.
Values in this table are computed up front, so we’ll spend less time
and resources at the moment we need the calculation result.
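A sketch of one such recalculation (the counter column and key names are assumptions about the schema):

-- refresh the task counter for one client and year
UPDATE statistics_per_year s
SET tasks_count =
      (SELECT COUNT(*) FROM task t
       WHERE t.client_id = s.client_id
         AND EXTRACT(YEAR FROM t.task_date) = s.year)
WHERE s.client_id = :client_id  -- placeholders for the affected row
  AND s.year = :year;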
When to use denormalization
Maintaining history
Speeding up reporting
Disadvantages of Denormalization
Disk space: duplicated data requires extra disk space.
Data anomalies: we must update every piece of
duplicate data. That also applies to computed values and
reports. We can achieve this by using triggers,
transactions and/or procedures for all operations that must
be completed together.
Documentation: we must properly document every
denormalization rule that we have applied.
Slowing other operations: we can expect to slow
down data insert, modification, and deletion operations.
More coding: denormalization requires additional coding, but at
the same time it simplifies some SELECT queries a lot.