Data Warehouse Unit-2
A data warehouse is a collection of data marts representing historical data from different
operations in the company. It collects data from multiple heterogeneous sources such as
relational databases and flat or text files, and typically stores 5 to 10 years of data, often in
huge volumes. This data is stored in a structure optimized for querying and data analysis.
Listed below are some of the major differences between data warehouses and databases:
● A database is mostly utilized and built for recording data. A data warehouse, in
contrast, is useful for data analysis. The data warehouse is used for large analytical
queries, whereas databases are often geared for read-write operations when it comes to
single-point transactions.
● The database is basically a collection of application-oriented data. The data
warehouse, in contrast, is organized around subjects. While databases are often
confined to single applications and target just a single process at a time, data
warehouses store data from any number of applications and can cover as many
processes as needed.
● Another distinction between data warehouses and databases is that the latter is a
real-time data supplier, while the former acts as a source of historical data that may
be conveniently accessed for decision-making and analysis.
| Operational Systems | Data Warehousing Systems |
| --- | --- |
| Operational systems are designed to support high-volume transaction processing. | Data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP). |
| Operational systems are usually concerned with current data. | Data warehousing systems are usually concerned with historical data. |
| Data within operational systems is mainly updated regularly according to need. | Non-volatile: new data may be added regularly, but once added it is rarely changed. |
| Operational systems are designed for real-time business dealings and processes. | Data warehousing systems are designed for analysis of business measures by subject area, categories, and attributes. |
| Operational systems are optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table. | Data warehousing systems are optimized for bulk loads and large, complex, unpredictable queries that access many rows per table. |
| Operational systems are widely process-oriented. | Data warehousing systems are widely subject-oriented. |
| Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data. | Data warehousing systems are usually optimized to perform fast retrievals of relatively high volumes of data. |
| Relational databases are created for On-Line Transaction Processing (OLTP). | Data warehouses are designed for On-Line Analytical Processing (OLAP). |
Integrated: Data is gathered into the data warehouse from a variety of sources and
merged into a coherent whole.
Time-variant: All data in the data warehouse is identified with a particular time period.
Non-volatile: Data is stable in a data warehouse. More data is added but data is never removed.
It can be used for decision support, to manage and control the business, and by managers
and end-users to understand the business and make judgments.
It is a database designed for analytical tasks. Its content is periodically updated. It contains
current and historical data to provide a historical perspective of information.
The data warehouse architecture is based on a database management system server.
The data entered into the data warehouse is transformed into an integrated structure and format.
The transformation process involves conversion, summarization, filtering, and condensation.
The data warehouse must be capable of holding and managing large volumes of data, as well as
different data structures, over time.
1. Source layer: A data warehouse system uses heterogeneous sources of data. That data
is stored initially in corporate relational databases or legacy databases, or it may come
from an information system outside the corporate walls.
2. Data staging: The data stored in the sources should be extracted, cleansed to remove
inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one
standard schema. The so-called Extraction, Transformation, and Loading (ETL) tools
can combine heterogeneous schemata, and extract, transform, cleanse, validate, filter, and
load source data into a data warehouse.
3. Data transformation tools: They perform conversions, summarization, key changes, and
structural changes. The data transformation is required so that the data can be used by decision
support tools. The transformation produces programs and control statements, and moves the
data into the data warehouse from multiple operational systems. The functionalities of these
tools are listed below:
○ Removing unwanted data from operational databases
○ Converting to common data names and attributes
○ Calculating summaries and derived data
○ Establishing defaults for missing data
○ Accommodating source data definition changes
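As an illustration, here is a minimal Python sketch of a per-record transformation covering three of the functionalities above; all source and target field names (cust_nm, rgn, etc.) are hypothetical.

```python
# A minimal sketch of the transformation functionalities listed above.
# All source and target field names (cust_nm, rgn, etc.) are hypothetical.

def transform(record: dict) -> dict:
    """Convert one operational record into the warehouse format."""
    # Converting to common data names and attributes
    row = {
        "customer_name": record.get("cust_nm"),
        "region": record.get("rgn"),
        "quantity": record.get("qty", 0),        # default for missing data
        "unit_price": record.get("price", 0.0),  # default for missing data
    }
    # Establishing defaults for missing data
    if row["region"] is None:
        row["region"] = "UNKNOWN"
    # Calculating summaries and derived data
    row["total_amount"] = row["quantity"] * row["unit_price"]
    return row

# One row extracted from an operational system (no region recorded)
print(transform({"cust_nm": "Rohan", "qty": 2, "price": 150.0}))
```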
4. Metadata
It is data about data. It is used for maintaining, managing, and using the data warehouse. It is
classified into two types:
Technical Metadata: It contains information about warehouse data used by the warehouse
designer and administrator to carry out development and management tasks. It includes:
○ Information about data stores
○ Transformation descriptions, i.e., the mapping methods from operational databases to warehouse data
○ Warehouse object and data structure definitions for target data
○ The rules used to perform clean-up and data enhancement
○ Data mapping operations
○ Access authorization, backup history, archive history, information delivery history, data acquisition history, data access, etc.
Business Metadata: It contains information that gives users a business-oriented view of the data
stored in the data warehouse. It includes:
○ Subject areas and information object types, including queries, reports, images, video, and audio clips
○ Internet home pages
○ Information related to the information delivery system
○ Data warehouse operational information such as ownership, audit trails, etc.
Metadata helps users to understand the content and find the data. Metadata is stored in a separate
data store known as the information directory or metadata repository, which helps to
integrate, maintain, and view the contents of the data warehouse.
The information directory is the gateway to the data warehouse environment. It should:
○ Support easy distribution and replication of its content for high performance and availability
○ Be searchable by business-oriented keywords
○ Act as a launch platform for end users to access data and analysis tools
○ Support the sharing of information
○ Support end-user monitoring of the status of the data warehouse environment
5. Data marts
A data mart is an inexpensive alternative to the data warehouse and is based on a single subject
area. A data mart is used in the following situations:
6. Access tools
Their purpose is to provide information to business users for decision making. There are five categories:
○ Data query and reporting tools
○ Application development tools
○ Executive information system (EIS) tools
○ OLAP tools
○ Data mining tools
Query and reporting tools: used to generate queries and reports. There are two types of reporting
tools: production reporting tools, used to generate regular operational reports, and desktop
report writers, which are inexpensive desktop tools designed for end users.
Managed query tools: used to generate SQL queries. They use a meta layer of software between
users and databases which offers point-and-click creation of SQL statements.
Application development tools: a graphical data access environment which integrates
OLAP tools with the data warehouse and can be used to access all database systems.
OLAP tools: used to analyze the data in multidimensional and complex views.
Data mining tools: used to discover knowledge from the data warehouse data.
7. Data warehouse administration and management
These functions include:
○ Security and priority management
○ Monitoring updates from multiple sources
○ Data quality checks
○ Managing and updating metadata
○ Auditing and reporting data warehouse usage and status
○ Purging data
○ Replicating, subsetting, and distributing data
○ Backup and recovery
○ Data warehouse storage management, which includes capacity planning, hierarchical storage management, purging of aged data, etc.
What is ETL?
The mechanism of extracting information from source systems and bringing it into the data
warehouse is commonly called ETL, which stands for Extraction, Transformation and
Loading.
The ETL process requires active inputs from various stakeholders, including developers,
analysts, testers, and top executives, and is technically challenging.
To maintain its value as a tool for decision-makers, a data warehouse needs to change
with business changes. ETL is a recurring activity (daily, weekly, or monthly) of a data warehouse
system and needs to be agile, automated, and well documented.
How ETL Works?
Extraction
○ Extraction is the operation of extracting information from a source system for further use
in a data warehouse environment. This is the first stage of the ETL process.
○ Extraction process is often one of the most time-consuming tasks in the ETL.
○ The source systems might be complicated and poorly documented, and thus determining
which data needs to be extracted can be difficult.
○ The data has to be extracted several times in a periodic manner to supply all changed data
to the warehouse and keep it up-to-date.
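To make the idea of periodic, incremental extraction concrete, here is a minimal sketch using SQLite; the orders table, its last_modified column, and the watermark value are all hypothetical.

```python
import sqlite3

# A watermark (the timestamp of the previous run) lets each periodic run
# extract only the rows that changed since then.
def extract_changed_rows(conn: sqlite3.Connection, watermark: str):
    cur = conn.execute(
        "SELECT order_id, amount, last_modified FROM orders "
        "WHERE last_modified > ?", (watermark,))
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, amount REAL, last_modified TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 99.0, '2024-01-01'), (2, 45.0, '2024-03-01')")

# Only order 2 changed after the last extraction on 2024-02-01
print(extract_changed_rows(conn, "2024-02-01"))  # [(2, 45.0, '2024-03-01')]
```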
Cleansing
The cleansing stage is crucial in a data warehouse technique because it is supposed to improve
data quality. The primary data cleansing features found in ETL tools are rectification and
homogenization. They use specific dictionaries to rectify typing mistakes and to recognize
synonyms, as well as rule-based cleansing to enforce domain-specific rules and define
appropriate associations between values.
If an enterprise wishes to contact its users or its suppliers, a complete, accurate and up-to-date
list of contact addresses, email addresses and telephone numbers must be available.
If a client or supplier calls, the staff responding should be able to quickly find the person in the
enterprise database, but this requires that the caller's name or his/her company name is listed in
the database.
If a user appears in the database with two or more slightly different names or different account
numbers, it becomes difficult to update the customer's information.
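A minimal Python sketch of dictionary-based rectification and homogenization as described above; the dictionaries and city names are illustrative, not taken from any real tool.

```python
# Rectification corrects typing mistakes; homogenization maps synonyms
# to one standard form. Both dictionaries here are illustrative.
TYPO_DICT = {"Mumbay": "Mumbai", "Dehli": "Delhi"}
SYNONYM_DICT = {"Bombay": "Mumbai", "Calcutta": "Kolkata"}

def cleanse_city(raw: str) -> str:
    city = raw.strip().title()
    city = TYPO_DICT.get(city, city)     # rectification
    return SYNONYM_DICT.get(city, city)  # homogenization

# Three spellings of the same city collapse to a single value, so the
# customer no longer appears under several slightly different records.
print([cleanse_city(c) for c in ["mumbay", "Bombay ", "Mumbai"]])
# ['Mumbai', 'Mumbai', 'Mumbai']
```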
Transformation
Transformation is the core of the reconciliation phase. It converts records from their operational
source format into a particular data warehouse format. If we implement a three-layer
architecture, this phase outputs our reconciled data layer.
○ Loose texts may hide valuable information. For example, "XYZ PVT Ltd" does not
explicitly show that this is a limited partnership company.
○ Different formats can be used for individual data items. For example, a date can be saved
as a string or as three integers.
Following are the main transformation processes aimed at populating the reconciled data layer:
○ Conversion and normalization that operate on both storage formats and units of measure
to make data uniform.
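For instance, a minimal sketch of such a conversion, normalizing a date that arrives either as a "DD/MM/YYYY" string or as three integers (both source formats are assumptions for illustration):

```python
from datetime import date

# The same order date may arrive as a "DD/MM/YYYY" string from one source
# and as three integers (day, month, year) from another; both are
# normalized to a single warehouse format (an ISO date string).
def normalize_date(value) -> str:
    if isinstance(value, str):
        day, month, year = map(int, value.split("/"))
    else:
        day, month, year = value
    return date(year, month, day).isoformat()

print(normalize_date("25/12/2023"))    # 2023-12-25
print(normalize_date((25, 12, 2023)))  # 2023-12-25
```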
Cleansing and Transformation processes are often closely linked in ETL tools.
Loading
The load is the process of writing the data into the target database. During the load step, it is
necessary to ensure that the load is performed correctly and with as few resources as possible.
Two basic loading modes exist:
1. Refresh: The data warehouse data is completely rewritten, replacing the older data.
Refresh is usually used in combination with static extraction to populate the data
warehouse initially.
2. Update: Only those changes applied to the source information are added to the data
warehouse. An update is typically carried out without deleting or modifying pre-existing
data. This method is used in combination with incremental extraction to update data
warehouses regularly.
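A minimal sketch of the update mode as an upsert in SQLite; the dw_sales table and its columns are hypothetical, and the ON CONFLICT syntax requires SQLite 3.24 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw_sales (order_id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO dw_sales VALUES (1, 99.0)")

# Update mode: changed source rows are merged in without deleting
# pre-existing data; matching rows are updated in place.
changed_rows = [(1, 120.0), (2, 45.0)]
conn.executemany(
    "INSERT INTO dw_sales (order_id, amount) VALUES (?, ?) "
    "ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount",
    changed_rows)

print(conn.execute("SELECT * FROM dw_sales ORDER BY order_id").fetchall())
# [(1, 120.0), (2, 45.0)]
```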
Selection of an appropriate ETL tool is an important decision that has to be made when
implementing an ODS or data warehousing application. The ETL tools are required to
provide coordinated access to multiple data sources so that relevant data may be extracted from
them. An ETL tool would generally contain tools for data cleansing, re-organization,
transformation, aggregation, calculation, and automatic loading of information into the target
database.
An ETL tool should provide a simple user interface that allows data cleansing and data
transformation rules to be specified using a point-and-click approach. When all mappings and
transformations have been defined, the ETL tool should automatically generate the data
extract/transformation/load programs, which typically run in batch mode.
8. It gives a useful data administration tool to manage corporate information assets with the
data dictionary.
Data Mart
The fundamental use of a data mart is for Business Intelligence (BI) applications. BI is used to
gather, store, access, and analyze records. Smaller businesses can use a data mart to utilize the
data they have accumulated, since it is less expensive than implementing a data warehouse.
○ Ease of creation
○ Potential clients are more clearly defined than in a comprehensive data warehouse
A dependent data mart is a logical subset or a physical subset of a higher-level data warehouse.
According to this technique, the data marts are treated as the subsets of a data warehouse. In this
technique, a data warehouse is created first, from which various data marts can then be
created. These data marts are dependent on the data warehouse and extract the essential records
from it. In this technique, since the data warehouse creates the data marts, there is no need
for data mart integration. It is also known as a top-down approach.
Other than these two categories, one more type exists, called "hybrid data marts."
A hybrid data mart allows us to combine input from sources other than a data warehouse. This could
be helpful in many situations, especially when ad hoc integrations are needed, such as after a new
group or product is added to the organization.
| Data Warehouse | Data Mart |
| --- | --- |
| It may hold multiple subject areas. | It holds only one subject area. For example, Finance or Sales. |
| It holds very detailed information. | It may hold more summarized data. |
| In data warehousing, the fact constellation schema is used. | In a data mart, the star schema and snowflake schema are used. |
Dimensional modeling
Dimensional modeling represents data with a cube operation, making the logical data
representation more suitable for OLAP data management. The concept of dimensional modeling was
developed by Ralph Kimball and consists of "fact" and "dimension" tables.
In dimensional modeling, the transaction record is divided into either "facts," which are
frequently numerical transaction data, or "dimensions," which are the reference information that
gives context to the facts. For example, a sale transaction can be broken down into facts such as the
number of products ordered and the price paid for the products, and into dimensions such as
order date, user name, product number, order ship-to and bill-to locations, and the salesperson
responsible for receiving the order.
Dimensional modeling has two goals:
1. To produce a database architecture that is easy for end clients to understand and write
queries against.
2. To maximize the efficiency of queries. It achieves these goals by minimizing the number
of tables and the relationships between them.
Advantages of Dimensional Modeling
Following are the benefits of dimensional modeling:
Dimensional modeling promotes data quality: The star schema enables warehouse
administrators to enforce referential integrity checks on the data warehouse. Since the fact
table's key is a concatenation of the keys of its associated dimensions, a fact record
is only loaded if the corresponding dimension records are duly defined and also exist in
the database.
By enforcing foreign key constraints as a form of referential integrity check, data warehouse
DBAs add a line of defense against corrupted warehouse data.
Performance optimization is possible through aggregates: As the size of the data warehouse
increases, performance optimization develops into a pressing concern. Customers who have to
wait for hours to get a response to a query will quickly become discouraged with the warehouse.
Aggregates are one of the easiest methods by which query performance can be optimized.
Disadvantages of Dimensional Modeling
1. To maintain the integrity of facts and dimensions, loading the data warehouse with
records from various operational systems is complicated.
2. It is difficult to modify the data warehouse operation if the organization adopting the
dimensional technique changes the method in which it does business.
Fact
It is a collection of associated data items, consisting of measures and context data. It typically
represents business items or business transactions.
Dimensions
It is a collection of data which describes one business dimension. Dimensions decide the
contextual background for the facts, and they are the framework over which OLAP is performed.
Measure
It is a numeric attribute of a fact, representing the performance or behavior of the business
relative to the dimensions.
Considering the relational context, there are two basic models which are used in dimensional
modeling:
○ Star Model
○ Snowflake Model
The star model is the underlying structure for a dimensional model. It has one broad central table
(the fact table) and a set of smaller tables (the dimensions) arranged in a radial design around the
central table. The snowflake model is the result of decomposing one or more of the
dimensions.
Fact Table
Fact tables are used to store facts or measures of the business. Facts are the numeric data elements
that are of interest to the company.
The fact table includes the numerical values of what we measure. For example, a fact value of 20
might mean that 20 widgets have been sold.
Each fact table includes the keys to associated dimension tables. These are known as foreign
keys in the fact table.
Compared to dimension tables, fact tables have a large number of rows.
Dimension Table
Dimension tables establish the context of the facts. Dimensional tables store fields that describe
the facts.
The dimension tables include descriptive data about the numerical values in the fact table. That
is, they contain the attributes of the facts. For example, the dimension tables for a marketing
analysis function might include attributes such as time, marketing region, and product type.
Since the record in a dimension table is denormalized, it usually has a large number of columns.
The dimension tables include significantly fewer rows of information than the fact table.
The attributes in a dimension table are used as row and column headings in a document or query
results display.
Example: A store summary in a fact table can be viewed by city and state, an item summary can be
viewed by brand, color, etc., and customer information can be viewed by name and address.
Fact Table

| Time ID | Product ID | Customer ID | Units Sold |
| --- | --- | --- | --- |
| 4 | 17 | 2 | 1 |
| 8 | 21 | 3 | 2 |
| 8 | 4 | 1 | 1 |
In this example, the Customer ID column in the fact table is the foreign key that joins with the
dimension table. By following the links, we can see that row 2 of the fact table records the fact
that customer 3, Gaurav, bought two items on day 8.
Dimension Table (Customer)

| Customer ID | Name | Gender | Income | Education | Region |
| --- | --- | --- | --- | --- | --- |
| 1 | Rohan | Male | 2 | 3 | 4 |
| 2 | Sandeep | Male | 3 | 5 | 1 |
| 3 | Gaurav | Male | 1 | 7 | 3 |
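A minimal sketch of the join described above, using SQLite; only the columns needed for the example are modeled, and the table names (fact_sales, dim_customer) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (time_id INT, product_id INT, customer_id INT, units_sold INT);
INSERT INTO fact_sales VALUES (4, 17, 2, 1), (8, 21, 3, 2), (8, 4, 1, 1);
CREATE TABLE dim_customer (customer_id INT PRIMARY KEY, name TEXT, gender TEXT);
INSERT INTO dim_customer VALUES (1, 'Rohan', 'Male'), (2, 'Sandeep', 'Male'), (3, 'Gaurav', 'Male');
""")

# The customer_id foreign key links each fact row to its dimension row.
rows = conn.execute("""
    SELECT f.time_id, c.name, f.units_sold
    FROM fact_sales f
    JOIN dim_customer c ON f.customer_id = c.customer_id
    WHERE c.name = 'Gaurav'
""").fetchall()
print(rows)  # [(8, 'Gaurav', 2)] -> Gaurav bought two items on day 8
```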
Hierarchy
A hierarchy is a directed tree whose nodes are dimensional attributes and whose arcs model
many-to-one associations between dimensional attributes. It contains a dimension, positioned at
the tree's root, and all of the dimensional attributes that describe it.
Multidimensional model
A multidimensional model views data in the form of a data-cube. A data cube enables data to be
modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
The dimensions are the perspectives or entities concerning which an organization keeps records.
For example, a shop may create a sales data warehouse to keep records of the store's sales for the
dimensions time, item, and location. These dimensions allow the store to keep track of things, for
example, monthly sales of items and the locations at which the items were sold. Each dimension
has a table related to it, called a dimension table, which describes the dimension further. For
example, a dimension table for an item may contain the attributes item_name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales. This
theme is represented by a fact table. Facts are numerical measures. The fact table contains the
names of the facts or measures of the related dimensional tables.
Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in
the table. In this 2D representation, the sales for Delhi are shown for the time dimension
(organized in quarters) and the item dimension (classified according to the types of items sold).
The fact or measure displayed is rupees_sold (in thousands).
Now, suppose we want to view the sales data with a third dimension. For example, suppose the data
according to time and item, as well as location, is considered for the cities Chennai, Kolkata,
Mumbai, and Delhi. These 3D data are shown in the table. The 3D data of the table are
represented as a series of 2D tables.
Conceptually, the same data may also be represented in the form of a 3D data cube, as
shown in fig:
Data that is grouped or combined in multidimensional matrices is called a data cube. The data
cube method has a few alternative names or variants, such as "multidimensional
databases," "materialized views," and "OLAP (On-Line Analytical Processing)."
The general idea of this approach is to materialize certain expensive computations that are
frequently inquired.
For example, a relation with the schema sales (part, supplier, customer, and sale-price) can be
materialized into a set of eight views as shown in fig, where psc indicates a view consisting of
aggregate function value (such as total-sales) computed by grouping three attributes part,
supplier, and customer, p indicates a view composed of the corresponding aggregate function
values calculated by grouping part alone, etc.
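A minimal sketch that enumerates these eight views by generating one aggregate query per subset of {part, supplier, customer}; the sales table and sale_price column follow the schema named above, and the generated SQL is only printed, not executed.

```python
from itertools import combinations

# Each subset of {part, supplier, customer} defines one aggregate view;
# there are 2^3 = 8 of them, matching psc, ps, pc, sc, p, s, c and the
# grand total in the text.
dims = ("part", "supplier", "customer")
for r in range(len(dims), -1, -1):
    for group in combinations(dims, r):
        select = ", ".join(group + ("SUM(sale_price) AS total_sales",))
        sql = f"SELECT {select} FROM sales"
        if group:
            sql += " GROUP BY " + ", ".join(group)
        print(sql)
```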
A data cube is created from a subset of attributes in the database. Specific attributes are chosen to
be measure attributes, i.e., the attributes whose values are of interest. Other attributes are
selected as dimensions or functional attributes. The measure attributes are aggregated according
to the dimensions.
For example, XYZ may create a sales data warehouse to keep records of the store's sales for the
dimensions time, item, branch, and location. These dimensions enable the store to keep track of
things like monthly sales of items, and the branches and locations at which the items were sold.
Each dimension may have a table associated with it, known as a dimension table, which describes
the dimension. For example, a dimension table for items may contain the attributes item_name,
brand, and type.
Data cube method is an interesting technique with many applications. Data cubes could be sparse
in many cases because not every cell in each dimension may have corresponding data in the
database.
If a query contains constants at even lower levels than those provided in a data cube, it is not
clear how to make the best use of the precomputed results stored in the data cube.
The model views data in the form of a data cube. OLAP tools are based on the multidimensional
data model. Data cubes usually model n-dimensional data.
A data cube enables data to be modeled and viewed in multiple dimensions. A multidimensional
data model is organized around a central theme, like sales and transactions. A fact table
represents this theme. Facts are numerical measures. Thus, the fact table contains measures (such
as Rs_sold) and keys to each of the related dimensional tables.
A data cube is defined by its dimensions and facts. Facts are generally quantities, which are used for
analyzing the relationships between dimensions.
Example: In the 2-D representation, we will look at the All Electronics sales data for items
sold per quarter in the city of Vancouver. The measure displayed is dollars_sold (in thousands).
3-Dimensional Cuboids
Suppose we would like to view the sales data with a third dimension. For example, suppose
we would like to view the data according to time and item, as well as location, for the cities
Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars_sold (in
thousands). These 3-D data are shown in the table. The 3-D data of the table are represented as a
series of 2-D tables.
Conceptually, we may represent the same data in the form of 3-D data cubes, as shown in fig:
Let us suppose that we would like to view our sales data with an additional fourth dimension,
such as a supplier.
In data warehousing, the data cubes are n-dimensional. The cuboid which holds the lowest level
of summarization is called a base cuboid.
For example, the 4-D cuboid in the figure is the base cuboid for the given time, item, location,
and supplier dimensions.
The figure shows a 4-D data cube representation of sales data, according to the dimensions time,
item, location, and supplier. The measure displayed is dollars_sold (in thousands).
The topmost 0-D cuboid, which holds the highest level of summarization, is known as the apex
cuboid. In this example, this is the total sales, or dollars sold, summarized over all four
dimensions.
The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids creating a 4-D data
cube for the dimensions time, item, location, and supplier. Each cuboid represents a different
degree of summarization.
Schemas of the Dimensional Model
There are three schemas used to design a dimensional model:
1. Star schema
2. Snowflake schema
3. Fact constellation
Star Schema
A star schema is the elementary form of a dimensional model, in which data are organized into
facts and dimensions. A fact is an event that is counted or measured, such as a sale or log in. A
dimension includes reference data about the fact, such as date, item, or customer.
A star schema is a relational schema whose design represents a
multidimensional data model. The star schema is the simplest data warehouse schema. It is known
as a star schema because the entity-relationship diagram of this schema resembles a star, with
points diverging from a central table. The center of the schema consists of a large fact table, and
the points of the star are the dimension tables.
Fact Tables
A fact table is a table in a star schema which contains facts and is connected to dimensions. A fact
table has two types of columns: those that contain facts and those that are foreign keys to the
dimension tables. The primary key of the fact table is generally a composite key that is made up of
all of its foreign keys.
A fact table might contain either detail-level facts or facts that have been aggregated (fact tables
that contain aggregated facts are often instead called summary tables). A fact table generally
contains facts at the same level of aggregation.
Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that categorize data.
If a dimension has no hierarchies and levels, it is called a flat dimension or list. The
primary keys of each of the dimension tables are part of the composite primary key of the fact
table. Dimensional attributes help to describe the dimensional value. They are generally
descriptive, textual values. Dimension tables are usually smaller in size than fact tables.
Fact tables store data about sales, while dimension tables store data about the geographic regions
(markets, cities), clients, products, times, and channels.
○ It provides a flexible design that can be changed easily or added to throughout the
development cycle, and as the database grows.
○ It provides a parallel in design to how end-users typically think of and use the data.
Because a star schema database has a small number of tables and clear join paths, queries run faster
than they do against OLTP systems. Small single-table queries, frequently of a dimension table,
are almost instantaneous. Large join queries that contain multiple tables take only seconds or
minutes to run.
In a star schema database design, the dimensions are connected only through the central fact table.
When two dimension tables are used in a query, only one join path, intersecting the fact table,
exists between those two tables. This design feature enforces accurate and consistent query
results.
Example: Suppose a star schema is composed of a fact table, SALES, and several dimension
tables connected to it for time, branch, item, and geographic locations.
The TIME table has columns for day, month, quarter, and year. The ITEM table has
columns for item_key, item_name, brand, type, and supplier_type. The BRANCH table has
columns for branch_key, branch_name, and branch_type. The LOCATION table has columns of
geographic data, including street, city, state, and country.
In this scenario, the SALES table contains only four columns with IDs from the dimension
tables, TIME, ITEM, BRANCH, and LOCATION, instead of four columns for time data, four
columns for ITEM data, three columns for BRANCH data, and four columns for LOCATION
data. Thus, the size of the fact table is significantly reduced. When we need to change an item,
we need only make a single change in the dimension table, instead of making many changes in
the fact table.
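A minimal sketch of this star schema in SQLite, plus one typical query that joins through the central fact table; the exact key names and the measure columns (rupees_sold, units_sold) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INT, month INT, quarter INT, year INT);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, state TEXT, country TEXT);
CREATE TABLE sales (
    time_key     INT REFERENCES time_dim,
    item_key     INT REFERENCES item,
    branch_key   INT REFERENCES branch,
    location_key INT REFERENCES location,
    rupees_sold  REAL,
    units_sold   INT
);
""")

# A typical star-schema query: every join path passes through the
# central SALES fact table.
query = """
SELECT l.city, t.quarter, SUM(s.rupees_sold) AS total_sales
FROM sales s
JOIN time_dim t ON s.time_key = t.time_key
JOIN location l ON s.location_key = l.location_key
GROUP BY l.city, t.quarter
"""
print(conn.execute(query).fetchall())  # [] until fact rows are loaded
```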
Snowflake Schema
A snowflake schema is a variant of the star schema. "A schema is known as a snowflake if one
or more dimension tables do not connect directly to the fact table but must join through other
dimension tables."
The snowflake schema is an expansion of the star schema where each point of the star explodes
into more points. It is called snowflake schema because the diagram of snowflake schema
resembles a snowflake. Snowflaking is a method of normalizing the dimension tables in a STAR
schemas. When we normalize all the dimension tables entirely, the resultant structure resembles
a snowflake with the fact table in the middle.
Snowflaking is used to improve the performance of specific queries. The schema is diagrammed
with each fact surrounded by its associated dimensions, and those dimensions are related to other
dimensions, branching out into a snowflake pattern.
The snowflake schema consists of one fact table which is linked to many dimension tables,
which can be linked to other dimension tables through a many-to-one relationship. Tables in a
snowflake schema are generally normalized to the third normal form. Each dimension table
represents exactly one level in a hierarchy.
The following diagram shows a snowflake schema with two dimensions, each having three
levels. A snowflake schema can have any number of dimensions, and each dimension can have
any number of levels.
Example: Figure shows a snowflake schema with a Sales fact table, with Store, Location, Time,
Product, Line, and Family dimension tables. The Market dimension has two dimension tables
with Store as the primary dimension table, and Location as the outrigger dimension table. The
product dimension has three dimension tables with Product as the primary dimension table, and
the Line and Family table are the outrigger dimension tables.
A star schema stores all attributes for a dimension in one denormalized table. This requires more
disk space than a more normalized snowflake schema. Snowflaking normalizes the dimension by
moving attributes with low cardinality into separate dimension tables that relate to the core
dimension table by using foreign keys. Snowflaking for the sole purpose of minimizing disk
space is not recommended, because it can adversely impact query performance.
In a snowflake schema, tables are normalized to remove redundancy: dimension tables
are broken down into multiple dimension tables.
The figure shows a simple STAR schema for sales in a manufacturing company. The sales fact table
includes quantity, price, and other relevant metrics. SALESREP, CUSTOMER, PRODUCT, and
TIME are the dimension tables.
The STAR schema for sales, as shown above, contains only five tables, whereas the normalized
(snowflake) version extends to eleven tables. We will notice that in the snowflake schema, the
attributes with low cardinality in each original dimension table are removed to form separate tables.
These new tables are connected back to the original dimension table through artificial keys.
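A minimal DDL sketch of this idea, contrasting a denormalized star dimension with its snowflaked version; the product/brand tables and all column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star version: one denormalized PRODUCT dimension; the low-cardinality
-- brand attributes repeat on every product row.
CREATE TABLE product_star (
    product_key   INTEGER PRIMARY KEY,
    product_name  TEXT,
    brand_name    TEXT,
    brand_country TEXT
);

-- Snowflaked version: the brand attributes move to a separate table,
-- connected back to PRODUCT through an artificial key.
CREATE TABLE brand (
    brand_key     INTEGER PRIMARY KEY,
    brand_name    TEXT,
    brand_country TEXT
);
CREATE TABLE product_snow (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    brand_key    INT REFERENCES brand(brand_key)
);
""")
print("star and snowflake dimension tables created")
```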
A snowflake schema is designed for flexible querying across more complex dimensions and
relationships. It is suitable for many-to-many and one-to-many relationships between dimension
levels.
The fact constellation schema describes a logical structure of a data warehouse or data mart. A fact
constellation schema can be designed with a collection of de-normalized fact, shared, and
conformed dimension tables.
The primary disadvantage of the fact constellation schema is that it is a more challenging design
because many variants for specific kinds of aggregation must be considered and selected.
Information processing deals with querying, statistical analysis, and reporting via tables, charts, or
graphs. Nowadays, information processing of a data warehouse is done by constructing low-cost,
web-based accessing tools typically integrated with web browsers.
Analytical Processing
It supports various online analytical processing such as drill-down, roll-up, and pivoting. The
historical data is being processed in both summarized and detailed format.
OLAP is implemented on data warehouses or data marts. The primary objective of OLAP is to
support the ad-hoc querying needed to support DSS. The multidimensional view of data is
fundamental to the OLAP application. OLAP is an operational view, not a data structure or
schema. The complex nature of OLAP applications requires a multidimensional view of the data.
Data Mining
It helps in the analysis of hidden patterns and associations, constructing analytical models,
performing classification and prediction, and presenting the mining results using visualization
tools.
Data mining is the technique of discovering essential new correlations, patterns, and trends by
sifting through large amounts of data stored in repositories, using pattern recognition
technologies as well as statistical and mathematical techniques.
It is the process of selection, exploration, and modeling of huge quantities of data to
determine regularities or relations that are at first unknown, in order to obtain precise and useful
results for the owner of the database.
OLAP implements the multidimensional analysis of business information and supports the
capability for complex estimations, trend analysis, and sophisticated data modeling. It is rapidly
becoming the essential foundation for intelligent solutions including Business Performance
Management, Planning, Budgeting, Forecasting, Financial Reporting, Analysis,
Simulation Models, Knowledge Discovery, and Data Warehouse Reporting. OLAP enables
end clients to perform ad hoc analysis of data in multiple dimensions, providing the insight
and understanding they require for better decision making.
○ Budgeting
○ Activity-based costing
○ Customer analysis
Production
○ Production planning
○ Defect analysis
OLAP cubes have two main purposes. The first is to provide business users with a data model
more intuitive to them than a tabular model. This model is called a Dimensional Model.
The second purpose is to enable fast query response that is usually difficult to achieve using
tabular models.
3) Accessibility: It provides access only to the data that is actually required to perform the
particular analysis, present a single, coherent, and consistent view to the clients. The OLAP
system must map its own logical schema to the heterogeneous physical data stores and perform
any necessary transformations. The OLAP operations should be sitting between data sources
(e.g., data warehouses) and an OLAP front-end.
4) Consistent Reporting Performance: To make sure that the users do not feel any significant
degradation in documenting performance as the number of dimensions or the size of the database
increases. That is, the performance of OLAP should not suffer as the number of dimensions is
increased. Users must observe consistent run time, response time, or machine utilization every
time a given query is run.
5) Client/Server Architecture: Make the server component of OLAP tools sufficiently
intelligent that various clients can be attached with a minimum of effort and integration
programming. The server should be capable of mapping and consolidating data between
dissimilar databases.
7) Dynamic Sparse Matrix Handling: Adapt the physical schema to the specific analytical
model being created and loaded so that sparse matrix handling is optimized. When encountering a
sparse matrix, the system must be able to dynamically deduce the distribution of the information
and adjust the storage and access to obtain and maintain a consistent level of performance.
8) Multiuser Support: OLAP tools must provide concurrent data access, data integrity, and
access security.
10) Intuitive Data Manipulation: Data manipulation fundamental to the consolidation path,
such as reorientation (pivoting), drill-down and roll-up, and other manipulations, should be
accomplished naturally and precisely via point-and-click and drag-and-drop methods on the cells
of the analytical model. It avoids the use of a menu or multiple trips to a user interface.
11) Flexible Reporting: It gives business clients the ability to organize columns,
rows, and cells in a manner that facilitates simple manipulation, analysis, and synthesis of data.
12) Unlimited Dimensions and Aggregation Levels: The number of data dimensions should be
unlimited. Each of these common dimensions must allow a practically unlimited number of
customer-defined aggregation levels within any given consolidation path.
Roll-Up
The roll-up operation (also known as the drill-up or aggregation operation) performs aggregation
on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension
reduction. Roll-up is like zooming out on the data cube. The figure shows the result of roll-up
operations performed on the dimension location. The hierarchy for the location is defined as the
order street < city < province or state < country. The roll-up operation aggregates the data by
ascending the location hierarchy from the level of the city to the level of the country.
When a roll-up is performed by dimension reduction, one or more dimensions are removed
from the cube. For example, consider a sales data cube having two dimensions, location and
time. Roll-up may be performed by removing the time dimension, resulting in an aggregation
of the total sales by location, rather than by location and by time.
Example
Consider the following cube illustrating the temperature of certain days recorded weekly:

| Temperature | 64 | 65 | 68 | 69 | 70 | 71 | 72 | 75 | 80 | 81 | 83 | 85 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Week1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Week2 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 2 | 0 | 1 | 0 | 0 |
Consider that we want to set up levels (hot (80-85), mild (70-75), cool (64-69)) in temperature
from the above cube.
To do this, we have to group the columns and add up the values according to the concept
hierarchy. This operation is known as a roll-up. The result is:

| | cool | mild | hot |
| --- | --- | --- | --- |
| Week1 | 2 | 1 | 1 |
| Week2 | 1 | 3 | 1 |
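The same grouping can be checked with a few lines of Python; this sketch simply sums each week's counts within the cool/mild/hot ranges defined above.

```python
# Roll-up of the weekly temperature cube: occurrence counts are grouped
# into the cool/mild/hot levels of the concept hierarchy and summed.
temps = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
week1 = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0]
week2 = [0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 0, 0]

def level(t):
    return "cool" if t <= 69 else ("mild" if t <= 75 else "hot")

def roll_up(counts):
    totals = {"cool": 0, "mild": 0, "hot": 0}
    for t, c in zip(temps, counts):
        totals[level(t)] += c
    return totals

print("Week1", roll_up(week1))  # {'cool': 2, 'mild': 1, 'hot': 1}
print("Week2", roll_up(week2))  # {'cool': 1, 'mild': 3, 'hot': 1}
```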
Drill-Down
The drill-down operation (also called roll-down) is the reverse of roll-up. Drill-down
is like zooming in on the data cube. It navigates from less detailed data to more detailed data.
Drill-down can be performed by either stepping down a concept hierarchy for a dimension or
adding additional dimensions.
Figure shows a drill-down operation performed on the dimension time by stepping down a
concept hierarchy which is defined as day, month, quarter, and year. Drill-down appears by
descending the time hierarchy from the level of the quarter to a more detailed level of the month.
Because a drill-down adds more details to the given data, it can also be performed by adding a
new dimension to a cube. For example, a drill-down on the central cubes of the figure can occur
by introducing an additional dimension, such as a customer group.
Example
| Day | cool | mild | hot |
| --- | --- | --- | --- |
| Day 1 | 0 | 0 | 0 |
| Day 2 | 0 | 0 | 0 |
| Day 3 | 0 | 0 | 1 |
| Day 4 | 0 | 1 | 0 |
| Day 5 | 1 | 0 | 0 |
| Day 6 | 0 | 0 | 0 |
| Day 7 | 1 | 0 | 0 |
| Day 8 | 0 | 0 | 0 |
| Day 9 | 1 | 0 | 0 |
| Day 10 | 0 | 1 | 0 |
| Day 11 | 0 | 1 | 0 |
| Day 12 | 0 | 1 | 0 |
| Day 13 | 0 | 0 | 1 |
| Day 14 | 0 | 0 | 0 |
Slice
A slice is a subset of the cube corresponding to a single value for one or more members of a
dimension. For example, a slice operation is executed when the customer wants a selection on
one dimension of a three-dimensional cube, resulting in a two-dimensional slice. So, the slice
operation performs a selection on one dimension of the given cube, thus resulting in a subcube.
| Day | cool |
| --- | --- |
| Day 1 | 0 |
| Day 2 | 0 |
| Day 3 | 0 |
| Day 4 | 0 |
| Day 5 | 1 |
| Day 6 | 1 |
| Day 7 | 1 |
| Day 8 | 1 |
| Day 9 | 1 |
| Day 11 | 0 |
| Day 12 | 0 |
| Day 13 | 0 |
| Day 14 | 0 |
Dice
The dice operation defines a subcube by performing a selection on two or more dimensions.
For example, applying the selection (time = day 3 OR time = day 4) AND (temperature = cool
OR temperature = hot) to the original cube, we get the following subcube (still two-dimensional):
| Day | cool | hot |
| --- | --- | --- |
| Day 3 | 0 | 1 |
| Day 4 | 0 | 0 |
The dice operation on the cube based on the following selection criteria involves three
dimensions:
○ (location = "Toronto" or "Vancouver")
Pivot
The pivot operation is also called rotation. Pivot is a visualization operation which rotates the
data axes in view in order to provide an alternative presentation of the data. It may involve swapping
the rows and columns, or moving one of the row dimensions into the column dimensions.
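A minimal sketch of a pivot with pandas; the sales figures and city/quarter values are made up for illustration.

```python
import pandas as pd

# Pivot (rotation): the same sales data viewed with cities as rows and
# quarters as columns, then with the axes swapped. Figures are made up.
sales = pd.DataFrame({
    "city":        ["Delhi", "Delhi", "Mumbai", "Mumbai"],
    "quarter":     ["Q1", "Q2", "Q1", "Q2"],
    "rupees_sold": [605, 825, 680, 952],
})

by_city = sales.pivot_table(index="city", columns="quarter",
                            values="rupees_sold", aggfunc="sum")
print(by_city)    # cities as rows, quarters as columns
print(by_city.T)  # rotated: quarters as rows, cities as columns
```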
Other OLAP operations may include ranking the top-N or bottom-N elements in lists, as well as
calculating moving averages, growth rates, interest, internal rates of return, depreciation,
currency conversions, and statistical tasks.
OLAP offers analytical modeling capabilities, including a calculation engine for determining
ratios, variances, etc., and for computing measures across multiple dimensions. It can generate
summarizations, aggregations, and hierarchies at each granularity level and at every dimension
intersection. OLAP also provides functional models for forecasting, trend analysis, and statistical
analysis. In this context, the OLAP engine is a powerful data analysis tool.
Types of OLAP
There are three main types of OLAP servers, as follows:
ROLAP stands for Relational OLAP, an application based on relational DBMSs.
MOLAP stands for Multidimensional OLAP, an application based on multidimensional DBMSs.
HOLAP stands for Hybrid OLAP, an application using both relational and multidimensional
techniques.
ROLAP servers use a relational or extended-relational DBMS to store and manage warehouse data,
and OLAP middleware to provide missing pieces.
ROLAP servers contain optimization for each DBMS back end, implementation of aggregation
navigation logic, and additional tools and services.
ROLAP systems work primarily from the data that resides in a relational database, where the
base data and dimension tables are stored as relational tables. This model permits the
multidimensional analysis of data.
This technique relies on manipulating the data stored in the relational database to give the
presence of traditional OLAP's slicing and dicing functionality. In essence, each method of
slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
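For instance, a sketch of how a ROLAP engine might translate slicing and dicing into SQL; the sales_fact table and its columns are illustrative only.

```python
# How a ROLAP engine might translate slicing and dicing into SQL.
# The sales_fact table and its columns are illustrative only.
base = "SELECT item, SUM(rupees_sold) FROM sales_fact"

# Slice: fix one dimension to a single value.
slice_sql = base + " WHERE city = 'Delhi' GROUP BY item"

# Dice: restrict two or more dimensions at once.
dice_sql = (base
            + " WHERE city IN ('Delhi', 'Mumbai') AND quarter = 'Q1'"
            + " GROUP BY item")

print(slice_sql)
print(dice_sql)
```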
ROLAP Architecture
○ Database server.
○ ROLAP server.
○ Front-end tool.
Relational OLAP (ROLAP) is the latest and fastest-growing OLAP technology segment in the
market. This method allows multiple multidimensional views of two-dimensional relational
tables to be created, avoiding structuring records around the desired view.
Some products in this segment have supported strong SQL engines to handle the complexity of
multidimensional analysis. This includes creating multiple SQL statements to handle user
requests, being 'RDBMS-aware', and being capable of generating SQL statements based
on the optimizer of the DBMS engine.
Advantages
Can handle large amounts of information: The data size limitation of ROLAP technology
depends on the data size of the underlying RDBMS. So, ROLAP itself does not restrict the
amount of data.
Can leverage the features of the RDBMS: The RDBMS already comes with a lot of features.
So ROLAP technologies, which work on top of the RDBMS, can leverage these functionalities.
Disadvantages
Performance can be slow: Because each ROLAP report is a SQL query (or multiple SQL queries)
in the relational database, the query time can be prolonged if the underlying data size is large.
One of the significant distinctions of MOLAP from ROLAP is that the data are summarized
and stored in an optimized format in a multidimensional cube, instead of in a relational
database. In the MOLAP model, data are structured into proprietary formats according to clients'
reporting requirements, with the calculations pre-generated on the cubes.
MOLAP Architecture
○ Database server.
○ MOLAP server.
○ Front-end tool.
A MOLAP structure primarily reads precompiled data. It has limited
capabilities to dynamically create aggregations or to evaluate results which have not been
pre-calculated and stored.
Applications requiring iterative and comprehensive time-series analysis of trends are well suited
for MOLAP technology (e.g., financial analysis and budgeting).
Examples include Arbor Software's Essbase, Oracle's Express Server, Pilot Software's Lightship
Server, Sinper's TM1, Planning Sciences' Gentium, and Kenan Technology's Multiway.
Some of the problems faced by clients are related to maintaining support for multiple subject
areas in an RDBMS. Some vendors can solve these problems by providing access from
MOLAP tools to detailed data in an RDBMS.
This can be very useful for organizations with performance-sensitive multidimensional analysis
requirements and that have built or are in the process of building a data warehouse architecture
that contains multiple subject areas.
An example would be the creation of sales data measured by several dimensions (e.g., product
and sales region) to be stored and maintained in a persistent structure. This structure would be
provided to reduce the application overhead of performing calculations and building aggregation
during initialization. These structures can be automatically refreshed at predetermined intervals
established by an administrator.
Advantages
Excellent Performance: A MOLAP cube is built for fast information retrieval, and is optimal
for slicing and dicing operations.
Can perform complex calculations: All calculations have been pre-generated when the cube is
created. Hence, complex calculations are not only possible, but they return quickly.
Disadvantages
Limited in the amount of information it can handle: Because all calculations are performed
when the cube is built, it is not possible to contain a large amount of data in the cube itself.
Requires additional investment: Cube technology is generally proprietary and does not already
exist in the organization. Therefore, to adopt MOLAP technology, chances are other investments
in human and capital resources are needed.
Advantages of HOLAP
1. HOLAP provides the benefits of both MOLAP and ROLAP.
3. HOLAP balances the disk space requirement, as it only stores the aggregate information
on the OLAP server and the detail record remains in the relational database. So no
duplicate copy of the detail record is maintained.
Disadvantages of HOLAP
1. HOLAP architecture is very complicated because it supports both MOLAP and ROLAP
servers.
Other Types
There are also less popular types of OLAP upon which one could stumble every so
often. We have listed some of the less popular types existing in the OLAP industry.
WOLAP pertains to OLAP application which is accessible via the web browser. Unlike
traditional client/server OLAP applications, WOLAP is considered to have a three-tiered
architecture which consists of three components: a client, a middleware, and a database server.
DOLAP permits a user to download a section of the data from the database or source, and work
with that dataset locally, or on their desktop.
Mobile OLAP enables users to access and work on OLAP data and applications remotely
through the use of their mobile devices.
SOLAP includes the capabilities of both Geographic Information Systems (GIS) and OLAP into
a single user interface. It facilitates the management of both spatial and non-spatial data.
Difference between ROLAP, MOLAP, and HOLAP
| ROLAP | MOLAP | HOLAP |
| --- | --- | --- |
| ROLAP stands for Relational Online Analytical Processing. | MOLAP stands for Multidimensional Online Analytical Processing. | HOLAP stands for Hybrid Online Analytical Processing. |
| The ROLAP storage mode causes the aggregations of the partition to be stored in indexed views in the relational database that was specified in the partition's data source. | The MOLAP storage mode causes the aggregations of the partition and a copy of its source information to be saved in a multidimensional structure in Analysis Services when the partition is processed. | The HOLAP storage mode combines attributes of both MOLAP and ROLAP. Like MOLAP, HOLAP causes the aggregations of the partition to be stored in a multidimensional structure in an SQL Server Analysis Services instance. |
| ROLAP does not cause a copy of the source information to be stored in the Analysis Services data folders. Instead, when the outcome cannot be derived from the query cache, the indexed views in the data source are accessed to answer queries. | The MOLAP structure is highly optimized to maximize query performance. The storage area can be on the computer where the partition is defined or on another computer running Analysis Services. Because a copy of the source information resides in the multidimensional structure, queries can be resolved without accessing the partition's source data. | HOLAP does not cause a copy of the source information to be stored. For queries that access only summary data in the aggregations of a partition, HOLAP is the equivalent of MOLAP. |
| Query response is frequently slower with ROLAP storage than with the MOLAP or HOLAP storage modes. Processing time is also frequently slower with ROLAP. | Query response times can be reduced substantially by using aggregations. The data in the partition's MOLAP structure is only as current as the most recent processing of the partition. | Queries that access source data (for example, a drill-down to an atomic cube cell for which there is no aggregation information) must retrieve data from the relational database and will not be as fast as they would be if the source information were stored in the MOLAP architecture. |
Semi-additive
Semi-additive measures can be aggregated across some dimensions, but not all dimensions. For
example, measures such as head counts and inventory are considered semi-additive.
Non-additive
Non-additive measures are measures that cannot be aggregated across any of the dimensions.
These measures cannot be logically aggregated between records or fact rows. Non-additive
measures are usually the result of ratios or other mathematical calculations. The only calculation
that can be made for such a measure is to get a count of the number of rows of such measures.
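To make the distinction concrete, here is a small pandas sketch (the inventory and margin numbers are made up): summing on-hand inventory across products within a day is valid, summing it across days is not (an average is typical), and a ratio column is never summed.

```python
import pandas as pd

# on_hand inventory is semi-additive: summing across products within one
# day is valid, but summing across days is not (an average is typical).
# margin_pct is non-additive: a ratio is never summed across rows.
inv = pd.DataFrame({
    "day":        ["Mon", "Mon", "Tue", "Tue"],
    "product":    ["A", "B", "A", "B"],
    "on_hand":    [10, 5, 8, 7],
    "margin_pct": [0.20, 0.10, 0.25, 0.15],
})

print(inv.groupby("day")["on_hand"].sum())       # valid: stock per day
print(inv.groupby("product")["on_hand"].mean())  # across time: average
print(inv["margin_pct"].mean())                  # ratios: averaged, not summed
```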