DWM Unit 1

Data warehousing and Data mining Unit-I
UNIT I
DATA WAREHOUSING
Basic Concepts - Data Warehousing Components - Building a Data Warehouse - Database
Architecture for Parallel Processing - Parallel DBMS Vendors - Multidimensional Data Model -
Data Warehouse Schemas for Decision Support, Concept Hierarchies - Characteristics of OLAP
Systems - Typical OLAP Operations - OLAP and OLTP.
Data Warehouse Introduction

A data warehouse is a collection of data marts representing historical data from different
operations in the company. This data is stored in a structure optimized for querying and data
analysis. Table design, dimensions and organization should be consistent throughout a data
warehouse so that reports or queries across the data warehouse are consistent.
A data warehouse can also be viewed as a database for historical data from different functions
within a company. The term Data Warehouse was coined by Bill Inmon in 1990, which he defined in the
following way: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of
data in support of management's decision making process". He defined the terms in the sentence as
follows:
 Subject Oriented: Data that gives information about a particular subject instead of about a
company's ongoing operations.
 Integrated: Data that is gathered into the data warehouse from a variety of sources and merged
into a coherent whole.
 Time-variant: All data in the data warehouse is identified with a particular time period.
 Non-volatile: Data is stable in a data warehouse. More data is added but data is never removed.
This enables management to gain a consistent picture of the business.
A data warehouse is a single, complete and consistent store of data obtained from a variety of
different sources and made available to end users in a form they can understand and use in a
business context. It can be used for decision support, to manage and control the business, and by
managers and end users to understand the business and make judgments.
Data Warehousing is an architectural construct of information systems that provides users with
current and historical decision support information that is hard to access or present in traditional
operational data stores.
Other important terminology
 Enterprise Data warehouse: It collects all information about subjects (customers, products, sales,
assets, personnel) that span the entire organization
 Data Mart: Departmental subsets that focus on selected subjects. A data mart is a segment of a
data warehouse that can provide data for reporting and analysis on a section, unit, department or
operation in the company, e.g. sales, payroll, production. Data marts are sometimes complete
individual data warehouses which are usually smaller than the corporate data warehouse.
 Decision Support System (DSS): Information technology that helps the knowledge worker
(executive, manager or analyst) make faster and better decisions.
 Drill-down: Traversing the summarization levels from highly summarized data to the underlying
current or old detail.
 Metadata: Data about data. It contains the location and description of warehouse system
components: names, definitions, structure and so on.
Benefits of data warehousing
 Data warehouses are designed to perform well with aggregate queries running on large amounts
of data.
 The structure of data warehouses is easier for end users to navigate, understand and query
than that of relational databases designed primarily to handle large numbers of transactions.
 Data warehouses enable queries that cut across different segments of a company's operation.
E.g. production data could be compared against inventory data even if they were originally stored in
different databases with different structures.
 Queries that would be complex in very normalized databases could be easier to build and
maintain in data warehouses, decreasing the workload on transaction systems.
 Data warehousing is an efficient way to manage and report on data that is from a variety of
sources, non uniform and scattered throughout a company.
 Data warehousing is an efficient way to manage demand for lots of information from lots of
users.
 Data warehousing provides the capability to analyze large amounts of historical data for
nuggets of wisdom that can provide an organization with competitive advantage.
Operational and informational Data

Operational Data:
 Focusing on transactional function such as bank card withdrawals and deposits
 Detailed
 Updateable
 Reflects current data
Informational Data:
 Focusing on providing answers to problems posed by decision makers
 Summarized
 Non-updateable
Data Warehouse Characteristics
• A data warehouse can be viewed as an information system with the following attributes:
– It is a database designed for analytical tasks
– Its content is periodically updated
– It contains current and historical data to provide a historical perspective of information
Operational data store (ODS)
• ODS is an architecture concept to support day-to-day operational decision support and contains current
value data propagated from operational applications
• ODS is subject-oriented, similar to a classic definition of a Data warehouse
• ODS is integrated
A Three Tier Data Warehouse Architecture:
Tier-1:
The bottom tier is a warehouse database server that is almost always a relational
database system. Back-end tools and utilities are used to feed data into the bottom tier
from operational databases or other external sources (such as customer profile
information provided by external consultants). These tools and utilities perform data
extraction, cleaning, and transformation (e.g., to merge similar data from different
sources into a unified format), as well as load and refresh functions to update the data
warehouse. The data are extracted using application program interfaces known as
gateways. A gateway is supported by the underlying DBMS and allows client programs to
generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object
Linking and Embedding, Database) by Microsoft and JDBC (Java Database
Connectivity).
This tier also contains a metadata repository, which stores information about the data
warehouse and its contents.
Tier-2:
The middle tier is an OLAP server that is typically implemented using either a
relational OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
A ROLAP model is an extended relational DBMS that maps operations on
multidimensional data to standard relational operations.
A MOLAP model is a special-purpose server that directly implements
multidimensional data and operations.
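As a rough illustration of how a ROLAP server might map a multidimensional operation onto a standard relational operation, the sketch below expresses a roll-up over a small sales table as a SQL GROUP BY. The table name sales, its columns and the sample rows are assumptions made for this example only; a real OLAP server does far more than this.

```python
import sqlite3

# Hypothetical fact data: (city, quarter, units_sold). Illustrative values only.
ROWS = [
    ("Chennai", "Q1", 10),
    ("Chennai", "Q2", 15),
    ("Mumbai", "Q1", 7),
    ("Mumbai", "Q2", 8),
]


def roll_up_by_city(rows):
    """Roll up the quarter dimension: total units per city via GROUP BY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (city TEXT, quarter TEXT, units_sold INTEGER)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    # The multidimensional "roll-up" becomes an ordinary relational aggregation.
    result = conn.execute(
        "SELECT city, SUM(units_sold) FROM sales GROUP BY city ORDER BY city"
    ).fetchall()
    conn.close()
    return result


totals = roll_up_by_city(ROWS)
print(totals)
```

A MOLAP server would instead keep the same figures in a multidimensional array structure and aggregate along an axis directly, without going through SQL.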
Tier-3:
The top tier is a front-end client layer, which contains query and reporting
tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so
on).
Data warehouse Architecture and its seven components
1. Data sourcing, cleanup, transformation, and migration tools

2. Metadata repository
3. Warehouse/database technology
4. Data marts
5. Data query, reporting, analysis, and mining tools
6. Data warehouse administration and management
7. Information delivery system
A data warehouse is an environment, not a product. It is based on a relational database
management system that functions as the central repository for informational data. This central
repository is surrounded by a number of key components designed to make the environment functional,
manageable and accessible.
The data in the data warehouse comes from operational applications. As data enters
the data warehouse, it is transformed into an integrated structure and format. The transformation process
involves conversion, summarization, filtering and condensation. The data warehouse must be capable of
holding and managing large volumes of data as well as different data structures over time.
1. Data warehouse database
This is the central part of the data warehousing environment, shown as item number 2 in the
architecture diagram. It is implemented using RDBMS technology.
2. Sourcing, Acquisition, Clean up, and Transformation Tools
These tools are item number 1 in the architecture diagram. They perform conversions, summarization, key
changes, structural changes and condensation. The data transformation is required so that the information
can be used by decision support tools. The transformation
produces programs, control statements, JCL code, COBOL code, UNIX scripts, SQL DDL code
and so on, to move the data into the data warehouse from multiple operational systems.
The functions of these tools are listed below:
 Removing unwanted data from operational databases
 Converting to common data names and attributes
 Calculating summaries and derived data
 Establishing defaults for missing data
 Accommodating source data definition changes
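The functions listed above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular transformation tool works: the column names (cust_nm, CustName, amt, Amount, internal_flag), the default values and the sample records are all hypothetical.

```python
# Hypothetical mapping from source-specific column names to common names.
# Extending this mapping is one simple way to absorb a source definition change.
COMMON_NAMES = {
    "cust_nm": "customer_name",
    "CustName": "customer_name",
    "amt": "amount",
    "Amount": "amount",
}

# Assumed defaults used when a source record has no value for a field.
DEFAULTS = {"customer_name": "UNKNOWN", "amount": 0.0}


def transform(record, unwanted=("internal_flag",)):
    """Clean one operational record into the (assumed) warehouse format."""
    cleaned = {}
    for key, value in record.items():
        if key in unwanted:  # remove unwanted data
            continue
        cleaned[COMMON_NAMES.get(key, key)] = value  # common data names
    for field, default in DEFAULTS.items():  # defaults for missing data
        if cleaned.get(field) is None:
            cleaned[field] = default
    return cleaned


def summarize(records):
    """Derived data: total amount per customer."""
    totals = {}
    for rec in records:
        name = rec["customer_name"]
        totals[name] = totals.get(name, 0.0) + rec["amount"]
    return totals


# Two operational sources that name the same attributes differently.
source_a = [{"cust_nm": "Asha", "amt": 120.0, "internal_flag": "x"}]
source_b = [
    {"CustName": "Asha", "Amount": 30.0},
    {"CustName": None, "Amount": 5.0},
]

warehouse_rows = [transform(r) for r in source_a + source_b]
summary = summarize(warehouse_rows)
print(warehouse_rows)
print(summary)
```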
Issues to be considered during data sourcing, cleanup, extraction and transformation:
Data heterogeneity: This refers to the differing nature of the source DBMSs. They may use different data models,
different access languages, and different data navigation methods, operations, concurrency, integrity
and recovery processes.
3. Metadata
It is data about data. It is used for maintaining, managing and using the data warehouse. It is
classified into two types:
1. Technical metadata: It contains information about warehouse data used by warehouse
designers and administrators to carry out development and management tasks. It includes:
 Information about data stores
 Transformation descriptions, that is, mapping methods from the operational database to the warehouse database
 Warehouse object and data structure definitions for target data
 The rules used to perform cleanup and data enhancement
 Data mapping operations
 Access authorization, backup history, archive history, information delivery history, data acquisition
history, data access etc.
2. Business metadata: It contains information that presents the data stored in the data warehouse to users. It
includes:
 Subject areas, and info object type including queries, reports, images, video, audio clips etc.
 Internet home pages
 Info related to info delivery system
 Data warehouse operational info such as ownerships, audit trails etc.,

Metadata helps users understand the content and find the data. Metadata is stored in a
separate data store, known as the information directory or metadata repository, which helps to
integrate, maintain and view the contents of the data warehouse.
The following lists the characteristics of info directory/ Meta data:
 It is the gateway to the data warehouse environment
 It supports easy distribution and replication of content for high performance and
availability
 It should be searchable by business oriented key words
 It should act as a launch platform for end user to access data and analysis tools
 It should support the sharing of info
 It should support scheduling options for request
 It should support and provide interfaces to other applications
 It should support end user monitoring of the status of the data warehouse environment
4. Access tools
Their purpose is to provide information to business users for decision making. There are five
categories:
 Data query and reporting tools
 Application development tools
 Executive info system tools (EIS)
 OLAP tools
 Data mining tools

Query and reporting tools are used to generate queries and reports. There are two types of
reporting tools:
 Production reporting tools, used to generate regular operational reports
 Desktop report writers, which are inexpensive desktop tools designed for end users
Managed query tools: used to generate SQL queries. They place a meta layer of software between users and
databases, which offers point-and-click creation of SQL statements. These tools are a preferred choice of
users for segment identification, demographic analysis, territory management, preparation of
customer mailing lists and similar tasks.
Application development tools: These provide a graphical data access environment that integrates OLAP tools
with the data warehouse and can be used to access all database systems.
OLAP tools: are used to analyze the data in multidimensional and complex views. To enable
multidimensional properties they use MDDB and MRDB, where MDDB refers to a multidimensional database
and MRDB refers to a multirelational database.
Data mining tools: are used to discover knowledge from the data warehouse data. They can also be used for
data visualization and data correction purposes.
5. Data marts
Departmental subsets that focus on selected subjects. They are independent and used by a dedicated
user group. They are used for rapid delivery of enhanced decision support functionality to end users. A data
mart is used in the following situations:
 Extremely urgent user requirements
 The absence of a budget for a full-scale data warehouse strategy
 The decentralization of business needs
 The attraction of easy-to-use tools and a mind-sized project

Data marts present two problems:
1. Scalability: A small data mart can grow quickly in multiple dimensions, so while designing
it the organization has to pay more attention to system scalability, consistency and manageability
issues.
2. Data integration
6. Data warehouse administration and management
The management of a data warehouse includes:
 Security and priority management
 Monitoring updates from multiple sources
 Data quality checks
 Managing and updating metadata
 Auditing and reporting data warehouse usage and status
 Purging data
 Replicating, subsetting and distributing data
 Backup and recovery
 Data warehouse storage management, which includes capacity planning, hierarchical
storage management and purging of aged data
7. Information delivery system
• It enables the process of subscribing to data warehouse information.
• It delivers the information to one or more destinations according to a specified scheduling algorithm.
Building a Data warehouse
There are two reasons why organizations consider data warehousing a critical need. In other
words, there are two factors that drive you to build and use a data warehouse. They are:
Business factors:
 Business users want to make decisions quickly and correctly using all available data.
Technological factors:
 To address the incompatibility of operational data stores
 IT infrastructure is changing rapidly. Its capacity is increasing and its cost is decreasing, so
building a data warehouse is easier
There are several things to be considered while building a successful data warehouse
Business considerations:
Organizations interested in development of a data warehouse can choose one of the following two
approaches:
 Top - Down Approach (Suggested by Bill Inmon)
 Bottom - Up Approach (Suggested by Ralph Kimball)

Top - Down Approach
In the top-down approach suggested by Bill Inmon, we build a centralized repository to house
corporate-wide business data. This repository is called the Enterprise Data Warehouse (EDW). The data in
the EDW is stored in a normalized form in order to avoid redundancy. The central repository for
corporate-wide data helps us maintain one version of truth of the data. The data in the EDW is stored at
the most detailed level. The reasons to build the EDW at the most detailed level are:
1. Flexibility to be used by multiple departments.
2. Flexibility to cater for future requirements.
The disadvantages of storing data at the detail level are:
1. The complexity of design increases with increasing level of detail.
2. It takes a large amount of space to store data at the detail level, hence increased cost.
Once the EDW is implemented we start building subject-area-specific data marts, which contain
data in a denormalized form, also called a star schema. The data in the marts is usually summarized based
on the end users' analytical requirements. The reason to denormalize the data in the mart is to provide
faster access to the data for the end users' analytics. If we queried a normalized schema for
the same analytics, we would end up with complex multi-level joins that would be much slower
than the equivalent query on the denormalized schema.
We should implement the top-down approach when
1. The business has complete clarity on all or multiple subject areas data warehouse
requirements.
2. The business is ready to invest considerable time and money.
The advantage of using the Top Down approach is that we build a centralized repository to
cater for one version of truth for business data. This is very important for the data to be reliable,
consistent across subject areas and for reconciliation in case of data related contention between
subject areas.
The disadvantage of using the top-down approach is that it requires more time and initial
investment. The business has to wait for the EDW to be implemented, and then for the data marts to be built,
before it can access its reports.
Bottom Up Approach
The bottom-up approach suggested by Ralph Kimball is an incremental approach to building a data
warehouse. Here we build the data marts separately at different points in time, as and when the specific
subject area requirements are clear. The data marts are integrated or combined together to form a data
warehouse. Separate data marts are combined through the use of conformed dimensions and conformed
facts. A conformed dimension or conformed fact is one that can be shared across data marts.
A conformed dimension has consistent dimension keys, consistent attribute names and
consistent values across separate data marts. A conformed dimension means exactly the same thing with
every fact table to which it is joined.
A conformed fact has the same definition of measures, the same dimensions joined to it and
the same granularity across data marts.
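To make the role of a conformed dimension concrete, here is a small sketch in which two separately built data marts share one product dimension, so their facts can be combined in a single report. The dimension contents, the fact rows and the function names are hypothetical and chosen only for illustration.

```python
# Conformed product dimension shared by both marts (hypothetical keys and names).
PRODUCT_DIM = {1: "Pen", 2: "Notebook"}

# Fact rows from two separately built data marts: (product_key, measure).
SALES_FACTS = [(1, 100), (2, 40), (1, 25)]  # units sold
INVENTORY_FACTS = [(1, 300), (2, 90)]       # units in stock


def total_by_product(facts):
    """Aggregate a measure per product key."""
    totals = {}
    for product_key, value in facts:
        totals[product_key] = totals.get(product_key, 0) + value
    return totals


def drill_across(sales, inventory, dim):
    """Combine both marts; possible only because the keys mean the same thing."""
    sold = total_by_product(sales)
    stock = total_by_product(inventory)
    return {
        dim[key]: (sold.get(key, 0), stock.get(key, 0))
        for key in sorted(dim)
    }


report = drill_across(SALES_FACTS, INVENTORY_FACTS, PRODUCT_DIM)
print(report)
```

If the two marts had used different product keys or names, the combination step would first need a mapping between them, which is the integration cost that conformed dimensions are meant to avoid.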
The bottom-up approach helps us incrementally build the warehouse by developing and
integrating data marts as and when the requirements are clear. We do not have to wait until the
overall requirements of the warehouse are known.
We should implement the bottom-up approach when
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear. We have clarity on only one data mart.
The advantage of using the bottom-up approach is that it does not require high initial costs and has a
faster implementation time; hence the business can start using the marts much earlier than with
the top-down approach.
The disadvantages of using the bottom-up approach are that it stores data in the denormalized
format, so there would be high space usage for detailed data. We have a tendency not to keep
detailed data in this approach, hence losing out on the advantage of having detailed data,
i.e. flexibility to easily cater to future requirements. The bottom-up approach is more realistic, but the
complexity of the integration may become a serious obstacle.
Design considerations
To be successful, a data warehouse designer must adopt a holistic approach, that is, consider all
data warehouse components as parts of a single complex system, and take into account all possible data
sources and all known usage requirements.
Most successful data warehouses that meet these requirements have these common characteristics:
 Are based on a dimensional model
 Contain historical and current data
 Include both detailed and summarized data
 Consolidate disparate data from multiple sources while retaining consistency
A data warehouse is difficult to build for the following reasons:
 Heterogeneity of data sources
 Use of historical data
 Growing nature of the database

The data warehouse design approach must be a business-driven, continuous and iterative engineering
approach. In addition to the general considerations, the following specific points are relevant to data
warehouse design:
Data content
The content and structure of the data warehouse are reflected in its data model. The data model is
the template that describes how information will be organized within the integrated warehouse
framework. The data in the data warehouse must be detailed data. It must be formatted, cleaned up and
transformed to fit the warehouse data model.
Meta data
It defines the location and contents of data in the warehouse. Meta data is searchable by users to
find definitions or subject areas. In other words, it must provide decision support oriented pointers to
warehouse data and thus provides a logical link between warehouse data and decision support
applications.
Data distribution
One of the biggest challenges when designing a data warehouse is the data placement and
distribution strategy. Data volumes continue to grow. Therefore, it becomes necessary to know
how the data should be divided across multiple servers and which users should get access to which types
of data. The data can be distributed based on the subject area, location (geographical region), or time
(current, month, year).
Tools
A number of tools are available that are specifically designed to help in the implementation of the
data warehouse. All selected tools must be compatible with the given data warehouse environment and
with each other. All tools must be able to use a common Meta data repository.
Design steps
The following nine-step method is followed in the design of a data warehouse:
1. Choosing the subject matter

2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre calculations in the fact table
6. Rounding out the dimension table
7. Choosing the duration of the db
8. The need to track slowly changing dimensions
9. Deciding the query priorities and query models
Technical considerations
A number of technical issues are to be considered when designing a data warehouse environment.
These issues include:
 The hardware platform that would house the data warehouse
 The DBMS that supports the warehouse data
 The communication infrastructure that connects data marts, operational systems and end users
 The hardware and software to support meta data repository
 The systems management framework that enables admin of the entire environment
Implementation considerations
The following logical steps are needed to implement a data warehouse:
 Collect and analyze business requirements
 Create a data model and a physical design
 Define data sources
 Choose the DB tech and platform
 Extract the data from operational DB, transform it, clean it up and load it into the
warehouse
 Choose DB access and reporting tools
 Choose DB connectivity software
 Choose data analysis and presentation s/w
 Update the data warehouse
Benefits of data warehousing

Data warehouse usage includes,
– Locating the right info

– Presentation of info
– Testing of hypothesis
– Discovery of info
– Sharing the analysis
The benefits can be classified into two:
 Tangible benefits (quantifiable / measurable). They include:
– Improvement in product inventory
– Decrease in production cost
– Improvement in selection of target markets
– Enhancement in asset and liability management
 Intangible benefits (not easy to quantify). They include:
– Improvement in productivity by keeping all data in a single location and
eliminating rekeying of data
– Reduced redundant processing
– Enhanced customer relations
Types of parallelism
There are two types of parallelism:
 Inter query Parallelism: In which different server threads or processes handle multiple requests at
the same time.
 Intra query Parallelism: This form of parallelism decomposes the serial SQL query into lower
level operations such as scan, join, sort etc. Then these lower level operations are executed
concurrently in parallel.
Intra query parallelism can be done in either of two ways:
 Horizontal parallelism: which means that the data base is partitioned across multiple disks and
parallel processing occurs within a specific task that is performed concurrently on different
processors against different set of data
 Vertical parallelism: This occurs among different tasks. All query components such as scan, join,
sort etc are executed in parallel in a pipelined fashion. In other words, an output from one task
becomes an input into another task.
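The two forms of intra-query parallelism can be sketched with Python threads. This is only a structural illustration under stated assumptions: the partitions, the threshold filter and the function names are hypothetical, and Python threads do not give real CPU parallelism for this kind of work, so the sketch shows how the tasks are arranged rather than any performance gain.

```python
from concurrent.futures import ThreadPoolExecutor
import queue
import threading

# Hypothetical table, already partitioned across three "disks".
PARTITIONS = [[5, 12, 7], [20, 3], [9, 14, 1]]


def scan_partition(rows, threshold):
    """Scan one partition and keep rows above the threshold."""
    return [r for r in rows if r > threshold]


def horizontal_scan(partitions, threshold):
    """Horizontal parallelism: the same scan runs on every partition at once."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        parts = list(pool.map(lambda p: scan_partition(p, threshold), partitions))
    return sorted(r for part in parts for r in part)


_DONE = object()  # sentinel marking the end of a stream


def vertical_pipeline(rows, threshold):
    """Vertical parallelism: scan -> filter -> sum run as pipelined stages."""
    scanned, filtered = queue.Queue(), queue.Queue()
    result = []

    def scan():
        for r in rows:
            scanned.put(r)  # output of this task feeds the next task
        scanned.put(_DONE)

    def keep_large():
        while (r := scanned.get()) is not _DONE:
            if r > threshold:
                filtered.put(r)
        filtered.put(_DONE)

    def total():
        acc = 0
        while (r := filtered.get()) is not _DONE:
            acc += r
        result.append(acc)

    stages = [threading.Thread(target=f) for f in (scan, keep_large, total)]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return result[0]


matches = horizontal_scan(PARTITIONS, threshold=8)
pipeline_total = vertical_pipeline([5, 12, 7, 20, 3], threshold=6)
print(matches, pipeline_total)
```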
Data partitioning:
Data partitioning is the key component for effective parallel execution of data base
operations. Partition can be done randomly or intelligently.
Random partitioning includes random data striping across multiple disks on a single server. Another
option for random partitioning is round-robin partitioning, in which each record is placed on the
next disk assigned to the database.
Intelligent partitioning assumes that DBMS knows where a specific record is located and does not
waste time searching for it across all disks.
The various intelligent partitioning include:
Hash partitioning: A hash algorithm is used to calculate the partition number based on the value of the
partitioning key for each row
Key range partitioning: Rows are placed and located in the partitions according to the value of the
partitioning key. That is all the rows with the key value from A to K are in partition 1, L to T are in
partition 2 and so on.
Schema partitioning: An entire table is placed on one disk, another table is placed on a different disk, and so on.
This is useful for small reference tables.
User-defined partitioning: It allows a table to be partitioned on the basis of a user-defined expression.
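The partitioning schemes described above can each be reduced to a function that maps a record to a partition number. The sketch below is a simplified illustration: the function names are hypothetical, the use of CRC32 as the hash is an assumption (any deterministic hash would do), and the third key range U to Z is added here only to complete the A to K / L to T example.

```python
import zlib


def round_robin_partition(record_index, num_partitions):
    """Round robin: each record goes to the next partition in turn (0-based)."""
    return record_index % num_partitions


def hash_partition(key, num_partitions):
    """Hash partitioning: the partition is computed from the partitioning key.

    CRC32 is used only because it gives the same result on every run.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


# Key ranges as in the text (partition 1: A-K, partition 2: L-T);
# the third range is an assumption added for completeness.
KEY_RANGES = [("A", "K"), ("L", "T"), ("U", "Z")]


def key_range_partition(key, ranges=KEY_RANGES):
    """Key range partitioning: 1-based partition chosen by the key's first letter."""
    first = key[0].upper()
    for number, (low, high) in enumerate(ranges, start=1):
        if low <= first <= high:
            return number
    raise ValueError(f"no partition covers key {key!r}")


print([round_robin_partition(i, 3) for i in range(5)])
print(key_range_partition("Kumar"), key_range_partition("Mehta"))
print(hash_partition("customer-42", 4))
```

With hash and key range partitioning the DBMS can recompute the partition from the key at query time, which is what lets it avoid searching every disk.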
Data base architectures of parallel processing
There are three DBMS software architecture styles for parallel processing:
1. Shared memory or shared everything architecture
2. Shared disk architecture
3. Shared nothing architecture
Shared Memory Architecture

Tightly coupled shared memory systems, illustrated in following figure have the following
characteristics:
 Multiple PUs share memory.
 Each PU has full access to all shared memory through a common bus.
 Communication between nodes occurs via shared memory.
 Performance is limited by the bandwidth of the memory bus.

Symmetric multiprocessor (SMP) machines are often nodes in a cluster. Multiple SMP nodes can
be used with Oracle Parallel Server in a tightly coupled system, where memory is shared among the
multiple PUs, and is accessible by all the PUs through a memory bus. Examples of tightly coupled
systems include the Pyramid, Sequent, and Sun SparcServer.
Performance is potentially limited in a tightly coupled system by a number of factors. These
include various system components such as the memory bandwidth, PU to PU communication
bandwidth, the memory available on the system, the I/O bandwidth, and the bandwidth of the common
bus.
Parallel processing advantages of shared memory systems are these:
 Memory access is cheaper than inter-node communication. This means that internal
synchronization is faster than using the Lock Manager.
 Shared memory systems are easier to administer than a cluster.
A disadvantage of shared memory systems for parallel processing is as follows:
 Scalability is limited by bus bandwidth and latency, and by available memory.

Shared Disk Architecture
Shared disk systems are typically loosely coupled. Such systems, illustrated in following figure,
have the following characteristics:
 Each node consists of one or more PUs and associated memory.
 Memory is not shared between nodes.
 Communication occurs over a common high-speed bus.
 Each node has access to the same disks and other resources.
 A node can be an SMP if the hardware supports it.
 Bandwidth of the high-speed bus limits the number of nodes (scalability) of the system.
The cluster illustrated in the figure is composed of multiple tightly coupled nodes. The Distributed
Lock Manager (DLM) is required. Examples of loosely coupled systems are VAX clusters or Sun
clusters.
Since the memory is not shared among the nodes, each node has its own data cache. Cache
consistency must be maintained across the nodes and a lock manager is needed to maintain the
consistency. Additionally, instance locks using the DLM on the Oracle level must be maintained to
ensure that all nodes in the cluster see identical data.
There is additional overhead in maintaining the locks and ensuring that the data caches are
consistent. The performance impact is dependent on the hardware and software components, such as the
bandwidth of the high-speed bus through which the nodes communicate, and DLM performance.
Parallel processing advantages of shared disk systems are as follows:
 Shared disk systems permit high availability. All data is accessible even if one node dies.
 These systems have the concept of one database, which is an advantage over shared
nothing systems.
 Shared disk systems provide for incremental growth.
Parallel processing disadvantages of shared disk systems are these:
 Inter-node synchronization is required, involving DLM overhead and greater
dependency on high-speed interconnect.
 If the workload is not partitioned well there may be high synchronization overhead.
 There is operating system overhead of running shared disk software.
Shared Nothing Architecture
Shared nothing systems are typically loosely coupled. In shared nothing systems only one CPU is
connected to a given disk. If a table or database is located on that disk, access depends entirely on the PU
which owns it. Shared nothing systems can be represented as follows:
Shared nothing systems are concerned with access to disks, not access to memory. Nonetheless,
adding more PUs and disks can improve scale up. Oracle Parallel Server can access the disks on a shared
nothing system as long as the operating system provides transparent disk access, but this access is
expensive in terms of latency.
Shared nothing systems have advantages and disadvantages for parallel processing:
Advantages
 Shared nothing systems provide for incremental growth.
 System growth is practically unlimited.
 MPPs are good for read-only databases and decision support applications.
 Failure is local: if one node fails, the others stay up.

Disadvantages
 More coordination is required.
 More overhead is required for a process working on a disk belonging to another node.
 If there is a heavy workload of updates or inserts, as in an online transaction processing system, it may
be worthwhile to consider data-dependent routing to alleviate contention.
Parallel DBMS features
 Scope and techniques of parallel DBMS operations

 Optimizer implementation
 Application transparency
 Parallel environment which allows the DBMS server to take full advantage of the existing
facilities on a very low level
 DBMS management tools help to configure, tune, admin and monitor a parallel RDBMS as
effectively as if it were a serial RDBMS
 Price / Performance: The parallel RDBMS can demonstrate a non linear speed up and scale up at
reasonable costs.
Parallel DBMS vendors

1. Oracle: Parallel Query Option (PQO)
Architecture: shared disk architecture
Data partition: key range, hash, round robin
Parallel operations: hash joins, scan and sort
2. Informix: eXtended Parallel Server (XPS)
Architecture: shared memory, shared disk and shared nothing models
Data partition: round robin, hash, schema, key range and user defined
Parallel operations: INSERT, UPDATE, DELETE
3. IBM: DB2 Parallel Edition (DB2 PE)
Architecture: shared nothing model
Data partition: hash
Parallel operations: INSERT, UPDATE, DELETE, load, recovery, index creation, backup, table
reorganization
4. SYBASE: SYBASE MPP
Architecture: shared nothing model
Data partition: hash, key range, schema
Parallel operations: horizontal and vertical parallelism
DBMS schemas for decision support

The basic concepts of dimensional modeling are facts, dimensions and measures. A fact is a
collection of related data items, consisting of measures and context data. It typically represents business
items or business transactions. A dimension is a collection of data that describes one business dimension.
Dimensions determine the contextual background for the facts; they are the parameters over which we
want to perform OLAP. A measure is a numeric attribute of a fact, representing the performance or
behavior of the business relative to the dimensions. In a relational context, three basic schemas are
used in dimensional modeling:
1. Star schema
2. Snowflake schema
3. Fact constellation schema
Star Schema
 Each dimension in a star schema is represented by only one dimension table.
 This dimension table contains the set of attributes for that dimension.
 The following diagram shows the sales data of a company with respect to four dimensions, namely
time, item, branch, and location.
 There is a fact table at the center. It contains the keys to each of the four dimensions.
 The fact table also contains the measures, namely dollars sold and units sold.
Note − Each dimension has only one dimension table, and each table holds a set of attributes. For example, the
location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This
constraint may cause data redundancy. For example, the cities "Vancouver" and "Victoria" are both in the
Canadian province of British Columbia, so the entries for these cities repeat the same values for the
attributes province_or_state and country.
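The star schema described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the table names (time_dim, sales_fact and so on), the reduced attribute lists for time, item and branch, and all sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           province_or_state TEXT, country TEXT);
-- Fact table at the center: one key per dimension plus the two measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
-- Made-up sample rows.
INSERT INTO time_dim VALUES (1, 'Q1', 2024);
INSERT INTO item_dim VALUES (1, 'Modem');
INSERT INTO branch_dim VALUES (1, 'Main');
-- province_or_state and country repeat for every city (the redundancy noted above).
INSERT INTO location_dim VALUES (1, '1 First St', 'Vancouver', 'British Columbia', 'Canada');
INSERT INTO location_dim VALUES (2, '2 Second St', 'Victoria', 'British Columbia', 'Canada');
INSERT INTO sales_fact VALUES (1, 1, 1, 1, 100.0, 2);
INSERT INTO sales_fact VALUES (1, 1, 1, 2, 50.0, 1);
""")

# Total sales per province: a single join from the fact table to one dimension table.
sales_by_province = conn.execute("""
    SELECT l.province_or_state, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN location_dim AS l ON f.location_key = l.location_key
    GROUP BY l.province_or_state
""").fetchall()
```

Because every dimension is a single table, any analysis needs at most one join per dimension, at the price of the repeated province and country values.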
Snowflake Schema
 Some dimension tables in the snowflake schema are normalized.
 The normalization splits the data into additional tables.
 Unlike the star schema, where each dimension has a single table, a snowflake dimension can span
several tables. For example, the item dimension table of the star schema is normalized and split into
two dimension tables, namely the item and supplier tables.
 Now the item dimension table contains the attributes item_key, item_name, type, brand, and
supplier_key.
 The supplier key is linked to the supplier dimension table. The supplier dimension table contains the
attributes supplier_key and supplier_type.
Note − Due to normalization in the snowflake schema, redundancy is reduced; the dimension tables are
therefore easier to maintain and save storage space.
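The split of the item dimension into item and supplier tables can be sketched in the same way. The table names item_dim and supplier_dim and the sample rows are assumptions; the attribute names follow the list above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Supplier attributes live in their own table and are stored once per supplier.
CREATE TABLE supplier_dim (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item_dim (
    item_key     INTEGER PRIMARY KEY,
    item_name    TEXT,
    type         TEXT,
    brand        TEXT,
    supplier_key INTEGER REFERENCES supplier_dim(supplier_key)
);
""")

# Made-up sample rows: two items that share one supplier.
conn.execute("INSERT INTO supplier_dim VALUES (1, 'wholesale')")
conn.executemany(
    "INSERT INTO item_dim VALUES (?, ?, ?, ?, ?)",
    [(10, "Mobile", "phone", "BrandA", 1), (11, "Modem", "network", "BrandB", 1)],
)

# Reading supplier_type now takes an extra join compared with the star schema.
items_with_supplier = conn.execute("""
    SELECT i.item_name, s.supplier_type
    FROM item_dim AS i
    JOIN supplier_dim AS s ON i.supplier_key = s.supplier_key
    ORDER BY i.item_key
""").fetchall()
```

The supplier type is stored once instead of once per item, which is the storage saving mentioned in the note; the trade-off is the additional join at query time.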
Fact Constellation Schema
 A fact constellation has multiple fact tables. It is also known as a galaxy schema.
 The following diagram shows two fact tables, namely sales and shipping.
 The sales fact table is the same as that in the star schema.
 The shipping fact table has five dimension keys, namely item_key, time_key, shipper_key,
from_location, and to_location.
 The shipping fact table also contains two measures, namely dollars cost and units shipped.
 Dimension tables can also be shared between fact tables. For example, the time, item, and location
dimension tables are shared between the sales and shipping fact tables.
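Which dimensions the two fact tables share can be worked out mechanically. The sketch below is a plain Python model of the key lists given above; the dictionary and function names are assumptions made for the example.

```python
# Dimension keys referenced by each fact table, as listed above.
fact_tables = {
    "sales": {"time_key", "item_key", "branch_key", "location_key"},
    "shipping": {"item_key", "time_key", "shipper_key", "from_location", "to_location"},
}

# from_location and to_location both point into the location dimension.
key_to_dimension = {
    "time_key": "time",
    "item_key": "item",
    "branch_key": "branch",
    "location_key": "location",
    "shipper_key": "shipper",
    "from_location": "location",
    "to_location": "location",
}


def dimensions_of(fact: str) -> set[str]:
    """Return the set of dimension tables a fact table refers to."""
    return {key_to_dimension[key] for key in fact_tables[fact]}


# Dimensions used by both fact tables, i.e. the shared dimension tables.
shared_dimensions = dimensions_of("sales") & dimensions_of("shipping")
```

The intersection contains time, item and location, which matches the shared dimension tables named in the text.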
Concept Hierarchies
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more
general concepts. It organizes concepts in a hierarchical structure or a specific partial order, which is
used to express knowledge in concise, high-level terms and makes it possible to mine knowledge at
several levels of abstraction.
A concept hierarchy consists of a set of nodes organized in a tree, where the nodes represent values of an
attribute, known as concepts. A special node, “ANY”, is reserved for the root of the tree. Each node in a
concept hierarchy is assigned a level number. The level of the root node is one. The level of a
non-root node is one more than the level of its parent.
Because values are represented by nodes, the levels of nodes can also be used to describe the levels of values.
A concept hierarchy enables raw data to be handled at a higher and more generalized level of
abstraction.
Consider a concept hierarchy for the dimension location. City values for location include Vancouver,
Toronto, New York, and Chicago. Each city, however, can be mapped to the province or state to which it
belongs. For example, Vancouver can be mapped to British Columbia, and Chicago to Illinois. The provinces
and states can in turn be mapped to the country (e.g., Canada or the United States) to which they belong.
These mappings form a concept hierarchy for the dimension location, mapping a set of low-level concepts
(i.e., cities) to higher-level, more general concepts (i.e., countries). This concept hierarchy is illustrated in
the figure below.
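The location hierarchy can be written down as two lookup tables, one per mapping step. This is a minimal sketch: the names LEVELS, level_number and generalize are assumptions, and the level numbers follow the rule stated earlier (the root “ANY” is level one).

```python
# Levels from the root downwards; level numbers start at one for the root.
LEVELS = ["ANY", "country", "province_or_state", "city"]

city_to_province_or_state = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "New York": "New York",
    "Chicago": "Illinois",
}
province_or_state_to_country = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "New York": "United States",
    "Illinois": "United States",
}


def level_number(level: str) -> int:
    """Level of the root is one; each step down adds one."""
    return LEVELS.index(level) + 1


def generalize(city: str, level: str) -> str:
    """Map a city value up to the requested level of the hierarchy."""
    if level == "ANY":
        return "ANY"
    if level == "city":
        return city
    province_or_state = city_to_province_or_state[city]
    if level == "province_or_state":
        return province_or_state
    if level == "country":
        return province_or_state_to_country[province_or_state]
    raise ValueError(f"unknown level: {level}")
```

For example, generalize("Vancouver", "country") follows the mappings Vancouver, British Columbia, Canada.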
There are several types of concept hierarchies which are as follows −
Schema Hierarchy − Schema hierarchy represents the total or partial order between attributes in the
database. It can define existing semantic relationships between attributes. In a database, more than one
schema hierarchy can be generated by using multiple sequences and grouping of attributes.
For example, suppose that the dimension location is described by the attributes number, street, city,
province_or_state, zip_code, and country. These attributes are related by a total order, forming a concept
hierarchy such as “street < city < province_or_state < country.” This hierarchy is shown in Figure 4.10(a).
Alternatively, the attributes of a dimension may be organized in a partial order, forming a lattice. An example
of a partial order for the time dimension based on the attributes day, week, month, quarter, and year is “day
< {month < quarter; week} < year.” This lattice structure is shown in Figure 4.10(b). A concept hierarchy
that is a total or partial order among attributes in a database schema is called a schema hierarchy.
Set-Grouping Hierarchy − A set-grouping hierarchy organizes the values of a given attribute or dimension into
groups of constants or range values. It is also known as an instance hierarchy because the partial order of the
hierarchy is defined on the set of instances or values of an attribute. These hierarchies carry more
functional meaning and are therefore often preferred over other hierarchies.
A total or partial order can be defined among groups of values. An example of a set-grouping hierarchy is
shown in Figure 4.11 for the dimension price, where an interval ($X…$Y] denotes the range from $X
(exclusive) to $Y (inclusive).
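The interval convention ($X…$Y] can be made concrete with a small Python sketch that assigns a price to its group. The interval width of 200 and the function name price_group are assumptions for the example; the intervals in Figure 4.11 may differ.

```python
import math


def price_group(price: float, width: int = 200) -> str:
    """Return the interval ($X...$Y] that contains price (X exclusive, Y inclusive)."""
    if price <= 0:
        raise ValueError("price must be positive")
    upper = math.ceil(price / width) * width  # smallest multiple of width >= price
    lower = upper - width
    return f"(${lower}...${upper}]"
```

With these assumptions a price of exactly 200 falls into ($0...$200], because the upper bound is inclusive, while 201 falls into ($200...$400].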
Operation-Derived Hierarchy − An operation-derived hierarchy is defined by a set of operations on the
data. These operations are specified by users, experts, or the data mining system. These hierarchies are
usually defined for numerical attributes. Such operations can be as simple as a range value comparison or as
complex as a data clustering and data distribution analysis algorithm.
Rule-based Hierarchy − In a rule-based hierarchy, either a whole concept hierarchy or a portion of it is
defined by a set of rules and is evaluated dynamically based on the current data and rule
definition. A lattice-like structure is used to represent this type of hierarchy graphically, in which each
child-parent path is associated with a generalization rule.
Concept hierarchies can be generated statically or dynamically, depending on the data set. Generating a
concept hierarchy from a static data set is known as static generation, and generating it from a dynamic
data set is known as dynamic generation.
Characteristics of OLAP
The FASMI Test
The FASMI test describes the characteristics of an OLAP application in a specific way, without dictating how it
should be implemented.
Fast − The system is targeted to deliver most responses to users within about five seconds,
with the simplest analyses taking no more than one second and very few taking more than 20 seconds.
Independent research in the Netherlands has shown that end-users assume that a process has failed if
results are not received within 30 seconds, and they are apt to hit ‘Alt+Ctrl+Delete’ unless the system
warns them that the report will take longer.
Analysis − The system can cope with any business logic and statistical analysis that is
relevant for the application and the user, while keeping it easy enough for the target user. Although some
pre-programming may be required, it is not considered acceptable if all application definitions have to be
completed using a professional 4GL.
It is necessary to allow the user to define new ad hoc calculations as part of the analysis and to report on
the data in any desired way, without having to program. This excludes products (like Oracle
Discoverer) that do not allow adequate end-user oriented calculation flexibility.
Shared − The system implements all the security requirements for confidentiality (possibly
down to cell level) and, if multiple write access is required, concurrent update locking at an appropriate level.
Not all applications require users to write data back, but for the growing number that do, the system must be
able to handle multiple updates in a timely, secure manner. This is a major area of weakness in some
OLAP products, which tend to assume that all OLAP applications will be read-only, with simple security
controls.
Multidimensional − The system must provide a multidimensional conceptual view of the data, including
full support for hierarchies and multiple hierarchies. No specific minimum number of dimensions is
required, as this is too application dependent and most products seem to have enough
for their target markets.
Information − Information is all of the data and derived information required, wherever it is and however
much is relevant for the application. The capacity of different products is measured in terms of how much
input data they can handle, not how many gigabytes they take to store it.
Typical OLAP operations.
An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows
managers and analysts to gain insight into the information through fast, consistent, and interactive access to
it. This section covers the types of OLAP servers, the OLAP operations, and the difference between OLAP
and OLTP.
Types of OLAP Servers
We have four types of OLAP servers −
 Relational OLAP (ROLAP)
 Multidimensional OLAP (MOLAP)
 Hybrid OLAP (HOLAP)
 Specialized SQL Servers
Relational OLAP
ROLAP servers are placed between a relational back-end server and client front-end tools. To store and manage
warehouse data, ROLAP uses a relational or extended-relational DBMS.
ROLAP includes the following −
 Implementation of aggregation navigation logic.
 Optimization for each DBMS back end.
 Additional tools and services.
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views of data. With
multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore, many
MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.
Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP and the
faster computation of MOLAP. HOLAP servers can store large volumes of detailed
data, while the aggregations are stored separately in a MOLAP store.
Specialized SQL Servers
Specialized SQL servers provide advanced query language and query processing support for SQL queries
over star and snowflake schemas in a read-only environment.
OLAP Operations
Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP operations on
multidimensional data.
Here is the list of OLAP operations −
 Roll-up
 Drill-down
 Slice and dice
 Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −
 By climbing up a concept hierarchy for a dimension

 By dimension reduction
The following diagram illustrates how roll-up works.
 Roll-up is performed by climbing up a concept hierarchy for the dimension location.
 Initially the concept hierarchy was "street < city < province < country".
 On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the
level of country.
 The data is now grouped by country rather than by city.
 When roll-up is performed by dimension reduction, one or more dimensions are removed from the data cube.
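Both forms of roll-up can be sketched on a tiny cube held in a Python dictionary. The cube contents, the city-to-country table and the function names are made up for the example; a real OLAP server would work on far larger, pre-aggregated structures.

```python
from collections import defaultdict

# Hypothetical cube cells: (city, quarter, item) -> units sold.
cube = {
    ("Vancouver", "Q1", "Mobile"): 10,
    ("Toronto", "Q1", "Mobile"): 20,
    ("Chicago", "Q1", "Mobile"): 5,
    ("New York", "Q1", "Modem"): 7,
    ("Toronto", "Q2", "Modem"): 3,
}
city_to_country = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Chicago": "USA",
    "New York": "USA",
}


def roll_up_location(cells):
    """Roll-up by climbing the location hierarchy from city to country."""
    totals = defaultdict(int)
    for (city, quarter, item), units in cells.items():
        totals[(city_to_country[city], quarter, item)] += units
    return dict(totals)


def roll_up_by_dimension_reduction(cells, axis):
    """Roll-up by dimension reduction: remove one axis and sum over it."""
    totals = defaultdict(int)
    for key, units in cells.items():
        totals[key[:axis] + key[axis + 1:]] += units
    return dict(totals)


by_country = roll_up_location(cube)
without_item = roll_up_by_dimension_reduction(cube, axis=2)
```

In by_country the two Canadian Q1 Mobile cells are merged into one cell with 30 units; in without_item the item axis is gone and each cell is keyed by city and quarter only.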
Drill-down
Drill-down is the reverse operation of roll-up. It is performed in either of the following ways −
 By stepping down a concept hierarchy for a dimension
 By introducing a new dimension.
The following diagram illustrates how drill-down works −
 Drill-down is performed by stepping down a concept hierarchy for the dimension time.
 Initially the concept hierarchy was "day < month < quarter < year".
 On drilling down, the time dimension is descended from the level of quarter to the level of month.
 When drill-down is performed by introducing a new dimension, one or more dimensions are added to
the data cube.
 It navigates from less detailed data to more detailed data.
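Drill-down from quarter to month can be sketched as follows. Note that the detailed month-level data must already be stored; drill-down only exposes it. The sample data and the names quarterly_view and drill_down are assumptions for the example.

```python
from collections import defaultdict

# Hypothetical base data at the month level: (month, item) -> units sold.
monthly = {
    ("Jan", "Mobile"): 4,
    ("Feb", "Mobile"): 6,
    ("Mar", "Mobile"): 5,
    ("Apr", "Mobile"): 7,
}
month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}


def quarterly_view(cells):
    """The summarized view the analyst starts from: totals per (quarter, item)."""
    totals = defaultdict(int)
    for (month, item), units in cells.items():
        totals[(month_to_quarter[month], item)] += units
    return dict(totals)


def drill_down(cells, quarter, item):
    """Step down from one quarter cell to the month-level cells behind it."""
    return {
        month: units
        for (month, cell_item), units in cells.items()
        if cell_item == item and month_to_quarter[month] == quarter
    }


summary = quarterly_view(monthly)
q1_detail = drill_down(monthly, "Q1", "Mobile")
```

The Q1 total of 15 units in summary is broken down into its three monthly values in q1_detail.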
Slice
The slice operation performs a selection on one particular dimension of a given cube and provides a new sub-cube.
Consider the following diagram that shows how slice works.
 Here slice is performed for the dimension "time" using the criterion time = "Q1".
 It forms a new sub-cube that contains only the data for Q1.
Dice
Dice performs a selection on two or more dimensions of a given cube and provides a new sub-cube. Consider the
following diagram that shows the dice operation.
The dice operation on the cube is based on the following selection criteria, which involve three dimensions.
 (location = "Toronto" or "Vancouver")
 (time = "Q1" or "Q2")
 (item = "Mobile" or "Modem")
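The slice criterion time = "Q1" and the three dice criteria above can be applied to a small cube as follows. This is a minimal sketch; the cube contents and the function names slice_cube and dice_cube are made up for the example.

```python
# Hypothetical cube cells: (location, time, item) -> units sold.
cube = {
    ("Toronto", "Q1", "Mobile"): 20,
    ("Vancouver", "Q2", "Modem"): 8,
    ("Chicago", "Q1", "Mobile"): 5,
    ("Toronto", "Q3", "Mobile"): 9,
    ("Vancouver", "Q1", "Phone"): 4,
}


def slice_cube(cells, axis, value):
    """Slice: fix one dimension to a single value and drop that axis."""
    return {
        key[:axis] + key[axis + 1:]: units
        for key, units in cells.items()
        if key[axis] == value
    }


def dice_cube(cells, allowed):
    """Dice: keep cells whose value on every axis is in that axis's allowed set."""
    return {
        key: units
        for key, units in cells.items()
        if all(value in permitted for value, permitted in zip(key, allowed))
    }


q1_slice = slice_cube(cube, axis=1, value="Q1")
diced = dice_cube(
    cube,
    allowed=({"Toronto", "Vancouver"}, {"Q1", "Q2"}, {"Mobile", "Modem"}),
)
```

The slice returns a two-dimensional sub-cube keyed by location and item, while the dice keeps all three dimensions but with fewer values on each.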
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative
presentation of data. Consider the following diagram that shows the pivot operation.
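For a two-dimensional view, the pivot amounts to swapping the row and column axes. The sketch below assumes a view stored as a list of rows; the labels and values are made up for the example.

```python
# Hypothetical 2-D view: rows are locations, columns are items.
row_labels = ["Toronto", "Vancouver"]
col_labels = ["Mobile", "Modem"]
table = [
    [20, 8],   # Toronto
    [12, 3],   # Vancouver
]


def pivot(rows, cols, values):
    """Rotate the view: former columns become rows; cell values are unchanged."""
    rotated = [list(column) for column in zip(*values)]
    return cols, rows, rotated


new_rows, new_cols, rotated_table = pivot(row_labels, col_labels, table)
```

After the pivot the items label the rows and the locations label the columns; no value is aggregated or removed.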
OLAP vs OLTP
OLAP stands for On-Line Analytical Processing. It is used for analysis of database information from
multiple database systems at one time, such as sales analysis and forecasting, market research, budgeting,
etc. A data warehouse is an example of an OLAP system.
OLTP stands for On-Line Transaction Processing. It is used for maintaining online transactions and
record integrity in multiple-access environments. An OLTP system manages a very large number of short
online transactions, for example, at an ATM.
Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | Involves historical processing of information. | Involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | Useful in analyzing the business. | Useful in running the business.
4 | It focuses on Information out. | It focuses on Data in.
5 | Based on Star Schema, Snowflake Schema and Fact Constellation Schema. | Based on Entity Relationship Model.
6 | Contains historical data. | Contains current data.
7 | Provides summarized and consolidated data. | Provides primitive and highly detailed data.
8 | Provides summarized and multidimensional view of data. | Provides detailed and flat relational view of data.
9 | Number of users is in hundreds. | Number of users is in thousands.
10 | Number of records accessed is in millions. | Number of records accessed is in tens.
11 | Database size is from 100 GB to 1 TB. | Database size is from 100 MB to 1 GB.
12 | Highly flexible. | Provides high performance.