DM Module 1
Data warehouses generalize and consolidate data in multidimensional space. The construction of data
warehouses involves data cleaning, data integration, and data transformation, and can be viewed as an
important preprocessing step for data mining. Moreover, data warehouses provide online analytical
processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities,
which facilitates effective data generalization and data mining.
Many other data mining functions, such as association, classification, prediction, and clustering, can be
integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of
abstraction. Hence, the data warehouse has become an increasingly important platform for data analysis
and OLAP, and it provides an effective platform for data mining. Therefore, data warehousing and
OLAP form an essential step in the knowledge discovery process.
Key features:
• Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For
example, "sales" can be a particular subject.
• Integrated: A data warehouse integrates data from multiple data sources. For example, source
A and source B may have different ways of identifying a product, but in a data warehouse there
will be only a single way of identifying a product.
• Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, or 12 months ago, or even older data, from a data warehouse. This contrasts
with a transaction system, where often only the most recent data is kept. For example, a
transaction system may hold only the most recent address of a customer, whereas a data warehouse can
hold all addresses associated with a customer.
• Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.
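The contrast between a transaction system and a time-variant, non-volatile warehouse can be sketched in a few lines of Python. This is only an illustration of the customer-address example above; the names `operational`, `warehouse_history`, and `record_address`, and the sample addresses and dates, are hypothetical.

```python
from datetime import date

# Transaction system (hypothetical): keeps only the latest address per customer.
operational = {}
# Warehouse (hypothetical): append-only history of (customer, address, effective date).
warehouse_history = []

def record_address(customer_id, address, effective):
    """Apply one address change to both stores."""
    operational[customer_id] = address  # old value is overwritten
    warehouse_history.append((customer_id, address, effective))  # old rows are never altered

record_address("c1", "12 Oak St", date(2023, 1, 5))
record_address("c1", "98 Pine Ave", date(2024, 3, 1))

# The transaction system knows only the current address;
# the warehouse can still return every address the customer has had.
all_addresses = [addr for (cid, addr, _) in warehouse_history if cid == "c1"]
print(operational["c1"])
print(all_addresses)
```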
• A fact table contains measures (such as dollars_sold) and keys to each of the related dimension
tables.
• In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D
cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice
of cuboids forms a data cube.
Given a set of dimensions, we can generate a cuboid for each of the possible subsets of the given
dimensions. The result would form a lattice of cuboids, each showing the data at a different level of
summarization, or group-by. The lattice of cuboids is then referred to as a data cube. The figure shows a
lattice of cuboids forming a data cube for the dimensions time, item, location, and supplier. The cuboid
that holds the lowest level of summarization is called the base cuboid.
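As a rough sketch of how the lattice arises, the following Python enumerates one cuboid per subset of the dimensions time, item, location, and supplier. The function name `cuboid_lattice` is hypothetical, and the sketch assumes each dimension has a single level (no concept hierarchies), so n dimensions give 2^n cuboids.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Map k -> list of k-D cuboids, each cuboid being a tuple of group-by dimensions."""
    lattice = {}
    for k in range(len(dimensions) + 1):
        lattice[k] = list(combinations(dimensions, k))
    return lattice

dims = ["time", "item", "location", "supplier"]
lattice = cuboid_lattice(dims)

total = sum(len(level) for level in lattice.values())
apex = lattice[0][0]          # 0-D cuboid: empty group-by, highest summarization
base = lattice[len(dims)][0]  # 4-D cuboid: all dimensions, lowest summarization

print(total)  # 2**4 = 16 cuboids
print(apex)
print(base)
```

Each tuple corresponds to one group-by: the empty tuple is the apex cuboid, and the tuple containing all four dimensions is the base cuboid.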
1.2.2 Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Data Models
The most popular data model for a data warehouse is a multidimensional model, which can exist
in the form of a star schema, a snowflake schema, or a fact constellation schema.
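Schemas for multidimensional data models
• Star schema: A fact table in the middle connected to a set of dimension tables
• Snowflake schema: A variant of the star schema in which some dimension tables are normalized
into additional, smaller tables
• Fact constellation: Multiple fact tables share dimension tables; the schema can be viewed as a
collection of stars and is therefore called a galaxy schema or fact constellation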
Star schema: The most common modeling paradigm is the star schema, in which the data warehouse
contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2)
a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles
a starburst, with the dimension tables displayed in a radial pattern around the central fact table.
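A minimal star schema can be sketched with Python's built-in sqlite3 module. The table names (`sales_fact`, `item_dim`, `location_dim`), their columns other than dollars_sold, and the sample rows are assumptions made for illustration, not part of the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one per dimension, each with a surrogate key.
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Central fact table: keys to each dimension plus the measure dollars_sold.
CREATE TABLE sales_fact (
    item_key INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL
);

INSERT INTO item_dim VALUES (1, 'laptop', 'BrandA'), (2, 'phone', 'BrandB');
INSERT INTO location_dim VALUES (1, 'Chicago', 'USA'), (2, 'Toronto', 'Canada');
INSERT INTO sales_fact VALUES (1, 1, 1000.0), (1, 2, 500.0), (2, 1, 300.0);
""")

# Summarize the measure along one dimension with a single join to the fact table.
rows = conn.execute("""
    SELECT l.city, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN location_dim l ON f.location_key = l.location_key
    GROUP BY l.city
    ORDER BY l.city
""").fetchall()
print(rows)
```

Every dimension is one join away from the fact table, which is what gives the schema graph its star shape.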
Snowflake schema: The snowflake schema is a variant of the star schema model, where some dimension
tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph
forms a shape similar to a snowflake.
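To illustrate the normalization step, the sketch below splits a location dimension into `location_dim` and a separate `city_dim`, again using sqlite3. The table and column names and the sample rows are hypothetical; the point is only that city and country are stored once per city, at the cost of an extra join.

```python
import sqlite3

snow = sqlite3.connect(":memory:")
snow.executescript("""
-- City attributes are moved out of the location dimension into their own table.
CREATE TABLE city_dim (city_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY,
    street TEXT,
    city_key INTEGER REFERENCES city_dim(city_key)
);
CREATE TABLE sales_fact (
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL
);

INSERT INTO city_dim VALUES (1, 'Chicago', 'USA'), (2, 'Toronto', 'Canada');
INSERT INTO location_dim VALUES (1, '1 Main St', 1), (2, '2 Lake Ave', 1), (3, '3 King St', 2);
INSERT INTO sales_fact VALUES (1, 100.0), (2, 200.0), (3, 50.0);
""")

# Reaching the country now takes two joins: fact -> location -> city.
by_country = snow.execute("""
    SELECT c.country, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN location_dim l ON f.location_key = l.location_key
    JOIN city_dim c ON l.city_key = c.city_key
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
print(by_country)
```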
Fact constellation: Sophisticated applications may require multiple fact tables to share dimension tables.
This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact
constellation.
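A fact constellation can be sketched the same way: two fact tables that reference one shared dimension table. The tables `sales_fact` and `shipping_fact`, the measure units_shipped, and the sample rows are assumptions chosen for illustration.

```python
import sqlite3

galaxy = sqlite3.connect(":memory:")
galaxy.executescript("""
-- One dimension table shared by both fact tables.
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT);

CREATE TABLE sales_fact (
    item_key INTEGER REFERENCES item_dim(item_key),
    dollars_sold REAL
);
CREATE TABLE shipping_fact (
    item_key INTEGER REFERENCES item_dim(item_key),
    units_shipped INTEGER
);

INSERT INTO item_dim VALUES (1, 'laptop'), (2, 'phone');
INSERT INTO sales_fact VALUES (1, 1000.0), (2, 300.0), (1, 500.0);
INSERT INTO shipping_fact VALUES (1, 3), (2, 1);
""")

# Aggregate each fact table separately, then line the results up on the shared dimension.
summary = galaxy.execute("""
    SELECT i.item_name,
           (SELECT SUM(s.dollars_sold) FROM sales_fact s WHERE s.item_key = i.item_key),
           (SELECT SUM(sh.units_shipped) FROM shipping_fact sh WHERE sh.item_key = i.item_key)
    FROM item_dim i
    ORDER BY i.item_name
""").fetchall()
print(summary)
```

Aggregating each fact table in its own subquery avoids the double counting that a direct join between the two fact tables would cause.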