
Unit 1


A Three-Tier Data Warehouse Architecture

Data warehouses often adopt a three-tier architecture, as presented in the figure:

1. The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (such as customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse. The data are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server (a minimal extraction sketch follows this list). Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity). This tier also contains a metadata repository, which stores information about the data warehouse and its contents.

2. The middle tier is an OLAP server that is typically implemented using either
(1) a relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps
operations on multidimensional data to standard relational operations; or
(2) a multidimensional OLAP (MOLAP) model, that is, a special-purpose server
that directly implements multidimensional data and operations.
3. The top tier is a front-end client layer, which contains query and reporting tools,
analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
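
To make the gateway idea in the bottom tier concrete, the following is a minimal extraction sketch in Python using the pyodbc ODBC driver. The DSN name, credentials, and the sales table are illustrative assumptions, not part of any particular warehouse; a real back-end tool would add far more cleaning, loading, and refresh logic.

# Minimal sketch: pulling operational data through an ODBC gateway
# before cleaning/transforming and loading it into the warehouse.
# The DSN "operational_db" and the "sales" table are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=operational_db;UID=etl_user;PWD=secret")
cursor = conn.cursor()

# The gateway lets the client program send SQL to the server.
cursor.execute("SELECT item_id, branch_id, amount, sale_date FROM sales")
rows = cursor.fetchall()

# A trivial "transformation": unify types and the date format before loading.
cleaned = [(item, branch, float(amount), str(sale_date)[:10])
           for item, branch, amount, sale_date in rows]

conn.close()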

KDD:

Simply stated, data mining refers to extracting or “mining” knowledge from large amounts of data. The term is actually a misnomer. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, data mining should have been more appropriately named “knowledge mining from data,” which is unfortunately somewhat long. “Knowledge mining,” a shorter term, may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a vivid term characterizing the process that finds a small set of precious nuggets from a great deal of raw material (Figure 1.3). Thus, such a misnomer that carries both “data” and “mining” became a popular choice. Many other terms carry a similar or slightly different meaning to data mining, such as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge discovery. Knowledge discovery as a process is depicted in Figure 1.4 and consists of an iterative sequence of the following steps:

1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures; Section 1.5)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
Steps 1 to 4 are different forms of data preprocessing, where the data are prepared for mining. The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base. Note that according to this view, data mining is only one step in the entire process, albeit an essential one because it uncovers hidden patterns for evaluation.
We agree that data mining is a step in the knowledge discovery process. However, in industry, in media, and in the database research milieu, the term data mining is becoming more popular than the longer term of knowledge discovery from data. Therefore, in this book, we choose to use the term data mining. We adopt a broad view of data mining functionality: data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.
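
As a rough illustration of how the seven KDD steps line up in practice, the following Python sketch walks a small pandas table through cleaning, integration, selection, transformation, and a very simple stand-in for the mining step. The tables, column names, and threshold are invented for the example only.

# A compressed walk through the KDD steps using pandas.
# The tables and column names are invented for illustration.
import pandas as pd

# 1-2. Data cleaning and integration: combine two sources, drop noise.
sales = pd.DataFrame({"customer_id": [1, 2, 2, 3],
                      "item": ["TV", "Radio", "TV", None],
                      "amount": [500.0, 120.0, 650.0, 80.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "region": ["Asia", "Europe", "Asia"]})
data = sales.merge(customers, on="customer_id").dropna()

# 3. Data selection: keep only the attributes relevant to the task.
data = data[["item", "amount", "region"]]

# 4. Data transformation: aggregate to a form suitable for mining.
summary = data.groupby(["region", "item"], as_index=False)["amount"].sum()

# 5. Data mining (placeholder for a real algorithm): the top-selling
#    item per region.
patterns = summary.loc[summary.groupby("region")["amount"].idxmax()]

# 6-7. Pattern evaluation and presentation: keep and show only the
#      patterns passing a simple interestingness threshold.
print(patterns[patterns["amount"] > 300])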

A typical data mining architecture:

Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may also be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).
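
A concept hierarchy can be as simple as a mapping from attribute values to higher levels of abstraction. The sketch below shows one possible way to represent and apply such a hierarchy in Python; the cities, countries, and continents are made-up examples.

# A toy concept hierarchy for a "location" attribute:
# city -> country -> continent. Values are illustrative.
location_hierarchy = {
    "Vancouver": ("Canada", "North America"),
    "Chicago":   ("USA", "North America"),
    "Mumbai":    ("India", "Asia"),
}

def roll_up_location(city, level):
    """Return the city generalized to 'country' or 'continent'."""
    country, continent = location_hierarchy[city]
    return country if level == "country" else continent

print(roll_up_location("Mumbai", "continent"))  # -> Asia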

Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process so as to confine the search to only the interesting patterns.
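
The idea of pushing interestingness evaluation into the mining step can be illustrated with a minimum-support threshold for frequent items. The sketch below prunes infrequent items during counting rather than afterwards; the transactions and the threshold are illustrative assumptions.

# Sketch: using a minimum-support threshold *inside* the mining loop,
# so uninteresting (infrequent) items are discarded early.
from collections import Counter

transactions = [                      # hypothetical market baskets
    {"milk", "bread"}, {"milk", "butter"},
    {"bread", "butter", "eggs"}, {"milk", "bread", "butter"},
]
MIN_SUPPORT = 0.5                     # interestingness threshold

counts = Counter(item for basket in transactions for item in basket)
n = len(transactions)

# Pattern evaluation pushed into mining: keep only frequent items,
# so later passes never consider the pruned ones (here, "eggs").
frequent_items = {item for item, c in counts.items() if c / n >= MIN_SUPPORT}
print(frequent_items)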

User interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.

Database, data warehouse, World Wide Web, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.

Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.
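
One way to picture how these components fit together is a thin set of Python classes: the server fetches only the data relevant to the request, the engine mines it, and the evaluation module filters the result against a threshold drawn from the knowledge base. This is a structural sketch under assumed names, not a real system.

# Structural sketch of the architecture: all class and method
# names are assumptions made for illustration.
class WarehouseServer:
    def __init__(self, tables):
        self.tables = tables                      # name -> list of rows

    def fetch_relevant(self, table, columns):
        """Return only the columns relevant to the mining request."""
        return [{c: row[c] for c in columns} for row in self.tables[table]]

class MiningEngine:
    def characterize(self, rows, attribute):
        """A trivial 'characterization' module: value frequencies."""
        freq = {}
        for row in rows:
            freq[row[attribute]] = freq.get(row[attribute], 0) + 1
        return freq

class PatternEvaluator:
    def __init__(self, min_count):
        self.min_count = min_count                # from the knowledge base

    def filter(self, patterns):
        return {k: v for k, v in patterns.items() if v >= self.min_count}

# Usage with a toy table.
server = WarehouseServer({"sales": [{"region": "Asia", "amount": 10},
                                    {"region": "Asia", "amount": 20},
                                    {"region": "Europe", "amount": 5}]})
rows = server.fetch_relevant("sales", ["region"])
patterns = MiningEngine().characterize(rows, "region")
print(PatternEvaluator(min_count=2).filter(patterns))   # {'Asia': 2}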

Differences between Operational Database Systems and Data Warehouses:
Because most people are familiar with commercial relational database systems, it is easy to understand what a data warehouse is by comparing these two kinds of systems.
The major task of on-line operational database systems is to perform on-line transaction and query processing. These systems are called on-line transaction processing (OLTP) systems. They cover most of the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting. Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of the different users. These systems are known as on-line analytical processing (OLAP) systems.
The major distinguishing features between OLTP and OLAP are summarized as follows (a small query-level sketch contrasting the two appears after the list):

Users and system orientation: An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals. An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts.

Data contents: An OLTP system manages current data that, typically, are too detailed to be easily used for decision making. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use in informed decision making.

Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design. An OLAP system typically adopts either a star or snowflake model (to be discussed in Section 3.2.2) and a subject-oriented database design.

View: An OLTP system focuses mainly on the current data within an enterprise or
department, without referring to historical data or data in different organizations.
In contrast, an OLAP system often spans multiple versions of a database schema,
due to the evolutionary process of an organization. OLAP systems also deal with
information that originates from different organizations, integrating information
from many data stores. Because of their huge volume, OLAP data are stored on
multiple storage media.
Access patterns: The access patterns of an OLTP system consist mainly of short,
atomic transactions. Such a system requires concurrency control and recovery
mechanisms. However, accesses to OLAP systems are mostly read-only operations
(because most data warehouses store historical rather than up-to-date information),
although many could be complex queries.
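
The access-pattern and granularity differences above can be seen in the shape of the queries themselves. The sketch below uses Python's built-in sqlite3 module and an invented sales table: a short, atomic OLTP-style update against current detailed data versus a read-only OLAP-style aggregation over history.

# Contrasting an OLTP-style transaction with an OLAP-style query.
# The schema and data are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (branch TEXT, item TEXT, year INT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("B1", "TV", 2022, 500.0), ("B1", "TV", 2023, 650.0),
    ("B2", "Radio", 2023, 120.0),
])

# OLTP: a short, atomic write touching current, detailed data.
with conn:
    conn.execute("UPDATE sales SET amount = amount + 50 "
                 "WHERE branch = 'B1' AND item = 'TV' AND year = 2023")

# OLAP: a read-only aggregation over historical data at a coarser
# granularity (per branch and year, not per transaction).
for row in conn.execute("SELECT branch, year, SUM(amount) "
                        "FROM sales GROUP BY branch, year"):
    print(row)
conn.close()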

Data Warehouse Usage or Applications:

Data warehouses and data marts are used in a wide range of applications.
Business executives use the data in data warehouses and data marts to perform data
analysis and make strategic decisions. In many firms, data warehouses are used as
an integral part of a plan-execute-assess “closed-loop” feedback system for
enterprise management. Data warehouses are used extensively in banking and
financial services, consumer goods and retail distribution sectors, and controlled manufacturing, such as demand-based production.
Typically, the longer a data warehouse has been in use, the more it will have
evolved. This evolution takes place throughout a number of phases. Initially, the data
warehouse is mainly used for generating reports and answering predefined queries.
Progressively, it is used to analyze summarized and detailed data, where the results
are presented in the form of reports and charts. Later, the data warehouse is used for
strategic purposes, performing multidimensional analysis and sophisticated slice-and-
dice operations. Finally, the data warehouse may be employed for knowledge
discovery and strategic decision making using data mining tools. In this context, the
tools for data warehousing can be categorized into access and retrieval tools,
database reporting tools, data analysis tools, and data mining tools.
Business users need to have the means to know what exists in the data
warehouse (through metadata), how to access the contents of the data warehouse,
how to examine the contents using analysis tools, and how to present the results of
such analysis.
There are three kinds of data warehouse applications: information processing, analytical processing, and data mining:

Information processing supports querying, basic statistical analysis, and reporting using cross tabs, tables, charts, or graphs. A current trend in data warehouse information processing is to construct low-cost Web-based accessing tools that are then integrated with Web browsers.

Analytical processing supports basic OLAP operations, including slice-and-dice, drill-down, roll-up, and pivoting. It generally operates on historical data in both summarized and detailed forms. The major strength of on-line analytical processing over information processing is the multidimensional data analysis of data warehouse data (see the sketch after these three items).

Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.
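
For the analytical processing item above, the following pandas sketch imitates roll-up and slice-and-dice on a small, invented sales table; pivot_table stands in for an OLAP engine here.

# Imitating OLAP roll-up and slicing with pandas on invented data.
import pandas as pd

sales = pd.DataFrame({
    "country": ["Canada", "Canada", "India", "India"],
    "year":    [2022, 2023, 2022, 2023],
    "item":    ["TV", "TV", "Radio", "TV"],
    "amount":  [500, 650, 120, 300],
})

# Roll-up: summarize amount by country and year (dropping item detail).
rollup = sales.pivot_table(values="amount", index="country",
                           columns="year", aggfunc="sum")
print(rollup)

# Slice: restrict the "cube" to a single value of one dimension.
print(sales[sales["item"] == "TV"])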

“How does data mining relate to information processing and on-line analytical
processing?” Information processing, based on queries, can find useful information.
However, answers to such queries reflect the information directly stored in databases
or computable by aggregate functions. They do not reflect sophisticated patterns or
regularities buried in the database. Therefore, information processing is not data
mining.

On-line analytical processing comes a step closer to data mining because it can derive information summarized at multiple granularities from user-specified subsets of a data warehouse. Such descriptions are equivalent to the class/concept descriptions discussed in Chapter 1. Because data mining systems can also mine generalized class/concept descriptions, this raises some interesting questions: “Do OLAP systems perform data mining? Are OLAP systems actually data mining systems?”

The functionalities of OLAP and data mining can be viewed as disjoint: OLAP is
a data summarization/aggregation tool that helps simplify data analysis, while data
mining allows the automated discovery of implicit patterns and interesting knowledge
hidden in large amounts of data. OLAP tools are targeted toward simplifying and
supporting interactive data analysis, whereas the goal of data mining tools is to
automate as much of the process as possible, while still allowing users to guide the
process. In this sense, data mining goes one step beyond traditional on-line analytical
processing.

An alternative and broader view of data mining may be adopted in which data
mining covers both data description and data modeling. Because OLAP systems can
present general descriptions of data from data warehouses, OLAP functions are
essentially for user-directed data summary and comparison (by drilling, pivoting,
slicing, dicing, and other operations). These are, though limited, data mining
functionalities. Yet according to this view, data mining covers a much broader
spectrum than simple OLAP operations because it performs not only data summary
and comparison but also association, classification, prediction, clustering, time-series
analysis, and other data analysis tasks.
Data mining is not confined to the analysis of data stored in data warehouses.
It may analyze data existing at more detailed granularities than the summarized data
provided in a data warehouse. It may also analyze transactional, spatial, textual, and
multimedia data that are difficult to model with current multidimensional database
technology. In this context, data mining covers a broader spectrum than OLAP with
respect to data mining functionality and the complexity of the data handled.

Because data mining involves more automated and deeper analysis than
OLAP, data mining is expected to have broader applications. Data mining can help
business managers find and reach more suitable customers, as well as gain critical
business insights that may help drive market share and raise profits. In addition, data
mining can help managers understand customer group characteristics and develop
optimal pricing strategies accordingly, correct item bundling based not on intuition
but on actual item groups derived from customer purchase patterns, reduce
promotional spending, and at the same time increase the overall net effectiveness of
promotions.

Architecture for On-Line Analytical Mining

An OLAM server performs analytical mining in data cubes in a similar manner as an OLAP server performs on-line analytical processing. An integrated OLAM and OLAP architecture is shown in Figure 3.18, where the OLAM and OLAP servers both accept user on-line queries (or commands) via a graphical user interface API and work with the data cube in the data analysis via a cube API. A metadata directory is used to guide the access of the data cube. The data cube can be constructed by accessing and/or integrating multiple databases via an MDDB API and/or by filtering a data warehouse via a database API that may support OLE DB or ODBC connections. Since an OLAM server may perform multiple data mining tasks, such as concept description, association, classification, prediction, clustering, time-series analysis, and so on, it usually consists of multiple integrated data mining modules and is more sophisticated than an OLAP server.
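
The cube API mentioned above can be pictured as a small interface that both an OLAP front end and an OLAM server would call. The class and method names below are assumptions used only to sketch the division of labour between cube operations and mining modules.

# Hypothetical cube API shared by OLAP browsing and OLAM mining.
class DataCube:
    def __init__(self, cells):
        self.cells = cells                      # (dims tuple) -> measure

    def roll_up(self, dim_index):
        """Aggregate away one dimension of the cube."""
        out = {}
        for dims, value in self.cells.items():
            key = dims[:dim_index] + dims[dim_index + 1:]
            out[key] = out.get(key, 0) + value
        return DataCube(out)

class OLAMServer:
    def mine_large_cells(self, cube, threshold):
        """A stand-in mining module: report unusually large cells."""
        return {d: v for d, v in cube.cells.items() if v >= threshold}

# Usage: cells keyed by (item, branch); roll up over branch, then mine.
cube = DataCube({("TV", "B1"): 500, ("TV", "B2"): 700, ("Radio", "B1"): 120})
by_item = cube.roll_up(1)
print(OLAMServer().mine_large_cells(by_item, threshold=1000))  # {('TV',): 1200}
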
Chapter 4 describes data warehouses on a finer level by exploring implementation
issues such as data cube computation, OLAP query answering strategies, and
methods of generalization. The chapters following it are devoted to the study of data
mining techniques. As we have seen, the introduction to data warehousing and OLAP
technology presented in this chapter is essential to our study of data mining. This is
because data warehousing provides users with large amounts of clean, organized,
and summarized data, which greatly facilitates data mining. For example, rather than
storing the details of each sales transaction, a data warehouse may store a summary
of the transactions per item type for each branch or, summarized to a higher level,
for each country. The capability of OLAP to provide multiple and dynamic views of
summarized data in a data warehouse sets a solid foundation for successful data
mining.
Moreover, we also believe that data mining should be a human-centered process. Rather than asking a data mining system to generate patterns and knowledge automatically, a user will often need to interact with the system to perform exploratory data analysis. OLAP sets a good example for interactive data analysis and provides the necessary preparations for exploratory data mining.
Consider the discovery of association patterns, for example. Instead of mining
associations at a primitive (i.e., low) data level among transactions, users should be
allowed to specify roll-up operations along any dimension. For example, a user may
like to roll up on the item dimension to go from viewing the data for particular TV sets
that were purchased to viewing the brands of these TVs, such as SONY or Panasonic.
Users may also navigate from the transaction level to the customer level or
customer-type level in the search for interesting associations. Such an OLAP style of
data mining is characteristic of OLAP mining. In our study of the principles of data
mining in this book, we place particular emphasis on OLAP mining, that is, on the
integration of data mining and OLAP technology.
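
As a small illustration of the roll-up-then-mine idea in this paragraph, the sketch below maps individual TV models to brands before counting which brands are bought together. The transactions, model names, and brand map are made up for the example.

# Rolling transactions up from item level to brand level before
# looking for co-occurring brands. All data here is invented.
from itertools import combinations
from collections import Counter

brand_of = {"KD-55X80K": "SONY", "TH-43LX650": "Panasonic",
            "QN55Q60B": "Samsung"}                 # item -> brand

transactions = [
    {"KD-55X80K", "TH-43LX650"},
    {"KD-55X80K", "QN55Q60B"},
    {"KD-55X80K", "TH-43LX650"},
]

# Roll-up along the item dimension: replace items by their brands.
brand_baskets = [{brand_of[item] for item in t} for t in transactions]

# Count brand pairs that appear together (a crude association scan).
pair_counts = Counter(
    pair for basket in brand_baskets
    for pair in combinations(sorted(basket), 2)
)
print(pair_counts.most_common(1))  # e.g. [(('Panasonic', 'SONY'), 2)]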
