Unit 1
1. The bottom tier is a warehouse database server that is almost always a relational
database system. Back-end tools and utilities are used to feed data into the bottom
tier from operational databases or other external sources (such as customer profile
information provided by external consultants). These tools and utilities perform data
extraction, cleaning, and transformation (e.g., to merge similar data from different
sources into a unified format), as well as load and refresh functions to update the
data warehouse. The data are extracted using application program
interfaces known as gateways. A gateway is supported by the underlying DBMS and
allows client programs to generate SQL code to be executed at a server. Examples
of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking
and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).
This tier also contains a metadata repository, which stores information about
the data warehouse and its contents.
2. The middle tier is an OLAP server that is typically implemented using either
(1) a relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps
operations on multidimensional data to standard relational operations; or
(2) a multidimensional OLAP (MOLAP) model, that is, a special-purpose server
that directly implements multidimensional data and operations.
3. The top tier is a front-end client layer, which contains query and reporting tools,
analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
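The bottom and middle tiers can be illustrated with a minimal sketch. It assumes an in-memory SQLite database standing in for the warehouse server, and the source rows, the `sales` table, and the function names are all hypothetical. The sketch extracts rows from two sources, cleans them into a unified format, loads them, and then expresses a multidimensional roll-up as a standard relational `GROUP BY`, which is the idea behind the ROLAP model.

```python
import sqlite3

# Hypothetical operational records from two sources with inconsistent formats:
# (date, region, item, amount-as-text)
source_a = [("2024-01-05", "north", "laptop", "1200.50"),
            ("2024-01-06", "South", "phone", "650")]
source_b = [("2024-02-01", " NORTH ", "phone", "700.00"),
            ("2024-02-03", "south", "laptop", None)]  # missing amount


def clean(row):
    """Return a unified (month, region, item, amount) tuple, or None if unusable."""
    date, region, item, amount = row
    if amount is None:
        return None  # cleaning step: drop records without a measure
    return (date[:7], region.strip().lower(), item.strip().lower(), float(amount))


def load_warehouse(rows):
    """Load cleaned rows into an in-memory table that stands in for the warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (month TEXT, region TEXT, item TEXT, amount REAL)"
    )
    cleaned = [r for r in map(clean, rows) if r is not None]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", cleaned)
    conn.commit()
    return conn


def roll_up_by_region(conn):
    """ROLAP-style roll-up: a multidimensional aggregation written as GROUP BY."""
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return cur.fetchall()


conn = load_warehouse(source_a + source_b)
totals = roll_up_by_region(conn)
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

A MOLAP server would instead keep the aggregated cells in a dedicated multidimensional structure rather than computing them through SQL at query time.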
KDD:
Simply stated, data mining refers to extracting or “mining” knowledge from large
amounts of data. The term is actually a misnomer. Remember that the mining of gold
from rocks or sand is referred to as gold mining rather than rock or sand mining.
Thus, data mining should have been more appropriately named “knowledge mining from
data,” which is unfortunately somewhat long. “Knowledge mining,” a shorter term, may
not reflect the emphasis on mining from large amounts of data. Nevertheless, mining
is a vivid term characterizing the process that finds a small set of precious
nuggets from a great deal of raw material (Figure 1.3). Thus, such a misnomer that
carries both “data” and “mining” became a popular choice. Many other terms carry a
similar or slightly different meaning to data mining, such as knowledge mining from
data, knowledge extraction, data/pattern analysis, data archaeology, and data
dredging. Many people treat data mining as a synonym for another popularly used
term, Knowledge Discovery from Data, or KDD. Alternatively, others view data mining
as simply an essential step in the process of knowledge discovery. Knowledge
discovery as a process is depicted in Figure 1.4 and consists of an iterative
sequence of steps.
Knowledge base: This is the domain knowledge that is used to guide the search or
evaluate the interestingness of resulting patterns. Such knowledge can include
concept hierarchies, used to organize attributes or attribute values into different
levels of abstraction. Knowledge such as user beliefs, which can be used to assess a
pattern’s interestingness based on its unexpectedness, may also be included. Other
examples of domain knowledge are additional interestingness constraints or
thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).
Data mining engine: This is essential to the data mining system and ideally
consists of a set of functional modules for tasks such as characterization,
association and correlation analysis, classification, prediction, cluster analysis,
outlier analysis, and evolution analysis.
User interface: This module communicates between users and the data mining system,
allowing the user to interact with the system by specifying a data mining query or
task, providing information to help focus the search, and performing exploratory data
mining based on the intermediate data mining results. In addition, this component
allows the user to browse database and data warehouse schemas or data structures,
evaluate mined patterns, and visualize the patterns in different forms.
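A concept hierarchy, as mentioned under the knowledge base, can be sketched as a chain of lookups that generalize an attribute value to a higher level of abstraction. The city/country/continent hierarchy, the sample data, and the function names below are illustrative assumptions, not part of any particular system.

```python
from collections import Counter

# Hypothetical concept hierarchy for a location attribute:
# city -> country -> continent
CITY_TO_COUNTRY = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Chicago": "USA",
    "Lyon": "France",
}
COUNTRY_TO_CONTINENT = {
    "Canada": "North America",
    "USA": "North America",
    "France": "Europe",
}


def generalize(city, level):
    """Map a city to the requested abstraction level of the hierarchy."""
    if level == "city":
        return city
    country = CITY_TO_COUNTRY[city]
    if level == "country":
        return country
    if level == "continent":
        return COUNTRY_TO_CONTINENT[country]
    raise ValueError(f"unknown level: {level}")


def count_at_level(cities, level):
    """Summarize observations at a chosen level of abstraction."""
    return Counter(generalize(c, level) for c in cities)


observations = ["Vancouver", "Toronto", "Chicago", "Lyon", "Vancouver"]
by_country = count_at_level(observations, "country")
by_continent = count_at_level(observations, "continent")
```

Moving from `"country"` to `"continent"` merges groups, so the same data can be examined at a coarser level without changing the underlying records.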
Data contents: An OLTP system manages current data that, typically, are too
detailed to be easily used for decision making. An OLAP system manages large amounts
of historical data, provides facilities for summarization and aggregation, and stores
and manages information at different levels of granularity. These features make the
data easier to use in informed decision making.
View: An OLTP system focuses mainly on the current data within an enterprise or
department, without referring to historical data or data in different organizations.
In contrast, an OLAP system often spans multiple versions of a database schema,
due to the evolutionary process of an organization. OLAP systems also deal with
information that originates from different organizations, integrating information
from many data stores. Because of their huge volume, OLAP data are stored on
multiple storage media.
Access patterns: The access patterns of an OLTP system consist mainly of short,
atomic transactions. Such a system requires concurrency control and recovery
mechanisms. However, accesses to OLAP systems are mostly read-only operations
(because most data warehouses store historical rather than up-to-date information),
although many could be complex queries.
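The contrast in access patterns can be made concrete with a small sketch. It assumes an in-memory SQLite database, and the `accounts` and `transfers` tables and the `transfer` helper are hypothetical. The OLTP-style operation is a short atomic transaction that either fully commits or fully rolls back, while the OLAP-style access is a read-only aggregate query over accumulated history.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance REAL NOT NULL CHECK (balance >= 0))"
)
conn.execute("CREATE TABLE transfers (year INTEGER, src INTEGER, dst INTEGER, amount REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()


def transfer(conn, src, dst, amount, year):
    """OLTP-style access: a short atomic transaction (all or nothing)."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "INSERT INTO transfers VALUES (?, ?, ?, ?)", (year, src, dst, amount)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; nothing was changed.
        return False


ok_small = transfer(conn, 1, 2, 30.0, 2023)
ok_overdraft = transfer(conn, 1, 2, 500.0, 2024)

# OLAP-style access: a read-only summary over historical records.
yearly = conn.execute(
    "SELECT year, COUNT(*), SUM(amount) FROM transfers GROUP BY year ORDER BY year"
).fetchall()
balances = conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall()
```

The failed overdraft leaves both balances and the transfer history untouched, which is the recovery behavior an OLTP system depends on. The aggregate query only reads data.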
Data warehouses and data marts are used in a wide range of applications.
Business executives use the data in data warehouses and data marts to perform data
analysis and make strategic decisions. In many firms, data warehouses are used as
an integral part of a plan-execute-assess “closed-loop” feedback system for
enterprise management. Data warehouses are used extensively in banking and
financial services, consumer goods and retail distribution sectors, and controlled
manufacturing, such as demand-based production.
Typically, the longer a data warehouse has been in use, the more it will have
evolved. This evolution takes place throughout a number of phases. Initially, the data
warehouse is mainly used for generating reports and answering predefined queries.
Progressively, it is used to analyze summarized and detailed data, where the results
are presented in the form of reports and charts. Later, the data warehouse is used for
strategic purposes, performing multidimensional analysis and sophisticated slice-and-
dice operations. Finally, the data warehouse may be employed for knowledge
discovery and strategic decision making using data mining tools. In this context, the
tools for data warehousing can be categorized into access and retrieval tools,
database reporting tools, data analysis tools, and data mining tools.
Business users need to have the means to know what exists in the data
warehouse (through metadata), how to access the contents of the data warehouse,
how to examine the contents using analysis tools, and how to present the results of
such analysis.
There are three kinds of data warehouse applications: information processing,
analytical processing, and data mining.
“How does data mining relate to information processing and on-line analytical
processing?” Information processing, based on queries, can find useful information.
However, answers to such queries reflect the information directly stored in databases
or computable by aggregate functions. They do not reflect sophisticated patterns or
regularities buried in the database. Therefore, information processing is not data
mining.
The functionalities of OLAP and data mining can be viewed as disjoint: OLAP is
a data summarization/aggregation tool that helps simplify data analysis, while data
mining allows the automated discovery of implicit patterns and interesting knowledge
hidden in large amounts of data. OLAP tools are targeted toward simplifying and
supporting interactive data analysis, whereas the goal of data mining tools is to
automate as much of the process as possible, while still allowing users to guide the
process. In this sense, data mining goes one step beyond traditional on-line analytical
processing.
An alternative and broader view of data mining may be adopted in which data
mining covers both data description and data modeling. Because OLAP systems can
present general descriptions of data from data warehouses, OLAP functions are
essentially for user-directed data summary and comparison (by drilling, pivoting,
slicing, dicing, and other operations). These are, though limited, data mining
functionalities. Yet according to this view, data mining covers a much broader
spectrum than simple OLAP operations because it performs not only data summary
and comparison but also association, classification, prediction, clustering, time-series
analysis, and other data analysis tasks.
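The user-directed OLAP operations named above (slicing, dicing, rolling up, pivoting) can be sketched over a tiny fact table in plain Python. The fact rows, dimension names, and helper functions are hypothetical and chosen only for illustration. A real OLAP server would operate on a data cube, not a list of tuples.

```python
from collections import defaultdict

# Hypothetical fact table: (quarter, city, item, units_sold)
FACTS = [
    ("Q1", "Vancouver", "phone", 10),
    ("Q1", "Toronto", "phone", 7),
    ("Q1", "Vancouver", "laptop", 4),
    ("Q2", "Vancouver", "phone", 12),
    ("Q2", "Toronto", "laptop", 5),
]
DIMS = {"quarter": 0, "city": 1, "item": 2}  # dimension name -> tuple position


def slice_(facts, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS[dim]
    return [f for f in facts if f[i] == value]


def dice(facts, **allowed):
    """Dice: keep facts whose values fall in allowed sets on several dimensions."""
    return [
        f for f in facts
        if all(f[DIMS[d]] in values for d, values in allowed.items())
    ]


def roll_up(facts, *dims):
    """Roll-up: sum the measure, keeping only the listed dimensions."""
    totals = defaultdict(int)
    for f in facts:
        key = tuple(f[DIMS[d]] for d in dims)
        totals[key] += f[3]
    return dict(totals)


def pivot(facts, row_dim, col_dim):
    """Pivot: arrange aggregated values as a row-by-column table."""
    table = defaultdict(dict)
    for (row, col), value in roll_up(facts, row_dim, col_dim).items():
        table[row][col] = value
    return dict(table)


q1_only = slice_(FACTS, "quarter", "Q1")
vancouver_phones = dice(FACTS, city={"Vancouver"}, item={"phone"})
by_quarter = roll_up(FACTS, "quarter")
city_by_item = pivot(FACTS, "city", "item")
```

Each function only summarizes or filters what the user asks for. None of them discovers a pattern on its own, which is the sense in which these operations are limited data mining functionalities.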
Data mining is not confined to the analysis of data stored in data warehouses.
It may analyze data existing at more detailed granularities than the summarized data
provided in a data warehouse. It may also analyze transactional, spatial, textual, and
multimedia data that are difficult to model with current multidimensional database
technology. In this context, data mining covers a broader spectrum than OLAP with
respect to data mining functionality and the complexity of the data handled.
Because data mining involves more automated and deeper analysis than
OLAP, data mining is expected to have broader applications. Data mining can help
business managers find and reach more suitable customers, as well as gain critical
business insights that may help drive market share and raise profits. In addition, data
mining can help managers understand customer group characteristics and develop
optimal pricing strategies accordingly, correct item bundling based not on intuition
but on actual item groups derived from customer purchase patterns, reduce
promotional spending, and at the same time increase the overall net effectiveness of
promotions.
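As a rough illustration of deriving item groups from purchase patterns rather than intuition, the sketch below counts how often pairs of items appear together in the same transaction and keeps the pairs that meet a minimum count. The transactions, the threshold, and the function name are made up for the example, and simple pair counting is a stand-in for fuller association analysis, not a complete mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical customer transactions (each is a set of purchased items).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def frequent_pairs(transactions, min_count):
    """Return item pairs bought together in at least min_count transactions."""
    counts = Counter()
    for basket in transactions:
        # Sort so each pair has one canonical (alphabetical) representation.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}


bundles = frequent_pairs(TRANSACTIONS, min_count=3)
all_pairs = frequent_pairs(TRANSACTIONS, min_count=1)
```

With this sample data only bread and milk co-occur often enough to pass the threshold, so they would be the candidate bundle. Lowering `min_count` admits weaker pairings.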