SKP Engineering College
A Course Material
on
Data Warehousing and Data Mining
By
K.Vijayakumar
Assistant Professor
Computer Science and Engineering Department
Quality Certificate
Year/Sem: III/VI
This course material has been prepared by me and meets the knowledge requirements of the University curriculum.
Name: K.Vijayakumar
This is to certify that the course material prepared by Mr. K. Vijayakumar is of adequate
quality. He has referred to more than five books, one of which is by a foreign author.
Seal: Seal:
Reporting and Query tools and Applications – Tool Categories – The Need for Applications
– Cognos Impromptu – Online Analytical Processing (OLAP) – Need – Multidimensional
Data Model – OLAP Guidelines – Multidimensional versus Multirelational OLAP –
Categories of Tools – OLAP Tools and the Internet.
OUTCOMES: After completing this course, the student will be able to:
TEXT BOOKS:
1. Alex Berson and Stephen J.Smith, “Data Warehousing, Data Mining and OLAP”, Tata
McGraw – Hill Edition, Thirteenth Reprint 2008.
2. Jiawei Han and Micheline Kamber, “Data Mining Concepts and Techniques”, Third
Edition, Elsevier, 2012.
REFERENCES:
1. Pang-Ning Tan, Michael Steinbach and Vipin Kumar, “Introduction to Data Mining”,
Pearson Education, 2007.
2. K.P. Soman, Shyam Diwakar and V. Ajay, “Insight into Data Mining Theory and Practice”,
Eastern Economy Edition, Prentice Hall of India, 2006.
3. G. K. Gupta, “Introduction to Data Mining with Case Studies”, Eastern Economy Edition,
Prentice Hall of India, 2006.
4. Daniel T.Larose, “Data Mining Methods and Models”, Wiley-Interscience, 2006.
CONTENTS
1 Unit – I
2 Unit – II
3 Unit – III
4 Unit – IV
5 Unit – V
Unit – I
Part – A
Facts are numerical measures. Facts can also be considered as the quantities by which we
analyze the relationships between dimensions.
Dimensions are the entities (or) perspectives with respect to an organization for keeping
records and are hierarchical in nature.
A dimension table is used for describing the dimension. (e.g.) A dimension table for item
may contain the attributes item_ name, brand and type.
Fact table contains the name of facts (or) measures as well as keys to each of the related
dimensional tables.
In data warehousing research literature, a cube is also referred to as a cuboid. For different
subsets of dimensions, we can construct a lattice of cuboids, each showing the data at a
different level of summarization. The lattice of cuboids is also referred to as a data cube.
The 0-D cuboid which holds the highest level of summarization is called the apex cuboid.
The apex cuboid is typically denoted by all.
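As a hedged illustration (the sales table, its columns and the measure below are assumed for the example, not taken from the text), the lattice of cuboids can be produced with the SQL CUBE operator, whose support varies by DBMS; the grouping over no dimensions corresponds to the apex cuboid:

-- Hypothetical sales table with two dimensions (item, location) and one measure (amount).
-- GROUP BY CUBE generates every cuboid in the lattice: (item, location), (item),
-- (location), and the 0-D apex cuboid, i.e. the grand total over "all".
SELECT item, location, SUM(amount) AS total_sales
FROM sales
GROUP BY CUBE (item, location);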
• A large central table (fact table) containing the bulk of data with no redundancy.
• A set of smaller attendant tables (dimension tables), one for each dimension.
18.Point out the major difference between the star schema and the snowflake
schema? [CO1-L2]
The dimension table of the snowflake schema model may be kept in normalized form to
reduce redundancies. Such a table is easy to maintain and saves storage space.
19.Which is popular in the data warehouse design, star schema model (or) snowflake
schema model? [CO1-L2]
The star schema model, because the snowflake structure can reduce the effectiveness of
browsing, since more joins are needed to execute a query.
PART – B
1.1 Architecture
The data warehouse must be capable of holding and managing large volumes of data as
well as different data structures over time.
This is the central part of the data warehousing environment (item number 2 in the
architecture diagram). It is implemented based on RDBMS technology.
The data sourcing, cleanup and transformation tools are item number 1 in the architecture
diagram. They perform conversions, summarization, key changes, structural changes and
condensation. The data transformation is required so that the information can be used by
decision support tools.
The transformation produces programs, control statements, JCL code, COBOL code, UNIX
scripts, and SQL DDL code etc., to move the data into data warehouse from multiple
operational systems.
Database heterogeneity: It refers to the differing nature of the DBMSs involved; they may use
different data models and access languages, and may differ in data navigation methods,
operations, concurrency, integrity and recovery processes.
Data heterogeneity: It refers to the different ways the data is defined and used in different
models. Vendors offering tools in this area include Prism Solutions, Evolutionary Technology Inc., Vality, Praxis and Carleton.
It is data about data. It is used for maintaining, managing and using the data warehouse. It
is classified into two: technical metadata and business metadata.
Business metadata contains information that gives users an understanding of the information
stored in the data warehouse. It includes:
Subject areas and information object types, including queries, reports, images, video and audio
clips, etc.
Internet home pages
Info related to info delivery system
Data warehouse operational info such as ownerships, audit trails etc.,
Metadata helps the users to understand the content and find the data. Metadata is stored in
a separate data store known as the information directory or metadata repository, which
helps to integrate, maintain and view the contents of the data warehouse.
Its purpose is to provide info to business users for decision making. There are five
categories:
Data query and reporting tools
Application development tools
Executive info system tools (EIS)
OLAP tools
Data mining tools
Query and reporting tools are used to generate query and report. There are two types of
reporting tools. They are:
Production reporting tool used to generate regular operational reports
Desktop report writer are inexpensive desktop tools designed for end users.
Managed query tools: used to generate SQL queries. They use metalayer software between
users and the database, which offers point-and-click creation of SQL statements. These tools are a
preferred choice of users for performing segment identification, demographic analysis, territory
management, preparation of customer mailing lists, etc.
OLAP tools: used to analyze the data in multidimensional and complex views. To enable
multidimensional properties they use MDDBs and MRDBs, where MDDB refers to a
multidimensional database and MRDB refers to a multirelational database.
Data mining tools: used to discover knowledge from the data warehouse data; they can also
be used for data visualization and data correction purposes.
Data marts are departmental subsets that focus on selected subjects. They are independent and
used by a dedicated user group. They are used for rapid delivery of enhanced decision support
functionality to end users. A data mart is used in the following situations:
Extremely urgent user requirement
The absence of a budget for a full scale data warehouse strategy
The decentralization of business needs
The attraction of easy-to-use tools and a mid-sized project
There are two factors that drive you to build and use data warehouse. They are:
Business factors:
Business users want to make decision quickly and correctly using all available data.
Technological factors:
To address the incompatibility of operational data stores
IT infrastructure is changing rapidly. Its capacity is increasing and cost is decreasing
so that building a data warehouse is easy
There are several things to be considered while building a successful data warehouse
In the top down approach suggested by Bill Inmon, we build a centralized storage area to
house corporate wide business data. This repository (storage area) is called Enterprise
Data Warehouse (EDW). The data in the EDW is stored in a normalized form in order to
avoid redundancy.
The central repository for corporate-wide data helps us maintain one version of the truth for
the data.
The data in the EDW is stored at the most detail level. The reason to build the EDW on the
most detail level is to leverage
1. Flexibility to be used by multiple departments.
2. Flexibility to provide for future requirements.
The disadvantages of storing data at the detail level are
1. The complexity of design increases with increasing level of detail.
2. It takes large amount of space to store data at detail level, hence increased cost.
Implement the top-down approach when
1. The business has complete clarity on all or multiple subject areas' data warehouse
requirements.
2. The business is ready to invest considerable time and money.
The advantage of using the Top Down approach is that we build a centralized repository to
provide for one version of truth for business data. This is very important for the data to be
reliable, consistent across subject areas and for reconciliation in case of data related
contention between subject areas.
The disadvantage of using the Top Down approach is that it requires more time and initial
investment. The business has to wait for the EDW to be implemented, followed by the building of
the data marts, before they can access their reports.
Bottom Up Approach
A conformed dimension has consistent dimension keys, consistent attribute names and
consistent values across separate data marts. The conformed dimension means the exact
same thing with every fact table it is joined to.
A Conformed fact has the same definition of measures, same dimensions joined to it and at
the same granularity across data marts.
The bottom up approach helps us incrementally build the warehouse by developing and
integrating data marts as and when the requirements are clear. We do not have to wait until
the overall requirements of the warehouse are known. We should implement the bottom up
approach when
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear. We have clarity to only one
data mart.
The advantage of using the Bottom Up approach is that they do not require high initial costs
and have a faster implementation time; hence the business can start using the marts much
earlier as compared to the top-down approach.
The disadvantage of using the Bottom Up approach is that it stores data in denormalized
format, hence there is higher space usage for detailed data. There is also a tendency not to keep
detailed data in this approach, hence losing the advantage of having detail data, i.e. the
flexibility to easily cater to future requirements.
Most successful data warehouses that meet these requirements have these common
characteristics:
Are based on a dimensional model
Contain historical and current data
Include both detailed and summarized data
Consolidate disparate data from multiple sources while retaining consistency
The data warehouse design approach must be a business-driven, continuous and iterative
engineering approach. In addition to the general considerations, there are the following specific
points relevant to the data warehouse design:
Data content
The content and structure of the data warehouse are reflected in its data model. The data
model is the template that describes how information will be organized within the integrated
warehouse framework. The data warehouse data must be a detailed data. It must be
formatted, cleaned up and transformed to fit the warehouse data model.
Meta data
It defines the location and contents of data in the warehouse. Meta data is searchable by
users to find definitions or subject areas. In other words, it must provide decision support
oriented pointers to warehouse data and thus provides a logical link between warehouse
data and decision support applications.
Data distribution
One of the biggest challenges when designing a data warehouse is the data placement and
distribution strategy. Data volumes continue to grow. Therefore, it becomes
necessary to know how the data should be divided across multiple servers and which users
should get access to which types of data. The data can be distributed based on the subject
area, location (geographical region), or time (current, month, year).
Tools
A number of tools are available that are specifically designed to help in the implementation
of the data warehouse. All selected tools must be compatible with the given data
warehouse environment and with each other. All tools must be able to use a common Meta
data repository.
Design steps
Data warehouse implementation relies on selecting suitable data access tools. The best
way to choose these is based on the type of data that can be selected using the tool and the kind
of access it permits for a particular user. The following lists the various types of data that can
be accessed:
Simple tabular form data
Ranking data
Multivariable data
Time series data
Graphing, charting and pivoting data
Complex textual search data
Statistical analysis data
Data for testing of hypothesis, trends and patterns
Predefined repeatable queries
Ad hoc user specified queries
Reporting and analysis data
Complex queries with multiple joins, multi level sub queries and sophisticated search
criteria
Proper attention must be paid to data extraction, which represents a success factor for a
data warehouse architecture. When implementing a data warehouse, the following
selection criteria that affect the ability to transform, consolidate, integrate and repair the
data should be considered:
Timeliness of data delivery to the warehouse
The tool must have the ability to identify the particular data that can be read by the
conversion tool
The tool must support flat files and indexed files, since much corporate data is still stored in these formats
The tool must have the capability to merge data from multiple data stores
The tool should have specification interface to indicate the data to be extracted
The tool should have the ability to read data from data dictionary
The code generated by the tool should be completely maintainable
The tool should permit the user to extract the required data
The tool must have the facility to perform data type and character set translation
The tool must have the capability to create summarization, aggregation and
derivation of records
The data warehouse database system must be able to load data directly
from these tools
– As a data warehouse grows, there are at least two options for data placement. One
is to move some of the data in the data warehouse onto another storage medium.
– The second option is to distribute the data in the data warehouse across multiple
servers.
The users of data warehouse data can be classified on the basis of their skill level in
accessing the warehouse. There are three classes of users:
Casual users: are most comfortable retrieving information from the warehouse in predefined formats
and running preexisting queries and reports. These users do not need tools that allow for
building standard and ad hoc reports.
Power Users: can use pre defined as well as user defined queries to create simple and ad
hoc reports. These users can engage in drill down operations. These users may have the
experience of using reporting and query tools.
Expert users: These users tend to create their own complex queries and perform standard
analysis on the info they retrieve. These users have the knowledge about the use of query
and report tools
– Discovery of info
– Sharing the analysis
The functions of data warehouse are based on the relational data base technology. The
relational data base technology is implemented in parallel manner. There are two
advantages of having parallel relational data base technology for data warehouse:
Linear speed-up: refers to the ability to increase the number of processors in order to reduce
response time proportionally (see the sketch below).
Linear scale-up: refers to the ability to provide the same performance on the same requests
as the database size increases.
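As a hedged sketch in standard notation (the symbols are assumptions for illustration, not from the source), let T(p, D) be the response time with p processors on a database of size D. Then:

\[
\mathrm{Speedup}(n) = \frac{T(1, D)}{T(n, D)} \approx n \quad \text{(linear speed-up)}, \qquad
\mathrm{Scaleup}(n) = \frac{T(1, D)}{T(n, nD)} \approx 1 \quad \text{(linear scale-up)}.
\]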
Inter query Parallelism: In which different server threads or processes handle multiple
requests at the same time.
Intra query Parallelism: This form of parallelism decomposes the serial SQL query into
lower level operations such as scan, join, sort etc. Then these lower level operations are
executed concurrently in parallel.
Horizontal parallelism: which means that the data base is partitioned across multiple disks
and parallel processing occurs within a specific task that is performed concurrently on
different processors against different set of data
Vertical parallelism: This occurs among different tasks. All query components such as scan,
join, sort etc are executed in parallel in a pipelined fashion. In other words, an output from
one task becomes an input into another task.
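As a minimal, hedged sketch of intra-query parallelism (Oracle-style PARALLEL hint syntax; the sales table and region column are hypothetical), the optimizer is asked to split the scan and aggregation across several server processes:

-- Hypothetical example: request a degree of parallelism of 4 for the scan/aggregation.
SELECT /*+ PARALLEL(s, 4) */
       s.region, SUM(s.amount) AS total_amount
FROM sales s
GROUP BY s.region;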
Data partitioning is the key component for effective parallel execution of data base
operations. Partition can be done randomly or intelligently.
Random partitioning includes random data striping across multiple disks on a single server.
Another option for random partitioning is round-robin partitioning, in which each
record is placed on the next disk assigned to the database.
Intelligent partitioning assumes that DBMS knows where a specific record is located and
does not waste time searching for it across all disks. The various intelligent partitioning
include:
Hash partitioning: A hash algorithm is used to calculate the partition number based on the
value of the partitioning key for each row
Key range partitioning: Rows are placed and located in the partitions according to the value
of the partitioning key. That is all the rows with the key value from A to K are in partition 1, L
to T are in partition 2 and so on.
Schema partitioning: an entire table is placed on one disk; another table is placed on a
different disk, etc. This is useful for small reference tables.
User-defined partitioning: It allows a table to be partitioned on the basis of a user-defined
expression.
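A hedged sketch of key range and hash partitioning in Oracle-style DDL (table, column and partition names are hypothetical; exact syntax differs between DBMSs):

-- Key range partitioning: rows are placed by the value of the partitioning key.
CREATE TABLE sales_range (
  cust_name VARCHAR2(40),
  amount    NUMBER
)
PARTITION BY RANGE (cust_name) (
  PARTITION p1 VALUES LESS THAN ('L'),        -- roughly keys A to K
  PARTITION p2 VALUES LESS THAN ('U'),        -- roughly keys L to T
  PARTITION p3 VALUES LESS THAN (MAXVALUE)    -- everything else
);

-- Hash partitioning: a hash of the key determines the partition number.
CREATE TABLE sales_hash (
  cust_id NUMBER,
  amount  NUMBER
)
PARTITION BY HASH (cust_id) PARTITIONS 4;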
There are three DBMS software architecture styles for parallel processing:
1. Shared memory or shared everything Architecture
2. Shared disk architecture
3. Shared nothing architecture
Tightly coupled shared memory systems have the following
characteristics:
Multiple PUs share memory.
Each PU has full access to all shared memory through a common bus.
Communication between nodes occurs via shared memory.
Performance is limited by the bandwidth of the memory bus.
Symmetric multiprocessor (SMP) machines are often nodes in a cluster. Multiple SMP
nodes can be used with Oracle Parallel Server in a tightly coupled system, where memory
is shared among the multiple PUs, and is accessible by all the PUs through a memory bus.
Examples of tightly coupled systems include the Pyramid, Sequent, and Sun SparcServer.
Performance is potentially limited in a tightly coupled system by a number of factors. These
include various system components such as the memory bandwidth, PU to PU
communication bandwidth, the memory available on the system, the I/O bandwidth, and the
bandwidth of the common bus.
Shared disk systems are typically loosely coupled. Such systems have the following
characteristics:
Each node consists of one or more PUs and associated memory.
Memory is not shared between nodes.
Communication occurs over a common high-speed bus.
Each node has access to the same disks and other resources.
A node can be an SMP if the hardware supports it.
Bandwidth of the high-speed bus limits the number of nodes (scalability) of the
system.
Such a cluster is composed of multiple tightly coupled nodes. A Distributed Lock Manager
(DLM) is required. Examples of loosely coupled systems are VAXclusters or Sun clusters.
Since the memory is not shared among the nodes, each node has its own data cache.
Cache consistency must be maintained across the nodes and a lock manager is needed to
maintain the consistency. Additionally, instance locks using the DLM on the Oracle level
must be maintained to ensure that all nodes in the cluster see identical data.
There is additional overhead in maintaining the locks and ensuring that the data caches are
consistent. The performance impact is dependent on the hardware and software
components, such as the bandwidth of the high-speed bus through which the nodes
communicate, and DLM performance.
Shared nothing systems are typically loosely coupled. In shared nothing systems only one
CPU is connected to a given disk. If a table or database is located on that disk, access
depends entirely on the PU which owns it. Shared nothing systems can be represented as
follows:
Shared nothing systems are concerned with access to disks, not access to memory.
Nonetheless, adding more PUs and disks can improve scaleup. Oracle Parallel Server can
access the disks on a shared nothing system as long as the operating system provides
transparent disk access, but this access is expensive in terms of latency.
Shared nothing systems have advantages and disadvantages for parallel processing:
Advantages
Shared nothing systems provide for incremental growth.
System growth is practically unlimited.
MPPs are good for read-only databases and decision support applications.
Failure is local: if one node fails, the others stay up.
Disadvantages
More coordination is required.
More overhead is required for a process working on a disk belonging to another
node.
If there is a heavy workload of updates or inserts, as in an online transaction
processing system, it may be worthwhile to consider data-dependent routing to
alleviate contention.
DBMS management tools help to configure, tune, administer and monitor a parallel
RDBMS as effectively as if it were a serial RDBMS.
Price/performance: The parallel RDBMS can demonstrate a near-linear speed-up
and scale-up at reasonable costs.
4). Describe the DBMS schemas for decision support. [CO1-H1]
The basic concepts of dimensional modeling are: facts, dimensions and measures. A fact is
a collection of related data items, consisting of measures and context data. It typically
represents business items or business transactions. A dimension is a collection of data that
describe one business dimension. Dimensions determine the contextual background for the
facts; they are the parameters over which we want to perform OLAP. A measure is a
numeric attribute of a fact, representing the performance or behavior of the business
relative to the dimensions.
Considering Relational context, there are three basic schemas that are used in dimensional
modeling:
1. Star schema
2. Snowflake schema
3. Fact constellation schema
4.1.Star schema
The multidimensional view of data that is expressed using relational database semantics is
provided by the database schema design called the star schema. The basis of the star schema is
that information can be classified into two groups:
Facts
Dimension
Star schema has one large central table (fact table) and a set of smaller tables
(dimensions) arranged in a radial pattern around the central table.
Facts are the core data elements being analyzed, while dimensions are attributes about the facts.
The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.
The star schema architecture is the simplest data warehouse schema. It is called a star
schema because the diagram resembles a star, with points radiating from a center. The
center of the star consists of fact table and the points of the star are the dimension tables.
Usually the fact tables in a star schema are in third normal form (3NF), whereas dimension
tables are de-normalized. Despite the fact that the star schema is the simplest architecture,
it is most commonly used nowadays and is recommended by Oracle.
Fact Tables
A fact table is a table that contains summarized numerical and historical data (facts) and a
multipart index composed of foreign keys from the primary keys of related dimension tables.
A fact table typically has two types of columns: foreign keys to dimension tables and
measures (those that contain numeric facts). A fact table can contain fact data at a detailed or
aggregated level.
Dimension Tables
Dimensions are categories by which summarized data can be viewed. E.g. a profit
summary in a fact table can be viewed by a Time dimension (profit by month, quarter, year),
Region dimension (profit by country, state, city), Product dimension (profit for product1,
product2).
Typical fact tables store data about sales while dimension tables data about geographic
region (markets, cities), clients, products, times, channels.
Measures are numeric data based on columns in a fact table. They are the primary data
which end users are interested in. E.g. a sales fact table may contain a profit measure
which represents profit on each sale.
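A minimal star schema sketch in SQL DDL for the sales example above, assuming illustrative table and column names (the four dimensions time, item, branch and location come from the text; everything else is hypothetical):

-- Dimension tables: one table per dimension, holding descriptive attributes.
CREATE TABLE dim_time     (time_key INT PRIMARY KEY, day DATE, month INT, quarter VARCHAR(2), year INT);
CREATE TABLE dim_item     (item_key INT PRIMARY KEY, item_name VARCHAR(50), brand VARCHAR(30), type VARCHAR(30));
CREATE TABLE dim_branch   (branch_key INT PRIMARY KEY, branch_name VARCHAR(50), branch_type VARCHAR(30));
CREATE TABLE dim_location (location_key INT PRIMARY KEY, street VARCHAR(50), city VARCHAR(30), state VARCHAR(30), country VARCHAR(30));

-- Fact table: foreign keys to every dimension plus the numeric measures.
CREATE TABLE sales_fact (
  time_key     INT REFERENCES dim_time(time_key),
  item_key     INT REFERENCES dim_item(item_key),
  branch_key   INT REFERENCES dim_branch(branch_key),
  location_key INT REFERENCES dim_location(location_key),
  units_sold   INT,
  dollars_sold DECIMAL(12,2),
  PRIMARY KEY (time_key, item_key, branch_key, location_key)
);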
1.Indexing
• It requires multiple metadata definitions (one for each component) to design a single table.
• Since the fact table must carry all key components as part of its primary key, addition or
deletion of levels in the hierarchy will require physical modification of the affected table,
which is a time-consuming process that limits flexibility.
• Carrying all the segments of the compound dimensional key in the fact table increases the
size of the index, thus impacting both performance and scalability.
2.Level Indicator.
The dimension table design includes a level of hierarchy indicator for every record.
Every query that is retrieving detail records from a table that stores details and aggregates
must use this indicator as an additional constraint to obtain a correct result.
If the user is not aware of the level indicator, or its values are incorrect, an otherwise
valid query may produce a totally invalid answer, as in the sketch below.
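A hedged illustration of the point (table and column names are hypothetical): when one table mixes detail and aggregate rows, every detail query must carry the level indicator as an extra constraint, and omitting it silently double-counts the aggregates.

-- Hypothetical combined table that stores both detail rows and pre-aggregated rows.
SELECT item_key, SUM(units_sold) AS units
FROM sales_fact_all_levels
WHERE level_indicator = 'DETAIL'   -- without this filter, aggregate rows are counted again
GROUP BY item_key;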
An alternative to using the level indicator is the snowflake schema, in which aggregate fact tables
are created separately from the detail tables. A snowflake schema contains separate fact tables for
each level of aggregation.
Other problems with the star schema design - the pairwise join problem:
Joining 5 tables requires joining the first two tables, then joining the result with the third table,
and so on.
The intermediate result of every join operation is used to join with the next table.
Selecting the best order of pairwise joins can rarely be solved in a reasonable amount of
time.
A five-table query has 5! = 120 possible join-order combinations.
2. Snowflake schema: is the result of decomposing one or more of the dimensions. The
many-to-one relationships among sets of attributes of a dimension can be separated into new
dimension tables, forming a hierarchy. The decomposed snowflake structure visualizes the
hierarchical structure of dimensions very well, as in the sketch below.
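A hedged sketch of decomposing one dimension of the earlier star schema into a snowflake (names are illustrative): the city-level attributes move into their own table, and the location dimension keeps only a key reference to it, forming a many-to-one hierarchy.

-- City-level attributes are normalized out of the location dimension.
CREATE TABLE dim_city (
  city_key INT PRIMARY KEY,
  city     VARCHAR(30),
  state    VARCHAR(30),
  country  VARCHAR(30)
);

CREATE TABLE dim_location_sf (
  location_key INT PRIMARY KEY,
  street       VARCHAR(50),
  city_key     INT REFERENCES dim_city(city_key)
);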
3. Fact constellation schema: For each star schema it is possible to construct a fact
constellation schema (for example, by splitting the original star schema into several star
schemas, each of which describes facts at another level of the dimension hierarchies). The fact
constellation architecture contains multiple fact tables that share many dimension tables (see the sketch below).
The main shortcoming of the fact constellation schema is a more complicated design
because many variants for particular kinds of aggregation must be considered and
selected. Moreover, dimension tables are still large.
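A hedged sketch of a fact constellation (a hypothetical shipping fact table is assumed): the second fact table simply shares the dimension tables already defined for sales.

-- Shipping facts reuse dim_time, dim_item and dim_location from the sales star schema.
CREATE TABLE shipping_fact (
  time_key      INT REFERENCES dim_time(time_key),
  item_key      INT REFERENCES dim_item(item_key),
  from_location INT REFERENCES dim_location(location_key),
  to_location   INT REFERENCES dim_location(location_key),
  units_shipped INT,
  shipping_cost DECIMAL(12,2)
);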
A STAR join is a high-speed, single-pass, parallelizable multi-table join method. It performs
many joins in a single operation using indexing technology. For query processing,
the indexes are used on columns and rows of the selected tables.
Red Brick's RDBMS indexes, called STAR indexes, are used for STAR join performance. The
STAR indexes are created on one or more foreign key columns of a fact table. A STAR index
contains information that relates the dimensions of a fact table to the rows that contain
those dimensions. STAR indexes are very space-efficient. The presence of a STAR index
allows Red Brick's RDBMS to quickly identify which target rows of the fact table are of
interest for a particular set of dimensions. Also, because STAR indexes are created over
foreign keys, no assumptions are made about the type of queries that can use the STAR
indexes. A typical query of this kind is sketched below.
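A hedged sketch of such a query (reusing the illustrative star-schema names from earlier; the filter values are made up): the fact table is joined to several dimension tables on their foreign keys while selections are applied to dimension attributes.

-- A typical star join: filter on dimension attributes, join on foreign keys,
-- aggregate the measures from the fact table.
SELECT d.year, i.brand, SUM(f.dollars_sold) AS revenue
FROM sales_fact f
JOIN dim_time     d ON f.time_key     = d.time_key
JOIN dim_item     i ON f.item_key     = i.item_key
JOIN dim_location l ON f.location_key = l.location_key
WHERE l.country = 'India'
  AND d.year IN (2005, 2006)
GROUP BY d.year, i.brand;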
SYBASE IQ is an example of a product that uses a bit-mapped index structure for the data
stored in the SYBASE DBMS.
Sybase released the SYBASE IQ database targeted as an "ideal" data mart solution for handling
multi-user ad hoc (unstructured) queries.
Overview:
SYBASE IQ is a separate SQL database.
Once loaded, SYBASE IQ converts all data into a series of bit maps, which are then
highly compressed and stored on disk.
SYBASE positions SYBASE IQ as a read-only database for data marts, with a practical
size limitation currently placed at 100 Gbytes.
Data cardinality: Bitmap indexes are used to optimize queries against low-cardinality data
— that is, data in which the total number of possible values is relatively low.
For example, the cardinality of an address field such as pin code may be 50 (50 possible values),
while the cardinality of a gender field is only 2 (male and female).
If the bit for a given index is "on", the value exists in the record. For example, a 10,000-row
employee table that contains a "gender" column can be bitmap-indexed on this value.
Bitmap indexes can become bulky and even unsuitable for high cardinality data where the
range of possible values is high. For example, values like "income" or "revenue" may have
an almost infinite number of values.
SYBASE IQ uses a patented technique called Bit-wise technology to build bitmap indexes
for high-cardinality data.
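For illustration only (Oracle-style CREATE BITMAP INDEX syntax rather than SYBASE IQ's own mechanism; the employee table and region column are hypothetical), a bitmap index on a low-cardinality column keeps one bit vector per distinct value, and several such indexes can be combined cheaply:

-- One bit per row per distinct value; very compact for gender (cardinality 2).
CREATE BITMAP INDEX emp_gender_bix ON employee (gender);
CREATE BITMAP INDEX emp_region_bix ON employee (region);

-- The two bit vectors are ANDed to answer this query without touching most rows.
SELECT COUNT(*) FROM employee WHERE gender = 'F' AND region = 'SOUTH';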
Index types: The first release of SYBASE IQ provides five index techniques.
SYBASE IQ advantages/Performance:
Bitwise technology
Compression
Optimized memory-based processing
Column wise processing
Low operating cost
Large block I/O
Operating-system-level parallelism
Prejoin and ad hoc join capabilities
Disadvantages of SYBASE IQ indexing:
No updates
Lack of core RDBMS features
Less advantageous for planned queries
High memory usage
4.4 Column Local Storage
Thinking Machines Corporation has developed the CM-SQL RDBMS product; this approach is
based on storing data column-wise, as opposed to traditional row-wise storage.
A traditional RDBMS approach to storing data in memory and on disk is to store it one
row at a time, and each row can be viewed and accessed as a single record. This approach
works well for OLTP environments in which a typical transaction accesses a record at a time.
However, for the set-processing, ad hoc query environment of data warehousing, the goal
is to retrieve multiple values of several columns. For example, if the problem is to calculate the
average, maximum and minimum salary, column-wise storage of the salary field
requires the DBMS to read only that one column rather than every full record, as sketched below.
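A hedged sketch of such a query (the employee table is hypothetical): only the salary column is referenced, so a column-wise store reads just that column, whereas a row store must scan every full record.

-- Only the salary column is needed by this aggregate query.
SELECT AVG(salary) AS avg_sal,
       MAX(salary) AS max_sal,
       MIN(salary) AS min_sal
FROM employee;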
The tools that convert data contents and formats from operational and external data stores
into the data warehouse perform a number of tasks.
The following are the criteria that affect a tool's ability to transform, consolidate,
integrate and repair the data:
1. The ability to identify data - in the data source environments that can be read by the
conversion tool is important.
2. Support for flat files, indexed files is critical. eg. VSAM , IMS and CA-IDMS
3. The capability to merge data from multiple data stores is required in many
installations.
4. The specification interface to indicate the data to be extracted and the conversion
criteria is important.
5. The ability to read information from data dictionaries or import information from
warehouse products is desired.
6. The code generated by the tool should be completely maintainable from within the
development environment.
7. Selective data extraction of both data elements and records enables users to extract
only the required data.
8. A field-level data examination for the transformation of data into information is
needed.
9. The ability to perform data-type and character-set translation is a requirement when
moving data between incompatible systems.
10. The capability to create summarization, aggregation and derivation records and field
is very important.
11. Vendor stability and support for the product items must be carefully evaluated.
Integrated solutions can fall into one of the categories described below.
• Code generators create tailored 3GL/4GL transformation programs based on source and target data
definitions, and on data transformation and enhancement rules defined by the developer. This
approach reduces the need for an organization to write its own data capture,
transformation, and load programs.
• Database data replication tools utilize database triggers or a recovery log to capture changes
to a single data source on one system and apply the changes to a copy of the source data
located on a different system.
• The data layer - provides data access and transaction services for management of
corporate data assets. This layer is independent of any current business process or user
interface application. It manages the data and implements the business rules for data
integrity.
• The process layer - provides services to manage automation and support for current
business processes. It allows modification of the supporting application logic independent of
the necessary data or user interface.
• The user layer - manages user interaction with process and/or data layer services. It
allows the user interface to change independently of the basic business processes.
• Prism solutions
• SAS Institute
• Validity Corporation
• Information Builders
Prism Warehouse Manager can extract data from multiple source environments, including
DB2, IDMS, IMS, VSAM, RMS, and sequential files under UNIX or MVS. It has strategic
relationships with Pyramid and Informix.
SAS institute:
SAS starts from the premise that critical data still resides in the data center and offers its
traditional SAS System tools to serve data warehousing functions. Its data repository
function can act to build the informational database.
SAS Data Access Engines serve as extraction tools to combine common variables,
transform data representation forms for consistency, consolidate redundant data, and use
business rules to produce computed values in the warehouse.
SAS engines can work with hierarchical and relational database and sequential files.
Validity Corporation:
Information builders:
A product that can be used as a component for data extraction, transformation and legacy
access tool suite for building data warehouse is EDA/SQL from information builders.
1.Informatica:
This is a multi-company metadata integration initiative. Informatica joined forces with Andyne,
Brio, Business Objects, Cognos, Information Advantage, Info space, IQ Software and
MicroStrategy to deliver a "back-end" architecture and publish API specifications supporting
its technical and business metadata.
2. Power Mart:
Informatica's flagship product — PowerMart suite — consists of the following components.
• Power Mart Designer
• Power Mart server
• The Informatica Server Manager
• The Informatica Repository
• Informatica Power Capture
3. Constellar:
The transformation hub performs the tasks of data cleanup and transformation.
Metadata is one of the most important aspects of data warehousing. It is data about data
stored in the warehouse and its users.
Metadata contains :-
i. The location and description of warehouse system and data components (warehouse
objects).
ii. Names, definition, structure and content of the data warehouse and end user views.
iii. Identification of reliable data sources (systems of record).
iv. Integration and transformation rules - used to generate the data warehouse; these
include the mapping method from operational databases into the warehouse, and
algorithms used to convert, enhance, or transform data.
v. Integration and transformation rules - used to deliver data to end-user analytical
tools.
vi. Subscription information - for the information delivery to the analysis subscribers.
vii. Data warehouse operational information, - which includes a history of warehouse
updates, refreshments, snapshots, versions, ownership authorizations and extract audit
trail.
viii. Metrics - used to analyze warehouse usage and performance and end user usage patterns.
ix. Security - authorizations access control lists, etc.
In an environment such as a data warehouse, different tools must be able to freely and easily
access, and in some cases manipulate and update, metadata that was created by other
tools and stored in a variety of different storage facilities. The way to achieve this goal is to
establish at least a minimum common denominator of interchange standards and guidelines that
different vendors' tools can fulfill. This effort has been taken up by the data warehousing vendors
and is known as the metadata interchange initiative.
The application meta model — the tables, etc., used to "hold" the metadata for a
particular application.
The metadata meta model — the set of objects that the metadata interchange standard
can be used to describe.
These represent the information that is common to one or more classes of tools, such as
data extraction tools, replication tools, user query tools and database servers.
• Procedural approach: An API (Application Program Interface) is used by tools to create, update,
access, and interact with metadata. This approach places the emphasis on developing
a standard metadata implementation.
• ASCII batch approach: This approach depends on an ASCII file format which contains the
description of the metadata components and the standardized access requirements that make up
the interchange standard metadata model.
• Hybrid approach: a data-driven model in which a table-driven API supports only fully qualified
references for each metadata element; a tool interacts with the API through the standard
access framework and directly accesses just the specific metadata objects needed.
• The standard metadata model, which refers to the ASCII file format used to represent the
metadata that is being exchanged.
• The standard access framework, which describes the minimum number of API functions a
vendor must support
• Tool profile, which is provided by each tool vendor. The tool profile is a file that describes
what aspects of the interchange standard metamodel a particular tool supports.
• The user configuration, which is a file describing the legal interchange paths for metadata
in the user's environment.
Metadata repository management software can be used to map the source data to the
target database, generate code for data transformations, integrate and transform the data,
and control moving data to the warehouse.
Metadata defines the contents and location of data (data model) in the warehouse,
relationships between the operational databases and the data warehouse and the business
views of the warehouse data that are accessible by end-user tools.
A data warehouse design must ensure a mechanism for maintaining the metadata repository,
and all the access paths to the data warehouse must have metadata as an entry point.
There is a variety of access paths into the data warehouse, and many tool classes can be
involved in the process.
A major problem in data warehousing is the inability to communicate to the end user about
what information resides in the data warehouse and how it can be accessed.
• It can define all data elements and their attributes, data sources and timing, and the rules
that govern data use and data transformation.
• Metadata needs to be collected as the warehouse is designed and built.
• Even though there are a number of tools available to help users understand and use the
warehouse, these tools need to be carefully evaluated before any purchasing decision is
made.
The data warehouse arena must include external data within the data warehouse.
The data warehouse must reduce costs and increase competitiveness and business
agility.
The process of integrating external and internal data into the warehouse faces a number of
challenges.
Inconsistent data formats
Missing or invalid data
Different levels of aggregation
Semantic inconsistency
Unknown or questionable data quality and timeliness
Data warehouses integrate various data types, such as alphanumeric data, text, voice,
image, full-motion video, and web pages in HTML format.
UNIVERSITY QUESTIONS
UNIT- I
Part A
1. Define the term ‘Data Warehouse’.
2. List out the functionality of metadata.
3. What are the nine decisions in the design of a data warehouse?
4. List out the two different types of reporting tools.
5. What are the technical issues to be considered when designing and implementing a
data warehouse environment?
6. What are the advantages of data warehousing?
7. Give the difference between the Horizontal and Vertical Parallelism.
8. Define star schema.
9. What are the steps to be followed to store the external source into the data
warehouse?
10. Define Legacy data.
Part-B
1. Enumerate the building blocks of data warehouse. Explain the importance of
metadata in a data warehouse environment. [16]
2. Explain various methods of data cleaning in detail [8]
3. Diagrammatically illustrate and discuss the data warehousing architecture with briefly
explain components of data warehouse [16]
4. (i) Distinguish between Data warehousing and data mining. [8]
(ii)Describe in detail about data extraction, cleanup [8]
5. Write short notes on (i) Transformation [8]
(ii) Metadata [8]
6. List and discuss the steps involved in mapping the data warehouse to a
multiprocessor architecture. [16]
7. Explain in detail about different Vendor Solutions. [16]
UNIT II
A concept hierarchy that is a total (or) partial order among attributes in a database schema
is called a schema hierarchy.
The roll-up operation is also called drill-up operation which performs aggregation on a data
cube either by climbing up a concept hierarchy for a dimension (or) by dimension reduction.
Drill-down is the reverse of the roll-up operation. It navigates from less detailed data to more
detailed data. The drill-down operation can be performed by stepping down a concept
hierarchy for a dimension.
The slice operation performs a selection on one dimension of the cube resulting in a sub
cube.
The dice operation defines a sub cube by performing a selection on two (or) more
dimensions.
This is a visualization operation that rotates the data axes in an alternative presentation of
the data.
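As a hedged SQL sketch of roll-up and drill-down (reusing the illustrative star-schema names introduced in Unit I; they are assumptions, not from the text), the two operations correspond to aggregating at a coarser or a finer level of a concept hierarchy:

-- Base view: sales summarized at the city level of the location hierarchy.
SELECT l.city, SUM(f.dollars_sold) AS sales
FROM sales_fact f JOIN dim_location l ON f.location_key = l.location_key
GROUP BY l.city;

-- Roll-up: climb the hierarchy from city to country (more summarized).
SELECT l.country, SUM(f.dollars_sold) AS sales
FROM sales_fact f JOIN dim_location l ON f.location_key = l.location_key
GROUP BY l.country;

-- Drill-down on time: step from quarter down to month (more detailed).
SELECT t.quarter, t.month, SUM(f.dollars_sold) AS sales
FROM sales_fact f JOIN dim_time t ON f.time_key = t.time_key
GROUP BY t.quarter, t.month;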
• Top-down view
• Data source view
• Data warehouse view
• Business query view
9.What are the methods for developing large software systems? [CO2-L1]
• Waterfall method
• Spiral method
The waterfall method performs a structured and systematic analysis at each step before
proceeding to the next, which is like a waterfall falling from one step to the next.
11.List out the steps of the data warehouse design process? [CO2-L2]
The MOLAP model is a special purpose server that directly implements multidimensional
data and operations.
The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the
greater scalability of ROLAP and the faster computation of MOLAP,(i.e.) a HOLAP server
may allow large volumes of detail data to be stored in a relational database, while
aggregations are kept in a separate MOLAP store.
An enterprise warehouse collects all the information about subjects spanning the entire
organization. It provides corporate-wide data integration, usually from one (or) more
operational systems (or) external information providers. It contains detailed data as well as
summarized data and can range in size from a few gigabytes to hundreds of gigabytes,
terabytes (or) beyond. An enterprise data warehouse may be implemented on traditional
mainframes, UNIX super servers (or) parallel architecture platforms. It requires business
modeling and may take years to design and build.
Data mart is a database that contains a subset of data present in a data warehouse. Data
marts are created to structure the data in a data warehouse according to issues such as
hardware platforms and access control strategies. We can divide a data warehouse into
data marts after the data warehouse has been created. Data marts are usually
implemented on low-cost departmental servers that are UNIX (or) windows/NT based. The
implementation cycle of the data mart is likely to be measured in weeks rather than months
(or) years.
Dependent data marts are sourced directly from enterprise data warehouses. Independent
data marts are sourced from data captured from one (or) more operational systems (or) external
information providers.
A virtual warehouse is a set of views over operational databases. For efficient query
processing, only some of the possible summary views may be materialized. A virtual
warehouse is easy to build but requires excess capability on operational database servers.
Indexing is a technique, which is used for efficient data retrieval (or) accessing data in a
faster manner. When a table grows in volume, the indexes also increase in size requiring
more storage.
Metadata in a data warehouse is used for describing data about data, (i.e.) metadata
are the data that define warehouse objects. Metadata are created for the data names and
definitions of the given warehouse.
Part - B
1. Define all the Reporting and query tools for data analysis:- [CO2-H2]
The principal purpose of data warehousing is to provide information to business users for
strategic decision making. These users interact with the data warehouse using front-end
tools, or by getting the required information through the information delivery system.
1.1 Tool Categories
There are five categories of decision support tools
1. Reporting
2. Managed query
3. Executive information systems (EIS)
4. On-line analytical processing (OLAP)
5. Data mining (DM)
1.1.1.Reporting tools:
Reporting tools can be divided into production reporting tools and desktop report writers.
1.1 Production reporting tools: Companies use production reporting tools to generate regular
operational reports or to support high-volume batch jobs, e.g. calculating and printing pay
checks.
1.2 Report writers: are inexpensive desktop tools designed for end users. Products such
as Seagate Software's Crystal Reports allow users to design and run reports without having
to rely on the IS department.
In general, report writers have graphical interfaces and built-in charting functions.
They can pull groups of data from a variety of data sources and integrate them in a single
report.
Leading report writers include Crystal Reports, Actuate and Platinum Technology,
Inc.'s Info Reports. Vendors are trying to increase the scalability of report writers by
supporting three-tiered architectures in which report processing is done on a Windows NT
or UNIX server.
Report writers also are beginning to offer object-oriented interfaces for designing
and manipulating reports and modules for performing ad hoc queries and OLAP analysis.
Users and related activities
EIS tools include pilot software, Inc.'s Light ship, Platinum Technology's Forest and Trees,
Comshare, Inc.'s Commander Decision, Oracle's Express Analyzer and SAS Institute, Inc.'s
SAS/EIS.
EIS vendors are moving in two directions.
Many are adding managed query functions to compete head-on with other -decision
support tools.
Others are building packaged applications that address horizontal functions, such as
sales budgeting, and marketing, or vertical industries such as financial services.
Ex: Platinum Technologies offers Risk Advisor.
1.1.4 OLAP tools:
These provide an intuitive way to view corporate data.
These tools aggregate data along common business subjects or dimensions and then let
users navigate through the hierarchies and dimensions with the click of a mouse button.
Some tools, such as Arbor Software Corp.'s Essbase and Oracle's Express, pre-aggregate
data in special multidimensional databases.
Other tools work directly against relational data and aggregate data on the fly, such as
MicroStrategy, Inc.'s DSS Agent or Information Advantage, Inc.'s DecisionSuite.
Some tools process OLAP data on the desktop instead of server.
Desktop OLAP tools include Cognos PowerPlay, Brio Technology Inc.'s BrioQuery,
Planning Sciences, Inc.'s Gentium, and Andyne's Pablo.
1.1.5 Data mining tools:
These provide insights into corporate data that are not easily discerned with managed query or
OLAP tools.
Data mining tools use a variety of statistical and artificial intelligence (AI) algorithm to
analyze the correlation of variables in the data and search out interesting patterns and
relationship to investigate.
Data mining tools, such as IBM's Intelligent Miner, are expensive and require statisticians
to implement and manage.
These include DataMind Corp.'s DataMind, Pilot's Discovery Server, and tools from
Business Objects and SAS Institute.
These tools offer simple user interfaces that plug in directly to existing OLAP tools or
databases and can be run directly against data warehouses.
For example, all end-user tools use metadata definitions to obtain access to data stored in
the warehouse, and some of these tools (eg., OLAP tools) may employ additional or
intermediary data stores. (eg., data marts, multi dimensional data base).
1.1.6 Applications
Organizations use a familiar application development approach to build a query and
reporting environment for the data warehouse. There are several reasons for doing this:
A legacy DSS or EIS system is still being used, and the reporting facilities appear adequate.
An organization has made a large investment in a particular application development
environment (eg., Visual C++, Power Builder).
A new tool may require an additional investment in developers skill set, software, and the
infrastructure, all or part of which was not budgeted for in the planning stages of the project.
The business users do not want to get involved in this phase of the project, and will continue
to rely on the IT organization to deliver periodic reports in a familiar format.
A particular reporting requirement may be too complicated for an available reporting tool to
handle.
All these reasons are perfectly valid and in many cases result in a timely and cost-effective
delivery of a reporting system for a data warehouse.
2. What is the need for applications? [CO2-H2]
Most of these tools and applications fit into the managed query and EIS categories. These are
easy-to-use, point-and-click tools that either accept SQL or generate SQL statements to query
relational data stored in the warehouse.
Some of these tools and applications can format the retrieved data in easy-to-read reports, while
others concentrate on the on-screen presentation.
Users run business applications such as:
segment identification,
demographic analysis,
territory management, and
customer mailing lists.
As the complexity of the questions grows, these tools may rapidly become inefficient. Consider the
various access types to the data stored in a data warehouse.
Systems).
Interactive drill-down reporting and analysis.
The first four types of access are covered by the combined category of tools called query
and reporting tools
1. Creation and viewing of standard reports:
This is the main reporting activity: the routine delivery of reports based on predetermined
measures.
2. Definition and creation of ad-hoc reports:
These can be quite complex, and the trend is to off-load this time-consuming activity to the
users.
Reporting tools that allow managers and business users to quickly create their own reports
and get quick answers to business questions are becoming increasingly popular.
3. Data exploration: With the newest wave of business intelligence tools, users can easily
"surf" through data without a preset path to quickly uncover business trends or problems.
While reporting type 1 may appear relatively simple, types 2 and 3, combined
with certain business requirements, often exceed existing tools' capabilities and may
require building sophisticated applications to retrieve and analyze warehouse data.
This approach may be very useful for those data warehouse users who are not yet
comfortable with ad hoc queries.
Impromptu, from Cognos Corporation, is a tool for interactive database reporting that delivers
1 to 1,000+ seat scalability.
Impromptu's object-oriented architecture ensures control and administrative consistency
across all users and reports.
Users access Impromptu through its easy-to-use graphical user interface.
It offers a fast and robust implementation at the enterprise level, and features full
administrative control, ease of deployment, and low cost of ownership.
It can support enterprise database reporting as well as single-user reporting on personal data.
A catalog contains:
• Query activity
• Processing location
• Database connections
• Reporting permissions
• User profiles
• Client/server balancing
• Database transaction
• Security by value
• Field and table security
Reporting
Impromptu is designed to make it easy for users to build and run their own reports.
Impromptu's predefined report templates include templates for mailing labels,
invoices, sales reports, and directories. These templates are complete with
formatting, logic, calculations, and custom automation.
The templates are database-independent; therefore, users simply map their data
onto the existing placeholders to quickly create reports.
Impromptu provides users with a variety of page and screen formats, known as HeadStarts.
Impromptu offers special reporting options that increase the value of distributed
standard reports.
Picklists and prompts: Organizations can create standard Impromptu reports for which
users can select from lists of values called picklists. Picklists and prompts make a single
report flexible enough to serve many users.
Custom templates: Standard report templates with global calculations and business rules
can be created once and then distributed to users of different databases.
A template's standard logic, calculations and layout complete the report automatically in the
user's choice of format.
Exception reporting: Exception reporting is the ability to have reports highlight values that
lie outside accepted ranges. Impromptu offers three types of exception reporting.
Conditional filters — retrieve only those values that lie outside an accepted threshold.
Conditional highlighting — create rules for formatting data on the basis of data
values.
Conditional display — display report objects only under certain conditions.
Interactive reporting: Impromptu unifies querying and reporting in a single interface. Users
can perform both these tasks by interacting with live, data in one integrated module.
Frames: Impromptu offers an interesting frame based reporting style.
Frames are building blocks that may be used to produce reports that are formatted with
fonts, borders, colors, shading etc.
Frames, or combinations of frames, simplify building even complex reports.
The data formats itself according to the type of frame selected by the user.
Text frames allow users to add descriptive text to reports and display binary large
objects (BLOBS) such as product descriptions.
Picture frames incorporate bitmaps to reports or specific records, perfect for visually
enhancing reports.
OLE frames make it possible for user to insert any OLE object into a report.
Impromptu Request Server
The Impromptu Request Server allows clients to off-load query processing to the server. A
PC user can now schedule a request to run on the server, and the Impromptu Request Server
will execute the request, generating the result on the server. When done, the
scheduler notifies the user, who can then access, view or print the result at will from the PC.
The Impromptu request server runs on HP/UX 9.X, IBM AIX 4.X and Sun Solaris 2.4. It
supports data maintained in ORACLE 7.X and SYBASE system 10/11.
Supported databases
Impromptu provides a native database support for ORACLE, Microsoft SQL Server,
SYBASE, SQL Server, Omni SQL Gateway, SYBASE Net Gateway. MDI DB2 Gateway,
Informix, CA-Ingres, Gupta SQL-Base, Borland InterBase, Btrieve, dBASE, Paradox, and
ODBC accessing any database with an ODBC driver,
Impromptu features include:
* Unified query and reporting interface
* Object-oriented architecture
* Complete integration with power play
* Scalability
* Security and control
* Data presented in business context
* Over 70 predefined report templates
* Frame-based reporting
* Business-relevant reporting
* Database-independent catalogs
Online Analytical Processing (OLAP) applications and tools are those that are designed to
ask "complex queries of large multidimensional collections of data." For this reason, OLAP
goes hand in hand with data warehousing.
One of the limitations of SQL is that it cannot easily represent complex analytical problems. A
single business question may have to be translated into several SQL statements. These SQL
statements will involve multiple joins, intermediate tables, sorting, aggregations and a large
amount of temporary space to store intermediate tables. These procedures require a lot of
computation and therefore a long time to run.
The second limitation of SQL is its inability to use mathematical models within SQL
statements. If an analyst creates these complex statements using SQL, a large amount of
computation and memory is needed. Therefore the use of OLAP is preferable for solving this
kind of problem.
The multidimensional data model views data as a cube. The cube at the left contains
detailed sales data by product, market and time. The cube on the right associates a sales
number (units sold) with the dimensions product type, market and time, with the unit
variables organized as cells in an array.
This cube can be expanded to include another array, price, which can be associated with
all or only some dimensions. As the number of dimensions increases, the number of cells
in the cube increases exponentially.
Dimensions are hierarchical in nature; for example, the time dimension may contain
hierarchies for years, quarters, months, weeks and days, and a GEOGRAPHY dimension
may contain country, state, city, and so on.
In this cube we can observe that each side of the cube represents one element of the
question: the x-axis represents time, the y-axis represents the products and the z-axis
represents the different centers. The cells of the cube represent the number of products
sold, or can represent the price of the items.
This figure also gives a different understanding of the drill-down operation: the relations
involved need not be related directly; they can be related through the dimension hierarchies.
As the size of the dimensions increases, the size of the cube also increases exponentially,
and the response time of the cube depends on its size.
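To make the cube model concrete, the following minimal Python sketch (with hypothetical product, market and time members that are not taken from the figures referred to above) represents a sales cube as a 3-D array and shows how a single cell is addressed and how the time dimension is rolled up.

import numpy as np

# Hypothetical dimension members, for illustration only
products = ["TV", "PC", "VCR"]
markets = ["North", "South"]
quarters = ["Q1", "Q2", "Q3", "Q4"]

# Sales cube: units sold, indexed as [product, market, time]
rng = np.random.default_rng(0)
sales = rng.integers(10, 100, size=(len(products), len(markets), len(quarters)))

# One cell of the cube: units of "PC" sold in the "South" market during "Q2"
cell = sales[products.index("PC"), markets.index("South"), quarters.index("Q2")]
print("PC / South / Q2 ->", cell)

# Rolling up the time dimension: total units per product and market for the year
annual = sales.sum(axis=2)
print("Annual units (product x market):\n", annual)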
Drill-down
Drill-down is the reverse operation of roll-up. It is performed either by stepping down a concept hierarchy for a dimension or by introducing an additional dimension.
Slice
The slice operation selects one particular dimension from a given cube and provides a new
sub-cube. Consider the following diagram that shows how slice works.
Here slice is performed for the dimension "time" using the criterion time = "Q1", forming a
new sub-cube over the remaining dimensions.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Consider the following diagram that shows the dice operation.
The dice operation on the cube based on the following selection criteria involves three
dimensions.
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item =" Mobile" or "Modem")
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to
provide an alternative presentation of data. Consider the following diagram that shows the
pivot operation.
Here the item and location axes of a 2-D slice are rotated.
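The roll-up, drill-down, slice, dice and pivot operations described above can be illustrated on the same kind of array-based cube. This is only a sketch with hypothetical values; the dimension members are borrowed from the dice criteria given above.

import numpy as np

locations = ["Toronto", "Vancouver", "Chicago", "New York"]
items = ["Mobile", "Modem", "Phone", "Security"]
times = ["Q1", "Q2", "Q3", "Q4"]

rng = np.random.default_rng(1)
cube = rng.integers(100, 1000, size=(len(locations), len(items), len(times)))

# Slice: select the single criterion time = "Q1", producing a 2-D sub-cube
slice_q1 = cube[:, :, times.index("Q1")]

# Dice: select on three dimensions at once
loc_idx = [locations.index(l) for l in ("Toronto", "Vancouver")]
item_idx = [items.index(i) for i in ("Mobile", "Modem")]
time_idx = [times.index(t) for t in ("Q1", "Q2")]
dice = cube[np.ix_(loc_idx, item_idx, time_idx)]

# Pivot (rotate): swap the item and location axes of the 2-D slice
pivoted = slice_q1.T

# Roll-up on location: aggregate all cities into one total per item and quarter
rollup = cube.sum(axis=0)

print(slice_q1.shape, dice.shape, pivoted.shape, rollup.shape)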
5.Name all the OLAP Guidelines and rules for implementation process. [CO2-H2]
Dr. E. F. Codd, the "father" of the relational model, created a list of rules for OLAP systems.
2) Transparency: (OLAP must be transparent to the input data for the users.)
The OLAP system's technology, the underlying database and computing architecture
(client/server, mainframe gateways, etc.) and the heterogeneity of input data sources
should be transparent to users, to preserve their productivity and proficiency with familiar
front-end environments and tools (e.g., MS Windows, MS Excel).
3) Accessibility: (The OLAP tool should access only the data required for the analysis.)
The OLAP system should access only the data actually required to perform the
analysis. The system should be able to access data from all heterogeneous enterprise data
sources required for the analysis.
6) Generic dimensionality: Every data dimension should be equivalent in its structure and
operational requirements.
7) Dynamic sparse matrix handling: The OLAP tool should be able to manage the sparse
matrix and thus maintain the expected level of performance.
8) Multi-user support: The OLAP system should allow several users to work concurrently
on a specific model.
10) Intuitive data manipulation: Consolidation path reorientation, pivoting, drill-down, roll-up
and other manipulations should be accomplished via direct point-and-click and
drag-and-drop operations on the cells of the cube.
11) Flexible reporting: The ability to arrange rows, columns, and cells in a fashion that
facilitates analysis through spontaneous visual presentation of analytical reports must exist.
12) Unlimited dimensions and aggregation levels: This depends on the kind of business;
multiple dimensions and defined hierarchies should be supported without practical limit.
14) The ability to drill down to detail (source record) level: This requires that the OLAP tool
allow a smooth transition from the multidimensional database down to the detail records.
15) Incremental database refresh: The OLAP tool should provide partial (incremental) refresh.
16) SQL interface: The OLAP system should be able to integrate with the existing
enterprise environment through a standard SQL interface.
Multidimensional structure: "A variation of the relational model that uses multidimensional
structures to organize data and express the relationships between data."
Multidimensional: MOLAP
MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.
MOLAP stores data in an optimized multidimensional array structure, rather than in a
relational database. It therefore requires the pre-computation and storage of information in
the cube, an operation known as processing.
MOLAP analytical operations :-
Consolidation: involves the aggregation of data such as roll-ups or complex expressions
involving interrelated data. For example, branch offices can be rolled up to cities and rolled
up to countries.
Drill-Down: is the reverse of consolidation and involves displaying the detailed data that
comprises the consolidated data.
Slicing and dicing: refers to the ability to look at the data from different viewpoints. Slicing
and dicing is often performed along a time axis in order to analyze trends and find patterns.
ROLAP works directly with relational databases. The base data and the dimension tables
are stored as relational tables and new tables are created to hold the aggregated
information.
It depends on a specialized schema design.
This methodology relies on manipulating the data stored in the relational database to give
the appearance of traditional OLAP's slicing and dicing functionality.
Comparison of MOLAP and ROLAP:
3. MOLAP is best suited for inexperienced users, since it is very easy to use; ROLAP is
best suited for experienced users.
4. MOLAP maintains a separate database for data cubes; ROLAP may not require space
other than that already available in the data warehouse.
1.MOLAP
These products use a multidimensional database management system (MDDBMS) to
organize, navigate, and analyze data, typically in an aggregated form, and they require a
tight coupling with the application and presentation layers.
This architecture enables excellent performance when the data is utilized as designed, and
predictable application response times for applications addressing a narrow breadth of data
for a specific DSS requirement.
Applications requiring iterative and comprehensive time-series analysis of trends are well
suited to MOLAP technology (e.g., financial analysis and budgeting). Examples include
Arbor Software's Essbase and Oracle's Express Server.
The implementation of applications with MOLAP products, however, has limitations.
First, there are limitations in the ability of the data structures to support multiple subject areas
of data (a common trait of many strategic DSS applications) and to hold the detail data
required by many analysis applications. This has begun to be addressed in some products,
utilizing basic "reach-through" mechanisms that enable the MOLAP tools to access detail
data maintained in an RDBMS (Fig.).
Fig: - MOLAP architecture
MOLAP products require a different set of skills and tools for the database administrator to
build and maintain the database, thus increasing the cost and complexity of support.
These hybrid solutions have as their primary characteristic the integration of specialized
multidimensional data storage with RDBMS technology, providing users with a facility that
tightly "couples" the multidimensional data structures (MDDSs) with data maintained in an
RDBMS.
This approach can be very useful for organizations with performance — sensitive
multidimensional analysis requirements and that have built, or are in the process of
building, a data warehouse architecture that contains multiple subject areas.
For example, aggregations along selected dimensions (product and sales region) can be
stored and maintained in a persistent structure. These structures can be automatically
refreshed at predetermined intervals established by an administrator.
2.ROLAP
This is the fastest growing style of OLAP technology, with new vendors (e.g., Sagent
Technology) entering the market at an accelerating pace. These products access data
directly from the relational database through a dictionary layer of metadata, bypassing any
requirement for creating a static multidimensional data structure.
Fig: - ROLAP architecture
Some products (e.g., Andyne's Pablo) that began as ad hoc query tools have developed
features to provide "datacube" and "slice and dice" analysis capabilities. This is achieved
by first developing a query to select data from the DBMS, which then delivers the requested
data to the desktop, where it is placed into a data cube. This data cube can be stored and
maintained locally to reduce the overhead required to create the structure each time the
query is executed.
Once the data is in the data cube; users can perform multidimensional analysis (i.e., Slice,
dice, and pivot operations) against it. The simplicity of the installation and administration of
such products makes them particularly attractive to organizations looking to provide
seasoned users with more sophisticated analysis capabilities, without the significant cost
and maintenance of more complex products.
While this mechanism allows each user the flexibility to build a custom datacube, the lack
of data consistency among users and the relatively small amount of data that can be
efficiently maintained are significant challenges facing tool administrators. Examples
include Cognos Software's PowerPlay, Andyne Software's Pablo, Business Objects'
Mercury Project, Dimensional Insight's CrossTarget and Speedware's Media.
8. In what way the OLAP Tools are use for the Internet? [CO2-H1]
Two important themes in computing are the Internet/Web and data warehousing. The
reason for bringing them together is simple: the advantages of using the web for access are
magnified even further in a data warehouse.
• The internet is a virtually free resource which provides a universal connectivity within and
between companies.
• The web simplifies complex administrative tasks of managing distributed environment.
• The web allows companies to store and manage both data and applications on servers
that can be centrally managed maintained and updated, thus eliminating problems with
software and data concurrency.
The general features of the web-enabled data access.
• The first-generation web sites used a static distribution model, in which clients access
static HTML pages via web browsers. The decision support reports were stored as HTML,
documents and delivered to users on request. This model has some serious deficiencies,
including the inability to provide web clients with interactive analytical capabilities such as
drill-down.
• The second-generation web sites support interactive database queries by utilizing a
multitiered architecture, in which a web client submits a query in the form of an
HTML-encoded request to a web server, which in turn transforms the request for structured
data into a CGI (Common Gateway Interface) script, or a script written to a proprietary
web-server API (e.g., Netscape Server API, or NSAPI). The gateway submits SQL queries
to the database, receives the results, translates them into HTML, and sends the pages to
the requester.
Requests for the unstructured data (eg., images, other HTML documents etc.,) can be sent
directly to the unstructured data store.
• The emerging third-generation web sites replace HTML gateways with web-based
application servers. These servers can download Java applets or ActiveX applications that
execute on clients, or interact with corresponding applets or servlets running on the servers.
The third-generation web servers provide users with all the capabilities of existing decision-
support applications without requiring them to load any client software except a web
browser. Decision support applications, especially query, reporting and OLAP tools are
rapidly converting their tools to work on the web.
• HTML publishing: This approach involves transforming an output of a query into the
HTML page that can be downloaded into a browser.
• Helper applications: A tool is configured as a helper application that resides within a
browser. This is a case of a fat client, in which, once the data is downloaded, users can
take advantage of all capabilities of the tool to analyze data.
• Plug-ins: A variation on the previous approach, plug-ins are helper applications that are
downloaded from the web server prior to their initial use. Since the plug-ins are downloaded
from the server, their normal administration and installation tasks are significantly reduced.
• Server-centric components: In this approach the vendor rebuilds a desktop tool as a
server component, or creates a new server component that can be integrated with the web
via a web gateway (eg., CGI or NSAPI scripts).
• Java and ActiveX applications: In this approach a vendor redevelops all or portions
of its tool in Java or ActiveX. This results in a true "thin client" model. It is promising and
flexible.
Several OLAP Tools from a Perspective of Internet/Web Implementations
Arbor Essbase Web: Essbase is one of the most ambitious of the early web products. It
includes not only OLAP manipulations, such as drill up, down, and across; pivot; slice and
dice; and fixed and dynamic reporting, but also data entry, including full multi-user
concurrent write capabilities, a feature that differentiates it from the others.
Brio Technology: Brio shipped a suite of new products called brio.web.warehouse. This suite
implements several of the approaches listed above for deploying decision support and
OLAP applications on the web. The key to Brio's strategy is a new server component called
brio-query.server.
UNIT III
Part- A
It refers to extracting or "mining" knowledge from large amounts of data. Data mining is a
process of discovering interesting knowledge from large amounts of data stored either in
databases, data warehouses, or other information repositories.
• Knowledge mining
• Knowledge extraction
• Data/pattern analysis.
• Data Archaeology
• Data dredging
• Data cleaning
• Data Mining
• Pattern Evaluation
• Knowledge Presentation
• Data Integration
• Data Selection
• Data Transformation
Knowledge base is domain knowledge that is used to guide search or evaluate the
Interestingness of resulting pattern. Such knowledge can include concept hierarchies used
to organize attribute /attribute values in to different levels of abstraction.
It is used to predict the values of data by making use of known results from a different set
of sample data.
It is used to determine the patterns and relationships in a sample data set. Data mining
tasks that belong to the descriptive model include:
• Clustering • Summarization • Association rules • Sequence discovery
• Extended-relational databases
• Object-oriented databases
• Deductive databases
• Spatial databases
• Temporal databases
• Multimedia databases
• Active databases
• Scientific databases
• Knowledge databases
Cluster analysis analyzes data objects without consulting a known class label. The class
labels are not present in the training data simply because they are not known to begin with.
12.Describe challenges to data mining regarding data mining methodology and user
interaction issues? [CO3-L2]
A pattern represents knowledge if it is easily understood by humans, valid on test data with
some degree of certainty, and potentially useful or novel, or if it validates a hunch about
which the user was curious. Measures of pattern interestingness, either objective or
subjective, can be used to guide the discovery process.
18. When we can say the association rules are interesting? [CO3-L1]
Association rules are considered interesting if they satisfy both a minimum support
threshold and a minimum confidence threshold. Users or domain experts can set such
thresholds
20. How are association rules mined from large databases? [CO3-L1]
Step 1: Find all frequent itemsets.
Step 2: Generate strong association rules from the frequent itemsets.
PART –B
Fig:2 - Data mining — searching for knowledge (interesting patterns) in your data
Database, data warehouse, or other information repositories: This is one or a set of
databases, data warehouses, or other information repositories. Data cleaning and data
integration techniques may be performed on the data.
Database or data warehouse server: The database or data warehouse server fetches the
relevant data, based on the user's data mining request.
Knowledge base: This is the domain knowledge that is used to guide the search or evaluate
the interestingness of resulting patterns. Such knowledge can include concept hierarchies,
used to organize attributes or attribute values into different levels of abstraction; knowledge
such as user beliefs, thresholds and metadata can be used to assess a pattern's
interestingness.
Data mining engine: This is essential to the data mining system and ideally consists of a
set of functional modules for tasks such as characterization, association, classification,
cluster analysis, evolution and outlier analysis.
Pattern evaluation module: This component uses interestingness measures and interacts
with the data mining modules so as to focus the search towards interesting patterns. It may
use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern
evaluation module may be integrated with the mining module.
Graphical user interface: This module communicates between users and the data mining
system, allowing the user to interact with the system by specifying a data mining task or
query and to perform exploratory data mining based on intermediate data mining results.
It also allows the user to browse database and data warehouse schemas or data
structures, evaluate mined patterns and visualize the patterns in different forms such as
maps, charts, etc.
Data Mining — on What Kind of Data
Data mining should be applicable to any kind of information repository. This includes
Flat files
Relational databases,
Data warehouses,
Transactional databases,
Advanced database systems,
World-Wide Web.
Advanced database systems include
Object-oriented and
Object relational databases, and
Specific application-oriented databases such as
Spatial databases,
Time-series databases,
Text databases,
Multimedia databases.
Flat files: Flat files are simple data files in text or binary format with a structure known by
the data mining algorithm to be applied. The data in these files can be transactions, time-
series data, scientific measurements, etc.
Relational databases: A relational database is a collection of tables. Each table consists of
a set of attributes (columns or fields) and a set of tuples (records or rows). Each tuple is
identified by a unique key and is described by a set of attribute values. Entity relationships
(ER) data model is often constructed for relational databases. Relational data can be
accessed by database queries written in a relational query language.
e.g., Product and Market tables.
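As a small illustration of relational access, the sketch below builds a hypothetical Product table with Python's sqlite3 module (the table name and columns are assumptions, not the actual tables mentioned above) and retrieves tuples with a relational query.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical Product relation: each tuple is identified by a unique key (prod_id)
cur.execute("CREATE TABLE product (prod_id INTEGER PRIMARY KEY, name TEXT, price REAL)")
cur.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [(1, "TV", 400.0), (2, "PC", 900.0), (3, "VCR", 120.0)],
)

# A relational query: all products costing more than 300, ordered by price
cur.execute("SELECT name, price FROM product WHERE price > 300 ORDER BY price")
print(cur.fetchall())
conn.close()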
Data warehouse:
A data warehouse is a repository of information collected from multiple sources, stored
under a unified schema, and usually residing at a single site.
A data warehouse is organized around a multidimensional database structure, where each
dimension corresponds to an attribute or a set of attributes in the schema.
Fig: - Architecture of data warehouse
A data warehouse is formed of data cubes. Each dimension is an attribute (or set of
attributes) and each cell stores an aggregate measure. A data warehouse collects
information about subjects that span an entire organization, whereas a data mart focuses
on selected subjects. The multidimensional data views make Online Analytical Processing
(OLAP) easier.
Transactional databases: A transactional database consists of a file where each record
represents a transaction. A transaction includes transaction identity number, list of items,
date of transactions etc.
Advanced databases:
Object oriented databases: Object oriented databases are based on object-oriented
programming concept. Each entity is considered as an object which encapsulates data and
code into a single unit objects are grouped into a class.
Object-relational database: Object relational database are constructed based on an
object relational data mode which extends the basic relational data model by handling
complex data types, class hierarchies and object inheritance.
Spatial databases: A spatial database stores a large amount of space-related data, such
as maps, preprocessed remote sensing or medical imaging data and VLSI chip layout data.
Spatial data may be represented in raster format, consisting of n-dimensional bit maps or
pixel maps.
Temporal Databases, Sequence Databases, and Time-Series Databases
A temporal database typically stores relational data that include time-related attributes.
A sequence database stores sequences of ordered events, with or without a concrete
notion of time, e.g., customer shopping sequences and Web click streams.
A time-series database stores sequences of values or events obtained over repeated
measurements of time (e.g., hourly, daily, weekly). E.g stock exchange, inventory control,
observation of temperature and wind.
Text databases and multimedia databases: Text databases contains word descriptions
of objects such as long sentences or paragraphs, warning messages, summary reports etc.
Text database consists of large collection of documents from various sources. Data stored
in most text databases are semi structured data.
A multimedia database stores and manages a large collection of multimedia objects such
as audio data, image, video, sequence and hypertext data.
The World Wide Web: The World Wide Web and its associated distributed information
services, such as Yahoo!, Google, America Online, and AltaVista, provides worldwide, on-
line information services. Capturing user access patterns in such distributed information
environments is called Web usage mining or Weblog mining.
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make
predictions.
Users may have no idea which kinds of patterns are required, so they search for
several different kinds of patterns in parallel.
Data mining systems
- should be able to discover patterns at various granularities
- should help users find interesting patterns.
Data mining functionalities:
Outlier analysis
Evolution analysis
Association analysis.
This association rule involves a single attribute or predicate (i.e., buys) that repeats.
Association rules that contain a single predicate are referred to as single-dimensional
association rules. Dropping the predicate notation, such a rule can be written simply in the
form item1 => item2 [support, confidence].
The rule indicates that of the AllElectronics customers under study, 2% are 20 to 29 years
of age with an income of 20,000 to 29,000 and have purchased a CD player at
AllElectronics. There is a 60% probability that a customer in this age and income group will
purchase a CD player. Note that this is an association between more than one attribute, or
predicate (i.e., age, income, and buys).
Classification -> process of finding a model (or function) that describes and differentiates
data classes or concepts, for the purpose of using the model to predict the class of objects
whose class label is unknown.
The derived model is based on the analysis of a set of training data (i.e., data objects
whose class label is known).
The derived model may be represented by (i) classification (IF-THEN) rules, (ii) decision
trees, or (iii) neural networks.
Prediction
Prediction models continuous-valued functions. It is used to predict missing or unavailable
numerical data values. Prediction covers both numeric prediction and class label prediction.
Regression analysis is a statistical methodology that is most often used for numeric
prediction. Prediction also includes the identification of distribution trends based on the
available data.
Clustering Analysis
Clustering analyzes data objects without consulting a known class label.
Clusters are formed based on the principle of maximizing the intra-class similarity and
minimizing the inter-class similarity.
Clustering is a method of grouping data into different groups, so that the data in each group
share similar trends and patterns (a minimal sketch follows the list below). The objectives of
clustering are
* To uncover natural groupings
* To initiate hypothesis about the data
* To find consistent and valid organization of the data
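As a minimal sketch of the clustering idea, the plain k-means loop below groups a few hypothetical 2-D points so that intra-group similarity is high; it assumes only numpy and is not tied to any particular algorithm discussed later in the course.

import numpy as np

def kmeans(points, k, iters=20, seed=0):
    # Very small k-means: assign points to the nearest centroid, then recompute centroids
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # distance of every point to every centroid, then nearest-centroid labels
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Hypothetical 2-D data forming two natural groups
pts = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                [8.0, 8.2], [7.9, 8.1], [8.3, 7.7]])
labels, centroids = kmeans(pts, k=2)
print(labels, centroids)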
5 .Outlier Analysis
A database may contain data objects that do not comply with the general model of the data.
These data objects are called outliers.
Most data mining methods discard outliers as noise or exceptions.
However, in applications such as credit card fraud detection, cell phone cloning fraud and
the detection of suspicious activities, the rare events can be more interesting than the more
regularly occurring ones. The analysis of outlier data is referred to as outlier mining.
Outliers may be detected using statistical tests, distance measures, or deviation-based
methods.
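A simple statistical test flags values that lie far from the mean in standard-deviation units. The sketch below uses hypothetical transaction amounts and an arbitrarily chosen z-score threshold of 2.

import statistics

def zscore_outliers(values, threshold=2.0):
    # Flag values whose z-score magnitude exceeds the threshold
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

amounts = [12, 15, 11, 14, 13, 12, 16, 250]   # 250 looks like a suspicious transaction
print(zscore_outliers(amounts))               # -> [250]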
6. Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose
behaviour changes over time. Normally, evolution analysis is used to predict future trends
and thereby support effective decision making.
It may include characterization, discrimination, association and correlation analysis,
classification, prediction, or clustering of time-related data, as well as time-series data
analysis, sequence or periodicity pattern matching, and similarity-based data analysis.
E.g stock market (time-series) data of the last several years available from the New York
Stock Exchange and like to invest in shares of high-tech industrial companies.
A data mining system has the potential to generate thousands or even millions of
patterns, or rules.
A pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test
data with some degree of certainty, (3) potentially useful, and (4) novel.
A pattern is also interesting if it validates a hypothesis that the user wanted to confirm.
An interesting pattern represents knowledge.
Another objective measure for association rules is confidence, which assesses the degree
of certainty of the detected association. This is taken to be the conditional probability
P(Y | X), that is, the probability that a transaction containing X also contains Y. More
formally, support and confidence are defined as
support(X => Y) = P(X U Y)
confidence(X => Y) = P(Y | X)
For example, rules that do not satisfy a confidence threshold of, say, 50% can be considered
uninteresting. Rules below this threshold likely reflect noise, exceptions, or minority cases
and are probably of less value.
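The support and confidence measures defined above can be computed directly from a transaction list; the market-basket transactions below are hypothetical.

def support(transactions, itemset):
    # Fraction of transactions containing every item of the itemset
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    # confidence(lhs => rhs) = P(rhs | lhs) = support(lhs U rhs) / support(lhs)
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Hypothetical transactions
T = [{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread"}, {"milk", "beer"}]
print(support(T, {"milk", "bread"}))          # 0.5
print(confidence(T, {"milk"}, {"bread"}))     # 0.666...  (2 of the 3 milk baskets contain bread)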
Whether a data mining system can generate all of the interesting patterns refers to the
completeness of a data mining algorithm. It is often unrealistic and inefficient for data
mining systems to generate all of the possible patterns.
Whether a data mining system can generate only interesting patterns is an optimization
problem in data mining. It is highly desirable for data mining systems to generate only
interesting patterns. This would be efficient for both users and data mining systems,
because neither would have to search through the generated patterns in order to identify
the truly interesting ones.
Depending on the data mining approach used, techniques from other disciplines may be
applied, such as
o neural networks,
o fuzzy and/or rough set theory,
o knowledge representation,
o inductive logic programming,
o high-performance computing.
Depending on the kinds of data to be mined or on the given data mining application, the
data mining system may also integrate techniques from
o spatial data analysis,
o information retrieval,
o pattern recognition,
o image analysis,
o signal processing,
o computer graphics,
o Web technology,
o economics,
o business,
o bioinformatics,
o psychology
o A complete data mining system usually provides multiple and/or integrated data mining
functionalities.
o Moreover, data mining systems can be distinguished based on the granularity or level of
abstraction of the knowledge mined, including generalized knowledge, primitive-level
knowledge, or knowledge at multiple levels.
o An advanced data mining system should facilitate the discovery of knowledge at multiple
levels of abstraction.
o Data mining systems can also be categorized as those that mine data regularities
(commonly occurring patterns) versus those that mine data irregularities (such as
exceptions, or outliers).
In general, concept description, association and correlation analysis, classification,
prediction, and clustering mine data regularities, rejecting outliers as noise. These
methods may also help detect outliers.
A data mining task can be specified in the form of a data mining query, which is input to
the data mining system.
A data mining query is defined in terms of data mining task primitives. These primitives
allow the user to interactively communicate with the data mining system during
discovery in order to direct the mining process, or examine the findings from different
angles or depths.
The set of task-relevant data to be mined: This specifies the portions of the database or the
set of data in which the user is interested. This includes the database attributes or data
warehouse dimensions of interest
The kind of knowledge to be mined: This specifies the data mining functions to be
performed,
such as characterization, discrimination, association or correlation analysis, classification,
prediction, clustering, outlier analysis, or evolution analysis.
The background knowledge to be used in the discovery process: This knowledge about
the domain to be mined is useful for guiding the knowledge discovery process and
for evaluating the patterns found.
Concept hierarchies (shown in Fig 2) are a popular form of background knowledge, which
allow data to be mined at multiple levels of abstraction.
The interestingness measures and thresholds for pattern evaluation: They may be used to
guide the mining process or, after discovery, to evaluate the discovered patterns.
Different kinds of knowledge may have different interestingness measures.
The expected representation for visualizing the discovered patterns: This refers to the
form in which discovered patterns are to be displayed ,which may include rules, tables,
charts, graphs, decision trees, and cubes.
This facilitates a data mining system’s communication with other information systems
and its integration with the overall information processing environment.
No coupling: means that a DM system will not utilize any function of a DB or DW system.
It may fetch data from a file system, process data using some data mining algorithms, and
then store the mining results in another file.
Drawbacks:
First, a DB system provides flexibility and efficiency at storing, organizing, accessing, and
processing data. Without using a DB/DW system, a DM system may spend a substantial
amount of time finding, collecting, cleaning, and transforming data. In DB/DW systems, data
are typically well organized, indexed, cleaned, integrated, or consolidated, so that finding
the task-relevant, high-quality data becomes an easy task.
Second, there are many tested, scalable algorithms and data structures implemented in DB
and DW systems. Without any coupling of such systems, a DM system will need to use
other tools to extract data, making it difficult to integrate such a system into an information
processing environment. Thus, no coupling represents a poor design.
Loose coupling: means that a DM system will use some facilities of a DB or DW system,
fetching data from a data repository managed by these systems, performing data mining,
and then storing the mining results either in a file or in a designated place in a database or
data warehouse.
Advantages: Loose coupling is better than no coupling because it can fetch any portion of
the data stored in databases or data warehouses by using query processing, indexing, and
other system facilities.
Drawbacks: However, many loosely coupled mining systems are main-memory based.
Because mining does not exploit the data structures and query optimization methods
provided by DB or DW systems, it is difficult for loose coupling to achieve high scalability
and good performance with large data sets.
Semitight coupling: means that besides linking a DM system to a DB/DW system, efficient
implementations of a few essential data mining primitives are provided in the DB/DW
system.
These primitives can include sorting, indexing, aggregation, histogram analysis, multiway
join, and precomputation of some essential statistical measures, such as sum, count, max,
min, standard deviation, and so on.
Moreover, some frequently used intermediate mining results can be precomputed and
stored in the DB/DW system.
Tight coupling: means that a DM system is smoothly integrated into the DB/DW system.
This approach is highly desirable because it facilitates efficient implementations of data
mining functions, high system performance, and an integrated information processing
environment.
Data mining queries and functions are optimized based on mining query analysis, data
structures, indexing schemes, and query processing methods of a DB or DW system.
By technology advances, DM, DB, and DW systems will integrate together as one
information system with multiple functionalities. This will provide a uniform information
processing environment.
Mining methodology and user interaction issues: These reflect the kinds of knowledge
mined, the ability to mine knowledge at multiple granularities, the use of domain knowledge,
ad hoc mining, and knowledge visualization.
Mining different kinds of knowledge in databases: Data mining should cover a wide
spectrum of data analysis and knowledge discovery tasks, including data characterization,
discrimination, association, classification, prediction, clustering and outlier analysis.
Interactive mining of knowledge at multiple levels of abstraction: The data mining process
should be interactive. Interactive mining allows users to focus the search for patterns,
providing and refining data mining requests based on returned results.
Performance issues:
Efficiency and scalability of data mining algorithms: To effectively extract information from a
huge amount of data in databases, data mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms: The huge size of many databases,
the wide distribution of data, and the computational complexity of some data mining
methods are factors motivating the development of algorithms that divide data into
partitions that can be processed in parallel.
7. What is Data Preprocessing and Why preprocess the data?. Also explain Data
cleaning ,Data integration and transformation, Data reduction, Discretization and
concept hierarchy generation. [CO3-H3]
I. Data Preprocessing :-
Data cleaning
o Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
Data integration
o Integration of multiple databases, data cubes, or files
Data transformation
o Normalization and aggregation
Data reduction
o Obtains reduced representation in volume but produces the same or similar
analytical results
Data discretization
o Part of data reduction but with particular importance, especially for numerical
data
Importance
o “Data cleaning is one of the three biggest problems in data warehousing”—Ralph
Kimball
o “Data cleaning is the number one problem in data warehousing”—DCI survey
Ignore the tuple: usually done when the class label is missing (assuming the task is
classification); not effective when the percentage of missing values per attribute varies
considerably.
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically (a minimal sketch follows this list) with
o a global constant : e.g., “unknown”, a new class?!
o the attribute mean
o the attribute mean for all samples belonging to the same class: smarter
o the most probable value: inference-based such as Bayesian formula or decision
tree
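The automatic fill-in strategies above translate into a few lines of Python; the attribute names and values in this sketch are hypothetical.

import statistics

records = [
    {"age": 25, "income": 30000, "class": "yes"},
    {"age": 32, "income": None,  "class": "yes"},
    {"age": 47, "income": 52000, "class": "no"},
    {"age": None, "income": 61000, "class": "no"},
]

def fill_with_mean(rows, attr):
    # Replace missing values of attr with the attribute mean over the known values
    known = [r[attr] for r in rows if r[attr] is not None]
    mean = statistics.fmean(known)
    for r in rows:
        if r[attr] is None:
            r[attr] = mean

def fill_with_class_mean(rows, attr, label="class"):
    # Replace missing values with the mean of attr among rows of the same class
    for r in rows:
        if r[attr] is None:
            same = [x[attr] for x in rows if x[label] == r[label] and x[attr] is not None]
            r[attr] = statistics.fmean(same)

fill_with_class_mean(records, "income")   # missing income becomes 30000, the "yes"-class mean
fill_with_mean(records, "age")            # missing age becomes the overall mean age
print(records)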
Binning
o first sort data and partition into (equal-frequency) bins
o then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
Regression
o smooth by fitting the data into regression functions
Clustering
o detect and remove outliers
Combined computer and human inspection
o detect suspicious values and check by human (e.g., deal with possible outliers)
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
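Using the sorted price data above, the sketch below partitions the values into equal-frequency bins and then smooths them by bin means and by bin boundaries; it is a minimal illustration of the binning bullet, not a library routine.

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equal_frequency_bins(values, n_bins):
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

bins = equal_frequency_bins(prices, 3)
# Smoothing by bin means: every value in a bin is replaced by the bin mean
by_means = [[sum(b) / len(b)] * len(b) for b in bins]
# Smoothing by bin boundaries: each value is replaced by the closest bin boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(bins)       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)   # [[9.0, 9.0, 9.0, 9.0], [22.75, ...], [29.25, ...]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]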
Cluster Analysis
Data integration:
o Combines data from multiple sources into a coherent store
Schema integration: e.g., match A.cust-id with B.cust-no.
o Integrate metadata from different sources
The correlation between two numerical attributes A and B is measured by the correlation coefficient
rA,B = [ Σ(ai - meanA)(bi - meanB) ] / [ (n - 1) σA σB ] = [ Σ(ai bi) - n · meanA · meanB ] / [ (n - 1) σA σB ]
where n is the number of tuples, meanA and meanB are the respective mean values of A and B,
σA and σB are the respective standard deviations of A and B, and Σ(ai bi) is the sum of the AB
cross-product.
If rA,B > 0, A and B are positively correlated (A’s values increase as B’s). The higher, the
stronger correlation.
rA,B = 0: independent; rA,B < 0: negatively correlated.
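The correlation coefficient above can be computed directly; the two short attribute value lists below are hypothetical.

import math

def correlation(a, b):
    # r_{A,B} = (sum(ai*bi) - n*meanA*meanB) / ((n - 1) * sdA * sdB)
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = math.sqrt(sum((x - mean_b) ** 2 for x in b) / (n - 1))
    cross = sum(x * y for x, y in zip(a, b))
    return (cross - n * mean_a * mean_b) / ((n - 1) * sd_a * sd_b)

A = [2, 4, 6, 8, 10]
B = [1, 3, 5, 7, 11]
print(round(correlation(A, B), 3))   # close to +1: A and B are positively correlated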
Χ2 (chi-square) test
The larger the Χ2 value, the more likely the variables are related
The cells that contribute the most to the Χ2 value are those whose actual count is
very different from the expected count
Correlation does not imply causality
o The number of hospitals and the number of car thefts in a city are correlated
o Both are causally linked to a third variable: population
Data Transformation
Min-max normalization maps a value v of attribute A to v' in a new range [new_min, new_max]:
v' = ((v - minA) / (maxA - minA)) * (new_max - new_min) + new_min
o Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,000 is
mapped to (73,000 - 12,000) / (98,000 - 12,000) = 0.709.
Z-score normalization (μ: mean, σ: standard deviation): v' = (v - μ) / σ
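Both normalizations are short formulas in code; the income figures are the ones used in the example above, while the sample incomes for the z-score are hypothetical.

import statistics

def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, stdev):
    return (v - mean) / stdev

print(round(min_max(73_000, 12_000, 98_000), 3))     # 0.709

incomes = [12_000, 35_000, 54_000, 73_000, 98_000]   # hypothetical sample
mu, sigma = statistics.fmean(incomes), statistics.pstdev(incomes)
print(round(z_score(73_000, mu, sigma), 2))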
IV.Data reduction
Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components)
that can best be used to represent the data.
Steps
o Normalize input data: Each attribute falls within the same range
o Compute k orthonormal (unit) vectors, i.e., principal components
o Each input data (vector) is a linear combination of the k principal component vectors
o The principal components are sorted in order of decreasing “significance” or strength
o Since the components are sorted, the size of the data can be reduced by eliminating the weak
components, i.e., those with low variance (using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
Works for numeric data only
Used when the number of dimensions is large
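The PCA steps listed above map onto a short numpy computation. The data are synthetic, and the eigendecomposition of the covariance matrix is used here; some libraries prefer an SVD-based implementation.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
X[:, 2] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)   # third attribute is nearly redundant

# 1. Normalize the input data (zero mean here; unit variance could also be applied)
Xc = X - X.mean(axis=0)

# 2. Compute orthonormal principal components from the covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort components in order of decreasing "significance" (variance explained)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep only the k strongest components to reduce the data
k = 2
reduced = Xc @ eigvecs[:, :k]
print(reduced.shape, np.round(eigvals / eigvals.sum(), 3))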
3. Data Compression
String compression
o There are extensive theories and well-tuned algorithms
o Typically lossless
o But only limited manipulation is possible without expansion
Audio/video compression
o Typically lossy compression, with progressive refinement
o Sometimes small fragments of signal can be reconstructed without reconstructing the
whole
Time sequences (which are not audio)
o are typically short and vary slowly with time
4. Numerosity Reduction
Reduce data volume by choosing alternative, smaller forms of data representation
Parametric methods
o Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data (except possible outliers)
o Example: Log-linear models—obtain value at a point in m-D space as the product on
appropriate marginal subspaces
Non-parametric methods
o Do not assume models
o Major families: histograms, clustering, sampling
Parametric methods
Non-parametric methods
Histograms,
Divide data into buckets and store average (sum) for each bucket
Partitioning rules:
o Equal-width: equal bucket range
o Equal-frequency (or equal-depth)
o V-optimal: the histogram with the least variance (histogram variance is a weighted sum of
the original values that each bucket represents)
o MaxDiff: set bucket boundaries between each pair of adjacent values among the pairs
having the β - 1 largest differences
Clustering
Partition data set into clusters based on similarity, and store cluster representation (e.g.,
centroid and diameter) only
Can be very effective if data is clustered but not if data is “dirty”
Can have hierarchical clustering and be stored in multi-dimensional index tree structures
There are many choices of clustering definitions and clustering algorithms.
Sampling
o Obtain a small sample s to represent the whole data set N
Concept hierarchy generation
o Recursively reduce the data by collecting and replacing low-level concepts (such as
numeric values for age) with higher-level concepts (such as young, middle-aged, or
senior)
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural"
intervals.
o If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the
range into 3 equi-width intervals
o If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4
intervals
o If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into
5 intervals
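A rough sketch of the top-level split of the 3-4-5 heuristic is shown below; it ignores the recursive refinement and the rounding of interval boundaries that a full implementation would apply, and the example ranges are hypothetical.

import math

def three_4_5_split(low, high):
    # Split [low, high] into 3, 4 or 5 equal-width intervals, based on the number
    # of distinct values at the most significant digit of the range
    msd_unit = 10 ** int(math.floor(math.log10(high - low)))   # e.g. 1000 for a range of 6000
    distinct = round((high - low) / msd_unit)                  # distinct values at that digit
    if distinct in (3, 6, 7, 9):
        parts = 3
    elif distinct in (2, 4, 8):
        parts = 4
    else:                                                      # 1, 5 or 10 distinct values
        parts = 5
    width = (high - low) / parts
    return [(low + i * width, low + (i + 1) * width) for i in range(parts)]

print(three_4_5_split(0, 6000))   # 6 distinct values at the most significant digit -> 3 intervals
print(three_4_5_split(0, 800))    # 8 distinct values at the most significant digit -> 4 intervals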
UNIT-III
University questions
PART A
1. Define data.
2. State why the data preprocessing an important issue for data warehousing and data mining.
3. What is the need for discretization in data mining?.
4. What are the various forms of data preprocessing?
5. Define Data Mining.
6. List out any four data mining tools.
7. What do data mining functionalities include?
8. Define patterns.
PART-B
1. (i) Explain the various primitives for specifying Data mining Task. [6]
(ii) Describe the various descriptive statistical measures for data mining.[10]
5. How data mining system are classified? Discuss each classification with an
example. [16]
6. How data mining system can be integrated with a data warehouse? Discuss
with an example. [16]
UNIT 4
PART – A
Apriori algorithm is an influential algorithm for mining frequent item sets for Boolean association
rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of
frequent item set properties.
If a set cannot pass a test, all of its supersets will fail the same test as well
Association rules can be generated as follows. For each frequent itemset l, generate all nonempty
subsets of l. For every nonempty subset s of l, output the rule "s => (l - s)" if
support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.
5. What are the things suffering the performance of Apriori candidate generation
technique? [CO4-L2]
6. Describe the method of generating frequent item sets without candidate generation?
[CO4-L2]
• Uniform minimum support for all levels (or uniform support)
• Using reduced minimum support at lower levels (or reduced support)
• Level-by-level independent
• Level-cross filtering by single item
• Level-cross filtering by k-itemset
Mining is performed under the guidance of various kinds of constraints provided by the user. The
constraints include the following:
• Knowledge type constraints
• Data constraints
• Dimension/level constraints
• Interestingness constraints
• Rule constraints
Two-step process:
• A model is built describing a predefined set of data classes or concepts; the model is
constructed by analyzing database tuples described by attributes.
• The model is used for classification.
A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an
attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class
distributions. The topmost node in a tree is the root node.
The information Gain measure is used to select the test attribute at each node in the decision
tree. Such a measure is referred to as an attribute selection measure or a measure of the
goodness of split.
When a decision tree is built, many of the branches will reflect anomalies in the training data due
to noise or outlier. Tree pruning methods address this problem of over fitting the data.
Approaches:
• Pre pruning
• Post pruning
Prediction can be viewed as the construction and use of a model to assess the class of an
unlabeled sample or to assess the value or value ranges of an attribute that a given sample is
likely to have.
Regression can be used to solve classification problems, but it can also be used for
applications such as forecasting. Regression can be performed using many different types of
techniques; in practice, regression takes a set of data and fits the data to a formula.
20.What are the different types of data used for cluster analysis? [CO4-L1]
The different types of data used for cluster analysis are interval scaled, binary, nominal, ordinal
and ratio scaled data.
PART -B
1. Describe the Mining Frequent Patterns and Associations & Correlations. [CO4-H2]
Basic Concepts
Market Basket Analysis:
Frequent Itemsets, Closed Itemsets, and Association Rules
Frequent Pattern Mining: A Road Map
1. Basic Concepts
Frequent patterns are patterns (such as itemsets, subsequences, or substructures) that appear
in
a data set frequently.
For example, a set of items, such as milk and bread, that appear frequently together in a
transaction data set is a frequent itemset.
A subsequence, such as buying first a PC, then a digital camera, and then a memory card,
if it occurs frequently in a shopping history database, is a (frequent) sequential pattern.
Note that the itemset support defined in Equation is sometimes referred to as relative support,
whereas the occurrence frequency is called the absolute support.
Rules that satisfy both a minimum support threshold (min sup) and a minimum confidence
threshold (min conf) are called Strong Association Rules.
1.Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently
as a predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets: By definition, these rules must
satisfy minimum support and minimum confidence.
Closed Itemsets : An itemset X is closed in a data set S if there exists no proper super-itemset Y
such that Y has the same support count as X in S. An itemset X is a closed frequent itemset in set
S if X is both closed and frequent in S.
Maximal frequent itemset: An itemset X is a maximal frequent itemset (or max-itemset) in set S if X
is frequent, and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in S.
Frequent pattern mining can be classified in various ways, based on the following criteria:
1. Based on the completeness of patterns to be mined: The following can be mined based on the
Completeness of patterns.
Frequent itemsets, Closed frequent itemsets, Maximal frequent itemsets,
Constrained frequent itemsets (i.e., those that satisfy a set of user-defined constraints),
Approximate frequent itemsets (i.e., those that derive only approximate support counts for
the mined frequent itemsets),
Near-match frequent itemsets (i.e., those that tally the support count of the near or almost
matching itemsets),
Top-k frequent itemsets (i.e., the k most frequent itemsets for a user-specified value, k),
The above rules are single-dimensional association rules; they refer to only one dimension, buys.
The following rule is an example of a multidimensional rule:
Apriori is an algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent
itemsets for Boolean association rules. The name of the algorithm is based on the fact that the
algorithm uses prior knowledge of frequent itemset properties.
Apriori uses an iterative approach known as a level-wise search, where k-itemsets are used
to explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the
count for each item, and collecting those items that satisfy minimum support. The resulting set is
denoted L1.
Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so
on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of
the database.
To improve the efficiency of the level-wise generation of frequent itemsets, an important
property called the Apriori property, presented below, is used to reduce the search space.
Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
E.g
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets,
C1. The algorithm simply scans all of the transactions for count the number of occurrences of each
item.
2.Minimum support count is 2, i.e min sup = 2. The set of frequent 1-itemsets, L1, can then be
determined. It consists of the candidate 1-itemsets satisfying minimum support. In our example, all
of the candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join to generate
a candidate set of 2-itemsets, C2. C2 consists of 2-itemsets. Note that no candidates are
removed fromC2 during the prune step because each subset of the candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is
accumulated, as shown below
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets
in C2 having minimum support.
6. The generation of the set of candidate 3-itemsets, C3, is as follows. From the join step,
we get C3 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on
the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine
that the four latter candidates cannot possibly be frequent. The resulting pruned version of C3 is
shown below.
7. The transactions in D are scanned in order to determine L3, consisting of those candidate
3-itemsets in C3 having minimum support (Figure 5.2).
Pseudo-code:
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
Prune step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
In the pseudo-code, Ck denotes the candidate itemset of size k and Lk the frequent itemset of
size k.
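A compact, runnable version of the level-wise search in the pseudo-code above is sketched below. It favours clarity over efficiency; the nine transactions are an assumption, chosen to be consistent with the support counts (e.g., sc{I1} = 6, sc{I1,I2} = 4) used in the worked example that follows.

from itertools import combinations

def apriori(transactions, min_support):
    # Return all frequent itemsets (as frozensets) together with their support counts
    transactions = [frozenset(t) for t in transactions]
    items = {frozenset([i]) for t in transactions for i in t}

    def count(candidates):
        # one scan of the database: count how many transactions contain each candidate
        return {c: sum(c <= t for t in transactions) for c in candidates}

    frequent = {}
    Lk = {c: s for c, s in count(items).items() if s >= min_support}
    k = 1
    while Lk:
        frequent.update(Lk)
        # join step: merge frequent k-itemsets into (k+1)-candidates;
        # prune step: every k-subset of a candidate must itself be frequent
        candidates = {
            a | b for a in Lk for b in Lk
            if len(a | b) == k + 1
            and all(frozenset(s) in Lk for s in combinations(a | b, k))
        }
        Lk = {c: s for c, s in count(candidates).items() if s >= min_support}
        k += 1
    return frequent

T = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
for itemset, sc in sorted(apriori(T, min_support=2).items(),
                          key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), sc)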
The Apriori Algorithm: Example
• In the first iteration of the algorithm, each item is a member of the set of
candidate.
• The set of frequent 1-itemsets, L1 , consists of the candidate 1-itemsets
satisfying minimum support.
The generation of the set of candidate 3-itemsets, C3 , involves use of the Apriori
Property.
In order to find C3, we compute L2 Join L2.
C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
Now the join step is complete, and the prune step will be used to reduce the size of C3.
The prune step helps to avoid heavy computation due to a large Ck.
Based on the Apriori property that all subsets of a frequent itemset must also be
frequent, we can determine that the four latter candidates cannot possibly be frequent.
How ?
For example , lets take {I1, I2, I3}. The 2-item subsets of it are {I1, I2}, {I1, I3} & {I2,
I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, We will keep {I1, I2, I3}
in C3.
Lets take another example of {I2, I3, I5} which shows how the pruning is performed.
The 2-item subsets are {I2, I3}, {I2, I5} & {I3,I5}.
BUT, {I3, I5} is not a member of L2 and hence it is not frequent violating Apriori
Property. Thus We will have to remove {I2, I3, I5} from C3.
Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking for all members of result of
Join operation for Pruning.
Now, the transactions in D are scanned in order to determine L3, consisting of those
candidates 3-itemsets in C3 having minimum support.
Thus, C4 = φ , and algorithm terminates, having found all of the frequent items.
This completes our Apriori Algorithm.
Once the frequent itemsets from the transactions in a database D have been found, strong
association rules can be generated from them (where strong association rules satisfy both
minimum support and minimum confidence). This can be done using the confidence equation
confidence(A => B) = P(B | A) = support_count(A ∪ B) / support_count(A), as follows.
Back to e.g
L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3},
{I1,I2,I5}}.
Let us take l = {I1, I2, I5}.
All its nonempty subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.
Let the minimum confidence threshold be, say, 70%.
The resulting association rules are shown below, each listed with its confidence.
o Rule 1: I1 ^ I2 => I5
Confidence = sc{I1,I2,I5}/sc{I1,I2} = 2/4 = 50%
Rule 1 is rejected.
o Rule 2: I1 ^ I5 => I2
Confidence = sc{I1,I2,I5}/sc{I1,I5} = 2/2 = 100%
Rule 2 is selected.
o Rule 3: I2 ^ I5 => I1
Confidence = sc{I1,I2,I5}/sc{I2,I5} = 2/2 = 100%
Rule 3 is selected.
o Rule 4: I1 => I2 ^ I5
Confidence = sc{I1,I2,I5}/sc{I1} = 2/6 = 33%
Rule 4 is rejected.
o Rule 5: I2 => I1 ^ I5
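The confidence calculations above can be automated: for a frequent itemset l, generate each rule s => (l - s) and keep it only when its confidence meets the threshold. The support counts below follow the worked example (and the nine-transaction set assumed in the earlier Apriori sketch).

from itertools import combinations

def rules_from_itemset(itemset, support_counts, min_conf):
    # Generate strong rules s => (itemset - s) whose confidence >= min_conf
    itemset = frozenset(itemset)
    strong = []
    for r in range(1, len(itemset)):
        for s in combinations(itemset, r):
            s = frozenset(s)
            conf = support_counts[itemset] / support_counts[s]
            if conf >= min_conf:
                strong.append((sorted(s), sorted(itemset - s), round(conf, 2)))
    return strong

# Support counts from the worked example (sc{I2} = 7 is assumed from the full transaction set)
sc = {frozenset(k): v for k, v in {
    ("I1",): 6, ("I2",): 7, ("I5",): 2,
    ("I1", "I2"): 4, ("I1", "I5"): 2, ("I2", "I5"): 2,
    ("I1", "I2", "I5"): 2,
}.items()}
print(rules_from_itemset({"I1", "I2", "I5"}, sc, min_conf=0.7))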
• Apriori Advantages:
– Uses large itemset property.
– Easily parallelized
– Easy to implement.
• Apriori Disadvantages:
– Assumes transaction database is memory resident.
– Requires up to m database scans
4.Mining Frequent Itemsets without Candidate Generation
Steps:
1. Start from each frequent length-1 pattern (as an initial suffix pattern).
2. Construct its conditional pattern base which consists of the set of prefix paths in
the FP-Tree co-occurring with suffix pattern.
3. Then, Construct its conditional FP-Tree & perform mining on such a tree.
4. The pattern growth is achieved by concatenation of the suffix pattern with the
frequent patterns generated from a conditional FP-Tree.
5. The union of all frequent patterns (generated by step 4) gives the required
frequent itemset.
Lets start from I5. The I5 is involved in 2 branches namely {I2 I1 I5: 1} and {I2 I1
I3 I5: 1}.
Therefore considering I5 as suffix, its 2 corresponding prefix paths would be {I2
I1: 1} and {I2 I1 I3: 1}, which forms its conditional pattern base.
Out of these, Only I1 & I2 is selected in the conditional FP-Tree because I3 is not
satisfying the minimum support count.
o For I1 , support count in conditional pattern base = 1 + 1 = 2
o For I2 , support count in conditional pattern base = 1 + 1 = 2
o For I3, support count in conditional pattern base = 1
o Thus support count for I3 is less than required min_sup which is 2 here.
Now , We have conditional FP-Tree with us.
All frequent pattern corresponding to suffix I5 are generated by considering all
possible combinations of I5 and conditional FP-Tree.
The same procedure is applied to suffixes I4, I3 and I1.
Note: I2 is not taken into consideration for suffix because it doesn’t have any
prefix at all.
Advantages of FP-growth
– No candidate generation is needed
– Only two scans of the database are required
– The data set is compressed into a compact FP-Tree
Disadvantages of FP-Growth
– FP-Tree may not fit in memory!!
– FP-Tree is expensive to build
Pseudo Code
procedure FP growth(Tree, a)
Both the Apriori and FP-growth methods mine frequent patterns from a set of
transactions
in TID-itemset format (that is,{ TID : itemset}), where TID is a transaction-id and itemset
is the set of items bought in transaction TID. This data format is known as horizontal
data format.
Alternatively, data can also be presented in item-TID-set format (that is,{item : TID-
set}), where item is an item name, and TID set is the set of transaction identifiers
containing the item. This format is known as vertical data format.
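Converting between the two layouts is a small transformation; the TIDs and items below are hypothetical.

from collections import defaultdict

# Horizontal data format: {TID : itemset}
horizontal = {
    "T100": {"I1", "I2", "I5"},
    "T200": {"I2", "I4"},
    "T300": {"I2", "I3"},
}

# Vertical data format: {item : TID-set}
vertical = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        vertical[item].add(tid)

print(dict(vertical))
# In vertical format, the support count of an itemset is the size of the
# intersection of its items' TID-sets
print(len(vertical["I2"] & vertical["I3"]))   # support count of {I2, I3} -> 1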
********************************************************************************
Association rules generated from mining data at multiple levels of abstraction are
called multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies
under a support-confidence framework. A top-down strategy is used, where counts are
collected for the calculation of frequent itemsets at each concept level, starting at the
concept level 1 and working downward in the hierarchy towards the specific concept
levels, until no more frequent itemsets can be found. For each level, any algorithm for
discovering frequent itemsets may be used, such as Apriori or its variations.
Using uniform minimum support for all levels (referred to as uniform support):
The same minimum support threshold is used when mining at each level of abstraction.
For example, in the following figure, a minimum support threshold of 5% is used throughout
(e.g., for mining from "computer" down to "laptop computer"). Both "computer" and
"laptop computer" are found to be frequent, while "desktop computer" is not.
The method is simple in that users are required to specify only one minimum
support threshold. An Apriori-style optimization can be used, based on the knowledge
that an ancestor is a superset of its descendants: the search avoids examining
itemsets containing any item whose ancestors do not have minimum support.
Using group-based minimum support: When mining multilevel rules, users often
know which groups of items are more important than others, so it is useful to set up
user-specific, item-based, or group-based minimum support thresholds.
For example, a user could set the minimum support thresholds based on product
price or on items of interest, such as low support thresholds for laptop computers and
flash drives, in order to pay particular attention to association patterns containing items
in these categories.
Techniques for mining multidimensional association rules can be categorized into two
basic approaches regarding the treatment of quantitative attributes.
Following Figure shows the lattice of cuboids defining a data cube for the dimensions
age, income, and buys. The cells of an n-dimensional cuboid can be used to store the
support counts of the corresponding n-predicate sets.
The base cuboid aggregates the task-relevant data by age, income, and buys;
the 2-D cuboid, (age, income), aggregates by age and income, and so on; the 0-D
(apex) cuboid contains the total number of transactions in the task-relevant data.
Quantitative association rules commonly have the form

Aquan1 ^ Aquan2 => Acat

where Aquan1 and Aquan2 are tests on quantitative attribute intervals, and Acat tests a
categorical attribute from the task-relevant data. Such rules have been referred to as
two-dimensional quantitative association rules, because they contain two quantitative
dimensions.
An example of such a 2-D quantitative association rule is

age(X, "30...39") ^ income(X, "42K...48K") => buys(X, "HDTV")
This approach maps pairs of quantitative attributes onto a 2-D grid for tuples satisfying a
given categorical attribute condition. The grid is then searched for clusters of points
from which the association rules are generated. The following steps are involved in
ARCS:
Binning: Quantitative attributes can have a very wide range of values defining their
domain. A very large 2-D grid would result if age and income were plotted as axes, with
each possible value of age assigned a unique position on one axis and each possible
value of income assigned a unique position on the other axis.
To keep grids down to a manageable size, we instead partition the ranges of
quantitative attributes into intervals. The partitioning process is referred to as binning,
that is, the intervals are considered "bins." Three common binning strategies are
as follows:
Equal-width binning, where the interval size of each bin is the same
Equal-frequency binning, where each bin has approximately the same number of
tuples assigned to it.
Clustering-based binning, where clustering is performed on the quantitative attribute
to group neighboring points (judged based on various distance measures) into the
same bin
Finding frequent predicate sets: Once the 2-D array containing the count distribution
for each category is set up, it can be scanned to find the frequent predicate sets (those
satisfying minimum support). Strong association rules can then be generated from these
predicate sets, using a rule generation algorithm.
Clustering the association rules: The strong association rules obtained in the
previous
step are then mapped to a 2-D grid. Following figure shows a 2-D grid for 2-D
quantitative
association rules predicting the condition buys (X, “HDTV”) on the rule right-hand side,
given the quantitative attributes age and income.
The four rules can be combined or "clustered" together to form a single, simpler rule
that covers the same region of the grid.
If a minimum support of 30% and a minimum confidence of 60% are given, then the
following association rule is discovered:
The above rule is a strong association rule, since its support value of 4,000/10,000 = 40%
and its confidence value of 4,000/6,000 = 66% satisfy the minimum support and minimum
confidence thresholds, respectively.
That is, a correlation rule is measured not only by its support and confidence but also by
the correlation between itemsets A and B.
Lift is a simple correlation measure that is given as follows. The occurrence of itemset
A is independent of the occurrence of itemset B if P(A U B) = P(A) P(B); otherwise,
itemsets A and B are dependent and correlated as events.
This definition can easily be extended to more than two itemsets. The lift between the
occurrence of A and B can be measured by computing

lift(A, B) = P(A U B) / ( P(A) P(B) )
If the resulting value of above Equation is less than 1, then the occurrence of A
is negatively
correlated with the occurrence of B.
If the resulting value is greater than 1, then A and B are positively correlated,
meaning that the occurrence of one implies the occurrence of the other.
If the resulting value is equal to 1, then A and B are independent and there is no
correlation between them.
The above equation is equivalent to P(B | A) / P(B), or conf(A => B) / sup(B), which is
also referred to as the lift of the association (or correlation) rule A => B.
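A minimal sketch of computing support, confidence, and lift from raw counts follows. The totals 10,000, 6,000, and 4,000 come from the example above; the count for itemset B alone is a made-up figure used only so the lift can be evaluated.

def lift(n_total, n_a, n_b, n_ab):
    """lift(A, B) = P(A and B) / (P(A) * P(B))."""
    p_a, p_b, p_ab = n_a / n_total, n_b / n_total, n_ab / n_total
    return p_ab / (p_a * p_b)

def confidence(n_a, n_ab):
    """conf(A => B) = P(B | A) = sc(A and B) / sc(A)."""
    return n_ab / n_a

n_total, n_a, n_ab = 10_000, 6_000, 4_000
n_b = 7_500  # hypothetical count of transactions containing B

print('support    =', n_ab / n_total)            # 0.40
print('confidence =', confidence(n_a, n_ab))     # ~0.66
print('lift       =', lift(n_total, n_a, n_b, n_ab))  # < 1 => negatively correlated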
A data mining process may uncover a very large number of rules, many of which are
uninteresting to the users. A good practice is to have the users specify constraints to
limit the search space. This strategy is known as constraint-based mining.
Knowledge type constraints: These specify the type of knowledge to be mined, such as
association or correlation.
Data constraints: These specify the set of task-relevant data.
Dimension/level constraints: These specify the desired dimensions (or attributes) of the
data, or levels of the concept hierarchies, to be used in mining.
Interestingness constraints: These specify thresholds on statistical measures of rule
interestingness, such as support, confidence, and correlation.
Rule constraints: These specify the form of rules to be mined. Such constraints may be
expressed as metarules
Metarule-guided mining. A metarule is a rule template of the form

P1(X, Y) ^ P2(X, W) => C(X)

where P1 and P2 are predicate variables that are instantiated to attributes from the given
database during the mining process, X is a variable representing a customer, Y and W take
on values of the attributes assigned to P1 and P2, respectively, and C(X) is the consequent
of interest.
The data mining system can then search for rules that match the given metarule. For
instance, Rule (2) matches or complies with Metarule (1).
Rule constraints specify expected set/subset relationships of the variables in the mined
rules, constant initialization of variables, and aggregate functions.
1.Basic Concepts
Databases are rich with hidden information that can be used for intelligent decision
making.
Classification and prediction are two forms of data analysis that can be used to extract
models describing important data classes or to predict future data trends.
A bank loans officer needs analysis of her data in order to learn which loan applicants
are "safe" and which are "risky" for the bank.
A marketing manager at AllElectronics needs data analysis to help guess whether a
customer with a given profile will buy a new computer.
A medical researcher wants to analyze breast cancer data in order to predict which one
of
three specific treatments a patient should receive.
In each of these examples, the data analysis task is classification, where a model or
classifier is constructed to predict categorical labels, such as “safe” or “risky” for the
loan application data; “yes” or “no” for the marketing data; or “treatment A,” “treatment
B,” or “treatment C” for the medical data.
Suppose the marketing manager would like to predict how much a given customer will spend
during a sale at AllElectronics. This data analysis task is an example of numeric
prediction, where the model constructed predicts a continuous-valued function, or
ordered value, as opposed to a categorical label. This model is a predictor.
Classification and numeric prediction are the two major types of prediction problems;
here, the term prediction is used to refer to numeric prediction.
How does classification work? Data classification is a two-step process, as shown for the loan
application data of Figure 1.
Decision tree induction is the learning of decision trees from class-labelled training
tuples.
A decision tree is a flowchart-like tree structure, where each internal node (non leaf
node)
denotes a test on an attribute, each branch represents an outcome of the test, and each
leaf
node (or terminal node) holds a class label. The topmost node in a tree is the root
node.Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals.
Decision trees are used for classification- Given a tuple, X, for which the associated
class label is unknown, the attribute values of the tuple are tested against the decision
tree. A path is traced from the root to a leaf node, which holds the class prediction for
that tuple. Decision trees can easily be converted to classification rules.
Decision tree induction algorithms have been used for classification in many application
areas, such as medicine, manufacturing and production, financial analysis, astronomy,
and molecular biology.
Algorithm: Generate decision tree. Generate a decision tree from the training tuples of
data
partition D.
Input:
Data partition, D, which is a set of training tuples and their associated class
labels;
attribute list, the set of candidate attributes;
Method:
(1) create a node N;
(2) if tuples in D are all of the same class, C then
(3) return N as a leaf node labeled with the class C;
(4) if attribute list is empty then
(5) return N as a leaf node labeled with the majority class in D; // majority voting
(6) apply Attribute selection method(D, attribute list) to find the “best” splitting criterion;
(7) label node N with splitting criterion;
(8) if splitting attribute is discrete-valued and multiway splits allowed then // not
restricted to binary trees
(9) attribute list = attribute list - splitting attribute; // remove splitting attribute
(10) for each outcome j of splitting criterion
// partition the tuples and grow subtrees for each partition
(11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12) if Dj is empty then
(13) attach a leaf labeled with the majority class in D to node N;
(14) else attach the node returned by Generate decision tree(Dj, attribute list) to node N;
endfor
(15) return N;
Attribute Selection Measures
o An attribute selection measure is a heuristic for selecting the splitting criterion that "best"
separates a given data partition, D, of class-labelled training tuples into individual classes.
o If we were to split D into smaller partitions according to the outcomes of the splitting
criterion, ideally each partition would be pure.
o Attribute selection measures are also known as splitting rules because they
determine how the tuples at a given node are to be split.
age           pi   ni   I(pi, ni)
Youth          2    3    0.971
Middle-aged    4    0    0
Senior         3    2    0.971

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
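The entropy values in the table, and the Gain(age) they imply, can be reproduced with a short Python sketch. Only the age attribute is computed here, and the class counts pi/ni are taken directly from the table above (9 positive and 5 negative tuples in total).

from math import log2

def entropy(counts):
    """Information (entropy) of a class distribution, in bits."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# (pi, ni) counts per value of age, from the table above
age_partitions = {'youth': (2, 3), 'middle-aged': (4, 0), 'senior': (3, 2)}

p = sum(pi for pi, _ in age_partitions.values())   # 9 positive tuples
n = sum(ni for _, ni in age_partitions.values())   # 5 negative tuples
info_D = entropy([p, n])                           # Info(D) ~ 0.940

# Expected information requirement after splitting on age
info_age = sum((pi + ni) / (p + n) * entropy([pi, ni])
               for pi, ni in age_partitions.values())

print('Info(D)     =', round(info_D, 3))             # 0.940
print('Info_age(D) =', round(info_age, 3))           # 0.694
print('Gain(age)   =', round(info_D - info_age, 3))  # 0.246

Since 0.246 is larger than Gain(income), Gain(student), and Gain(credit_rating) listed above, age would be chosen as the splitting attribute.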
C4.5 (a successor of ID3) uses the gain ratio to overcome this bias; it normalizes the
information gain using the split information of the attribute:

GainRatio(A) = Gain(A) / SplitInfo_A(D),  where  SplitInfo_A(D) = - Σj (|Dj| / |D|) log2(|Dj| / |D|)

The attribute with the maximum gain ratio is selected as the splitting attribute.
3. Gini index
The Gini index is used in CART. Using the notation described above, the Gini index
measures the impurity of D, a data partition or set of training tuples, as

Gini(D) = 1 - Σi pi²

where pi is the probability that a tuple in D belongs to class Ci.
E.g., if the binary split on the subset {medium, high} gives a Gini index of 0.30, and this is
the lowest value among the candidate splits, it is selected as the best split.
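A small helper for the Gini impurity and the weighted Gini index of a binary split, as described above; the class counts used in the demonstration are hypothetical.

def gini(counts):
    """Gini index of a partition: 1 - sum(p_i^2) over the class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini index of a binary split, given class counts per branch."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

# Hypothetical split of a 14-tuple set into two branches (class counts: yes, no)
left, right = (6, 2), (3, 3)
print('Gini(D1), Gini(D2) =', round(gini(left), 3), round(gini(right), 3))
print('Gini_split(D)      =', round(gini_split([left, right]), 3))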
Tree Pruning
Overfitting: An induced tree may overfit the training data
o Too many branches, some may reflect differences due to noise or outliers
o Poor accuracy for unseen samples
Two approaches to avoid overfitting
o Prepruning: Halt tree construction early—do not split a node if this would
result in the goodness measure falling below a threshold
Difficult to choose an appropriate threshold
o Postpruning: Remove branches from a “fully grown” tree—get a sequence of
progressively pruned trees
Use a set of data different from the training data to decide which is the
“best pruned tree”
SPRINT
o Constructs an attribute list data structure
PUBLIC
o Integrates tree splitting and tree pruning: stop growing the tree earlier
RainForest
o Builds an AVC-list (attribute, value, class label)
BOAT Bootstrapped Optimistic Algorithm for Tree Construction
o Uses bootstrapping to create several small samples
Let D be a training set of tuples and their associated class labels, and each tuple
is represented by an n-D attribute vector X = (x1, x2, …, xn)
Suppose there are m classes C1, C2, …, Cm.
Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
This can be derived from Bayes' theorem:

P(Ci | X) = P(X | Ci) P(Ci) / P(X)

Since P(X) is constant for all classes, only P(X | Ci) P(Ci) needs to be maximized.
Under the naïve assumption of class-conditional independence, P(X | Ci) = Π(k=1..n) P(xk | Ci).
If an attribute Ak is continuous-valued, it is typically assumed to follow a Gaussian
distribution with mean μCi and standard deviation σCi, and P(xk | Ci) is computed as

P(xk | Ci) = g(xk, μCi, σCi)

where g is the Gaussian (normal) density function.
Naïve Bayesian Classifier: Training Dataset
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data sample
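The training table and the data sample are shown only as a figure in the original material and are not reproduced here, so the following minimal Python sketch uses a tiny hypothetical buys_computer-style data set purely to illustrate how the prior and class-conditional probabilities are multiplied (no Laplacian correction, for brevity).

from collections import Counter, defaultdict

# Hypothetical training tuples: (age, student, class label buys_computer)
data = [
    ('youth', 'no', 'no'), ('youth', 'yes', 'yes'), ('middle', 'no', 'yes'),
    ('senior', 'yes', 'yes'), ('senior', 'no', 'no'), ('middle', 'yes', 'yes'),
]
attrs = ['age', 'student']

priors = Counter(label for *_, label in data)       # class counts for P(Ci)
cond = defaultdict(Counter)                         # counts for P(xk | Ci)
for *values, label in data:
    for attr, value in zip(attrs, values):
        cond[(attr, label)][value] += 1

def classify(sample):
    """Pick the class Ci maximizing P(Ci) * prod_k P(xk | Ci)."""
    best, best_score = None, -1.0
    for label, n_label in priors.items():
        score = n_label / len(data)
        for attr, value in zip(attrs, sample):
            score *= cond[(attr, label)][value] / n_label
        if score > best_score:
            best, best_score = label, score
    return best

print(classify(('youth', 'yes')))   # hypothetical unseen tuple -> 'yes'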
o Advantages
Easy to implement
Good results obtained in most of the cases
o Disadvantages
Assumption: class conditional independence, therefore loss of accuracy
Practically, dependencies exist among variables
E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.
Dependencies among these cannot be modeled by Naïve Bayesian
Classifier
o To deal with these dependencies
Bayesian Belief Networks are used.
o Several scenarios:
Given both the network structure and all variables observable: learn only
the CPTs
Network structure known, some hidden variables: gradient descent
(greedy hill-climbing) method, analogous to neural network learning
Network structure unknown, all variables observable: search through the
model space to reconstruct network topology
Unknown structure, all hidden variables: No good algorithms known for
this purpose
A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a
class labeled data set D, let ncovers be the number of tuples covered by R; ncorrect be the
number of tuples correctly classified by R; and |D| be the number of tuples in D. We can
define the coverage and accuracy of R as

coverage(R) = ncovers / |D|
accuracy(R) = ncorrect / ncovers

e.g., consider rule R1 above, which covers 2 of the 14 tuples. It can correctly classify
both tuples. Therefore, coverage(R1) = 2/14 = 14.28% and accuracy(R1) = 2/2 = 100%.
(See table.)
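The two measures can be computed directly from the counts defined above; since the tuple-level details of R1 are not repeated here, the numbers are simply passed in as arguments.

def rule_quality(n_covers, n_correct, n_data):
    """coverage(R) = ncovers / |D|, accuracy(R) = ncorrect / ncovers."""
    return n_covers / n_data, n_correct / n_covers

coverage, accuracy = rule_quality(n_covers=2, n_correct=2, n_data=14)
print(f'coverage = {coverage:.2%}, accuracy = {accuracy:.2%}')
# coverage ~ 14.3%, accuracy = 100.00%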
To extract rules from a decision tree, one rule is created for each path from the root to a
leaf node. Each splitting criterion along a given path is logically ANDed to form the rule
antecedent (“IF” part). The leaf node holds the class prediction, forming the rule
consequent (“THEN” part).
E.g Extracting classification rules from a decision tree. The above decision tree can be
converted to classification IF-THEN rules by tracing the path from the root node to each
leaf node in the tree. The rules extracted from Figure are
IF-THEN rules can be extracted directly from the training data (i.e., without having to
generate a decision tree first) using a sequential covering algorithm. In this the rules are
learned sequentially (one at a time), where each rule for a given class will ideally cover
many of the tuples of that class (and none of the tuples of other classes).
Input:
D, a data set class-labeled tuples;
Att vals, the set of all attributes and their possible values.
Method:
(1) Rule_set = { }; // initial set of rules learned is empty
(2) for each class c do
(3) repeat
A Neuron (= a perceptron)
For example, the n-dimensional input vector x is mapped into variable y by means of the
scalar product and a nonlinear function mapping:

y = sign( Σ(i=0..n) wi xi + μk )

where the wi are weights, the xi are the input values, and μk is a bias (threshold) term.
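A minimal sketch of this neuron computation follows; the weights, bias, and input vector are made-up values used only for illustration.

import numpy as np

def perceptron_output(x, w, bias):
    """y = sign( sum_i w_i * x_i + bias ): scalar product followed by a
    nonlinear (sign) mapping, as in the formula above."""
    return np.sign(np.dot(w, x) + bias)

x = np.array([0.5, -1.0, 2.0])     # hypothetical input tuple (attribute values)
w = np.array([0.4, 0.3, -0.1])     # hypothetical weights
print(perceptron_output(x, w, bias=0.2))   # -> 1.0 or -1.0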
The inputs to the network correspond to the attributes measured for each
training tuple
Inputs are fed simultaneously into the units making up the input layer
They are then weighted and fed simultaneously to a hidden layer
The number of hidden layers is arbitrary, although usually only one
The weighted outputs of the last hidden layer are input to units making up the
output layer, which emits the network's prediction
The network is feed-forward in that none of the weights cycles back to an input
unit or to an output unit of a previous layer
From a statistical point of view, networks perform nonlinear regression: Given
enough hidden units and enough training samples, they can closely approximate
any function
First decide the network topology: number of units in the input layer, number of
hidden layers (if > 1), number of units in each hidden layer, and number of units
in the output layer
Normalize the input values for each attribute measured in the training tuples to the
range [0.0, 1.0]
For discrete-valued attributes, one input unit per domain value can be used, each
initialized to 0
For classification with more than two classes, one output unit per class
is used
If a trained network's accuracy is unacceptable, repeat the training process with a
different network topology or a different set of initial weights.
Backpropagation
Iteratively process a set of training tuples & compare the network's prediction
with the actual known target value
For each training tuple, the weights are modified to minimize the mean squared
error between the network's prediction and the actual target value
Modifications are made in the “backwards” direction: from the output layer,
through each hidden layer down to the first hidden layer, hence
“backpropagation”
Steps
o Initialize weights (to small random #s) and biases in the network
o Propagate the inputs forward (by applying activation function)
o Backpropagate the error (by updating weights and biases)
o Terminating condition (when error is very small, etc.)
Efficiency of backpropagation: each epoch (one iteration through the training set)
takes O(|D| × w) time, with |D| tuples and w weights, but the number of epochs can be
exponential in n, the number of inputs, in the worst case.
Let the data D be {(X1, y1), (X2, y2), …, (X|D|, y|D|)}, where Xi is a training tuple with
associated class label yi.
There are an infinite number of lines (hyperplanes) separating the two classes; the goal
is to find the best one.
SVM searches for the hyperplane with the largest margin, i.e., maximum marginal
hyperplane (MMH)
A separating hyperplane can be written as
W . X+b = 0;
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
For 2-D it can be written as
w0 + w1x1 +w2x2 = 0:
The hyperplane defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the
margin) are support vectors
This becomes a constrained (convex) quadratic optimization problem: a quadratic
objective function with linear constraints, which can be solved using quadratic
programming (QP) and Lagrangian multipliers.
That is, any tuple that falls on or above H1 belongs to class +1, and any tuple that
falls
on or below H2 belongs to class -1. Combining the two inequalities of above two
Equations
we get
yi (w0 + w1 x1 + w2 x2) ≥ 1, for every training tuple.
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the “sides” defining the
margin) satisfy above Equation and are called support vectors.
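A hedged sketch of finding a maximum-margin separating hyperplane with scikit-learn follows (assuming scikit-learn is available; the toy points are made up). The fitted support vectors are exactly the tuples lying on H1/H2 described above.

import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable 2-D training tuples with labels +1 / -1
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1e6)   # very large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]          # weight vector W
b = clf.intercept_[0]     # bias b, so the hyperplane is W.X + b = 0
print('hyperplane: %.2f*x1 + %.2f*x2 + %.2f = 0' % (w[0], w[1], b))
print('support vectors:\n', clf.support_vectors_)
print('prediction for (2, 2):', clf.predict([[2, 2]]))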
Association rules show strong associations between attribute-value pairs (or items) that
occur frequently in a given data set. Such analysis is useful in many decision-making
processes, such as product placement, catalog design, and cross-marketing.
Association rules are mined in a two-step process frequent itemset mining, and rule
generation.
The first step searches for patterns of attribute-value pairs that occur repeatedly
in a data set, where each attribute-value pair is considered an item. The resulting
attribute value pairs form frequent itemsets.
The second step analyses the frequent itemsets in order to generate association
rules.
Advantages
CBA uses an iterative approach to frequent itemset mining, where multiple passes are made
over the data and the derived frequent itemsets are used to generate and test longer
itemsets. In general, the number of passes made is equal to the length of the longest
rule found. The complete set of rules satisfying the minimum confidence and minimum
support thresholds is found and then included in the classifier.
CBA uses a method to construct the classifier, where the rules are organized
according to decreasing preference based on their confidence and support. In this
way, the set of rules making up the classifier form a decision list.
It uses several rule pruning strategies with the help of a tree structure for efficient
storage and retrieval of rules.
CMAR adopts a variant of the FP-growth algorithm to find the complete set of rules
satisfying the minimum confidence and minimum support thresholds. FP-growth
uses a tree structure, called an FP-tree, to register all of the frequent itemset
information contained in the given data set, D. This requires only two scans of D.
The frequent itemsets are then mined from the FP-tree.
CMAR uses an enhanced FP-tree that maintains the distribution of class labels
among tuples satisfying each frequent itemset. In this way, it is able to combine rule
generation together with frequent itemset mining in a single step.
CMAR employs another tree structure to store and retrieve rules efficiently and to
prune rules based on confidence, correlation, and database coverage. Rule pruning
strategies are triggered whenever a rule is inserted into the tree.
CMAR also prunes rules for which the rule antecedent and class are not positively
correlated, based on a χ² (chi-square) test of statistical significance.
CPAR uses an algorithm for classification known as FOIL (First Order Inductive
Learner). FOIL builds rules to differentiate positive tuples ( having class buys
computer = yes) from negative tuples (such as buys computer = no).
For multiclass problems, FOIL is applied to each class. That is, for a class, C, all
tuples of class C are considered positive tuples, while the rest are considered
negative tuples. Rules are generated to differentiate C tuples from all others. Each
time a rule is generated, the positive samples it satisfies (or covers) are removed
until all the positive tuples in the data set are covered.
CPAR relaxes this step by allowing the covered tuples to remain under
consideration, but reducing their weight. The process is repeated for each class. The
resulting rules are merged to form the classifier rule set.
Eager learners
Decision tree induction, Bayesian classification, rule-based classification,
classification by backpropagation, support vector machines, and classification based
on association rule mining—are all examples of eager learners.
Eager learners - when given a set of training tuples, will construct a classification
model before receiving new tuples to classify.
Lazy Learners
o In a lazy approach, for a given training tuple, a lazy learner simply stores it (or does
only a little minor processing) and waits until a test tuple is given. Only when it sees the
test tuple does it perform generalization, in order to classify the tuple based on its
similarity to the stored training tuples.
o Lazy learners do less work when a training tuple is presented and more work when
making a classification or prediction. Because lazy learners store the training tuples
or "instances," they are also referred to as instance-based learners, even though all
learning is essentially based on instances.
"Closeness" between tuples is defined in terms of a distance metric, such as Euclidean
distance.
Nearest neighbour classifiers can also be used for prediction, that is, to return a real-
valued prediction for a given unknown tuple. In this case, the classifier returns the
average value of the real-valued labels associated with the k nearest neighbours of
the unknown tuple.
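A minimal k-nearest-neighbour sketch covering both uses described above: majority voting for classification and averaging the k neighbours' real-valued labels for numeric prediction. The small data arrays are hypothetical.

import numpy as np
from collections import Counter

def k_nearest(X, query, k):
    """Indices of the k training tuples closest to the query (Euclidean)."""
    distances = np.linalg.norm(X - query, axis=1)
    return np.argsort(distances)[:k]

def knn_classify(X, labels, query, k=3):
    idx = k_nearest(X, query, k)
    return Counter(labels[i] for i in idx).most_common(1)[0][0]

def knn_predict(X, values, query, k=3):
    idx = k_nearest(X, query, k)
    return float(np.mean([values[i] for i in idx]))   # average of the k labels

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.5, 4.5]])  # hypothetical
labels = ['A', 'A', 'B', 'B']
values = np.array([10.0, 12.0, 50.0, 55.0])

print(knn_classify(X, labels, np.array([1.1, 1.0]), k=3))   # 'A'
print(knn_predict(X, values, np.array([5.2, 4.8]), k=2))    # 52.5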
When given a new case to classify, a case-based reasoner will first check if an
identical training case exists. If one is found, then the associated solution to that
case is returned. If no identical case is found, then the case-based reasoner will
search for training cases having components that are similar to those of the new
case.
Ideally, these training cases may be considered as neighbours of the new case. If
cases are represented as graphs, this involves searching for subgraphs that are
similar to subgraphs within the new case. The case-based reasoner tries to combine
the solutions of the neighbouring training cases in order to propose a solution for the
new case.
Challenges in case-based reasoning include:
Finding a good similarity metric and suitable methods for combining solutions.
The selection of salient features for indexing training cases and the development
of efficient indexing techniques.
Striking a balance between accuracy and efficiency as the number of stored
cases becomes very large. As this number increases, the case-based reasoner
becomes more intelligent. After a certain point, however, the efficiency of the
system will suffer, as the time required to search for and process relevant cases
increases.
1. Genetic Algorithms
2. Rough Set Approach
3. Fuzzy Set Approaches
1. Genetic Algorithms
o If an attribute has k > 2 values, k bits can be used to encode the attribute’s
values
Based on the notion of survival of the fittest, a new population is formed to consist of
the fittest rules and their offspring
The fitness of a rule is represented by its classification accuracy on a set of training
examples
Offspring are generated by crossover and mutation
The process continues until a population P evolves when each rule in P satisfies a
pre-specified threshold
Slow but easily parallelizable
o Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of
membership (such as using fuzzy membership graph)
1. Linear Regression
2. Nonlinear Regression
3. Other Regression-Based Methods
Numeric prediction is the task of predicting continuous values for given input, e.g., to
predict the salary of an employee with 10 years of work experience, or the sales of a new
product.
The predictor variables are the attributes of the tuple. In general, the values of the
predictor variables are known; the response variable is unknown, and it is what we predict.
1. Linear Regression
Straight-line regression models a response variable, y, as a linear function of a single
predictor variable, x:

y = w0 + w1 x

The regression coefficients can be estimated using the method of least squares:

w1 = Σi (xi - x̄)(yi - ȳ) / Σi (xi - x̄)²,    w0 = ȳ - w1 x̄

where x̄ is the mean value of x1, x2, ….. , x|D|, and ȳ is the mean value of y1, y2, ….,
y|D|.
Example. Straight-line regression using the method of least squares. The table shows a set of
paired data where x is the number of years of work experience of an employee and y is
the corresponding salary of the employee.
The 2-D data can be graphed on a scatter plot, as in Figure. The plot suggests a linear
relationship between the two variables, x and y.
We model the relationship that salary may be related to the number of years of work
experience with the equation y = w0 + w1 x.
Given the above data, we compute x̄ = 9.1 and ȳ = 55.4. Substituting these values into
the least-squares equations, we get w1 ≈ 3.5 and w0 ≈ 55.4 − (3.5)(9.1) = 23.6.
Thus, the equation of the least squares line is estimated by y = 23.6 + 3.5x. Using this
equation, we can predict that the salary of a college graduate with, say, 10 years of
experience is $58,600.
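The least-squares computation can be checked with a few lines of Python. The (years, salary) pairs below are assumptions, since the data table itself is not reproduced here; they were chosen so that x̄ = 9.1 and ȳ = 55.4 as stated in the text.

# Hypothetical (years of experience, salary in $1000s) pairs with the same
# means as in the text: x_bar = 9.1, y_bar = 55.4
data = [(3, 30), (8, 57), (9, 64), (13, 72), (3, 36),
        (6, 43), (11, 59), (21, 90), (1, 20), (16, 83)]

x_bar = sum(x for x, _ in data) / len(data)
y_bar = sum(y for _, y in data) / len(data)

w1 = sum((x - x_bar) * (y - y_bar) for x, y in data) / \
     sum((x - x_bar) ** 2 for x, _ in data)
w0 = y_bar - w1 * x_bar

# The text rounds w1 to 3.5 before computing w0, which gives w0 ~ 23.6.
print(f'y = {w0:.1f} + {w1:.1f} x')
print(f'predicted salary at x = 10: {w0 + w1 * 10:.1f} (in $1000s)')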
2. Nonlinear Regression
Other functions, such as power function, can also be transformed to linear model.
Some models are intractably nonlinear (e.g., sums of exponential terms), but it is still
possible to obtain least-squares estimates through extensive calculation on more
complex formulae.
Regression and model trees tend to be more accurate than linear regression when
the data are not represented well by a simple linear model
University Questions
Unit 4
Part A
1. What is meant by market Basket analysis?
2. What is the use of multilevel association rules?
3. What is meant by pruning in a decision tree induction?
4. Write the two measures of Association Rule.
5. With an example explain correlation analysis.
6. How are association rules mined from large databases?
7.What is tree pruning in decision tree induction?
8.What is the use of multi-level association rules?
9. What are the Apriori properties used in the Apriori algorithms?
PART-B
1. Decision tree induction is a popular classification method. Taking one typical decision
tree induction algorithm , briefly outline the method of decision tree classification. [16]
2. Consider the following training dataset and the original decision tree induction
algorithm (ID3). Risk is the class label attribute. The Height values have been already
discredited into disjoint ranges. Calculate the information gain if Gender is chosen as
the test attribute. Calculate the information gain if Height is chosen as the test attribute.
Draw the final decision tree (without any pruning) for the training dataset. Generate all
the “IF-THEN rules from the decision tree.
Gender   Height       Risk
F        (1.5, 1.6)   Low
M        (1.9, 2.0)   High
F        (1.8, 1.9)   Medium
F        (1.8, 1.9)   Medium
F        (1.6, 1.7)   Low
M        (1.8, 1.9)   Medium
F        (1.5, 1.6)   Low
M        (1.6, 1.7)   Low
M        (2.0, 8)     High
M        (2.0, 8)     High
F        (1.7, 1.8)   Medium
M        (1.9, 2.0)   Medium
F        (1.8, 1.9)   Medium
F        (1.7, 1.8)   Medium
F        (1.7, 1.8)   Medium  [16]
(ii) Find all the association rules that involve only B, C.H (in either leftor right hand side
of the rule). The minimum confidence is 70%. [7]
4. (a)Explain the algorithm for constructing a decision tree from training samples [12]
(b)Explain Bayes theorem. [4]
6. Discuss the approaches for mining multi level association rules from the transactional
databases. Give relevant example. [16]
7. Write and explain the algorithm for mining frequent item sets without candidate
generation. Give relevant example. [16]
UNIT V
Clustering is the process of grouping physical or conceptual data objects into clusters.
Cluster analysis is the process of analyzing the various clusters to organize the
different objects into meaningful and descriptive groups.
3. What are the fields in which clustering techniques are used? [CO5-L2]
Clustering techniques are used in fields such as pattern recognition, spatial data analysis,
image processing, economic science (especially market research), and World Wide Web
document classification.
4. What are the requirements of cluster analysis? [CO5-L2]
The basic requirements of cluster analysis are • Dealing with different types of
attributes. • Dealing with noisy data. • Constraints on clustering. • Dealing with arbitrary
shapes. • High dimensionality • Ordering of input data • Interpretability and usability •
Determining input parameter and • Scalability
5.What are the different types of data used for cluster analysis? [CO5-L2]
The different types of data used for cluster analysis are interval scaled, binary, nominal,
ordinal and ratio scaled data.
Interval-scaled variables are continuous measurements on a linear scale, for example,
height and weight, weather temperature, or coordinates of a cluster. Distances between
these measurements can be calculated using the Euclidean or Minkowski distance.
7. Define Binary variables? And what are the two types of binary variables? [CO5-
L2]
A binary variable has only two states, 0 and 1; when the state is 0 the variable is
absent and when the state is 1 the variable is present. There are two types of binary
variables, symmetric and asymmetric binary variables. Symmetric binary variables are
those whose states carry the same weight (both states are equally valuable). Asymmetric
binary variables are those whose states do not carry the same weight.
A nominal variable is a generalization of the binary variable. Nominal variable has more
than two states, For example, a nominal variable, color consists of four states, red,
green, yellow, or black. In Nominal variables the total number of states is N and it is
denoted by letters, symbols or integers. An ordinal variable also has more than two
states but all these states are ordered in a meaningful sequence. A ratio scaled variable
makes positive measurements on a non-linear scale, such as an exponential scale, following
the formula Ae^(Bt) or Ae^(-Bt), where A and B are positive constants.
Hierarchical method groups all the objects into a tree of clusters that are arranged in a
hierarchical order. This method works on bottom-up or top-down approaches.
Density based method deals with arbitrary shaped clusters. In density-based method,
clusters are formed on the basis of the region where the density of the objects is high.
In this method objects are represented by the multi resolution grid data structure. All the
objects are quantized into a finite number of cells and the collection of cells build the
grid structure of objects. The clustering operations are performed on that grid structure.
This method is widely used because its processing time is very fast and is
independent of the number of data objects.
It is a grid based multi resolution clustering method. In this method all the objects are
represented by a multidimensional grid structure and a wavelet transformation is applied
for finding the dense region. Each grid cell contains the information of the group of
objects that map into a cell. A wavelet transformation is a signal processing technique
that decomposes a signal into different frequency sub-bands.
Model-based methods are used to optimize the fit between a given data set and a
mathematical model. This method uses the assumption that the data are generated by a
mixture of underlying probability distributions. The two basic approaches used in this
method are the statistical approach and the neural network approach.
Regression can be used to solve classification problems, but it can also be used for
applications such as forecasting. Regression can be performed using many different
types of techniques; in essence, regression takes a set of data and fits the data to a
formula.
22. What are the reasons for not using the linear regression model to estimate the
output data? [CO5-L2]
There are many reasons for that. One is that the data do not fit a linear model. It is
possible, however, that the data generally do represent a linear model, but the
linear model generated is poor because noise or outliers exist in the data. Noise is
erroneous data and outliers are data values that are exceptions to the usual and
expected data.
23. What are the two approaches used by regression to perform classification?
[CO5-L2]
Instead of fitting the data to a straight line, logistic regression uses a logistic curve. The
formula for the univariate logistic curve is

p = e^(c0 + c1 x1) / (1 + e^(c0 + c1 x1))

The logistic curve gives a value between 0 and 1, so it can be interpreted as the probability
of class membership.
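The logistic curve from the formula above can be sketched in a couple of lines; the coefficients c0 and c1 below are arbitrary placeholder values.

from math import exp

def logistic(x1, c0=-4.0, c1=1.5):
    """p = e^(c0 + c1*x1) / (1 + e^(c0 + c1*x1)), always between 0 and 1."""
    z = c0 + c1 * x1
    return exp(z) / (1 + exp(z))

for x1 in (0, 2, 4, 6):
    print(x1, round(logistic(x1), 3))   # interpreted as probability of class membership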
A time series is a set of attribute values over a period of time. Time Series Analysis may
be viewed as finding patterns in the data and predicting future values.
Part B
UNIT V CLUSTERING AND APPLICATIONS AND TRENDS IN DATA MINING 8
1. Cluster Analysis
2. Types of Data
1) Partitioning Methods
2) Hierarchical Methods
3) Density-Based Methods
4) Grid Based Methods
5) Model-Based Clustering Methods
6) Clustering High Dimensional Data
7) Constraint Based Cluster Analysis
8) Outlier Analysis
1. Briefly explain all the Cluster Analysis concepts with suitable examples [CO5-
H1]
Cluster. A cluster is a collection of data objects that are similar to one another within
the same cluster and are dissimilar to the objects in other clusters.
Clustering. The process of grouping a set of physical or abstract objects into classes of
similar objects is called clustering.
Pattern Recognition
Spatial Data Analysis
o Create thematic maps in Geographical information system by clustering
feature spaces
o Detect spatial clusters or for other spatial mining tasks
Image Processing
Economic Science (especially market research)
World Wide Web
o Document classification
o Cluster Weblog data to discover groups of similar access patterns
Suppose that a data set to be clustered contains n objects, which may represent
persons, houses, documents, countries, and so on. The two data structures are used.
1). The most popular distance measure is Euclidean distance, which is defined as

d(i, j) = sqrt( (xi1 - xj1)² + (xi2 - xj2)² + … + (xin - xjn)² )

Another well-known measure is Manhattan (city block) distance, defined as

d(i, j) = |xi1 - xj1| + |xi2 - xj2| + … + |xin - xjn|

Both the Euclidean distance and Manhattan distance satisfy the following mathematical
requirements of a distance function:
o d(i, j) ≥ 0 (non-negativity)
o d(i, i) = 0 (identity)
o d(i, j) = d(j, i) (symmetry)
o d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
2. Binary variables
The dissimilarity between two objects i and j can be computed based on the ratio of
mismatches:

d(i, j) = (p - m) / p

where m is the number of matches (i.e., the number of variables for which i and j are
in the same state), and p is the total number of variables.
Consider only the object-identifier and the test-1 column to compare the categorical
variable values. By using the above equation, we get the dissimilarity between each pair
of objects.
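Both the mismatch-ratio dissimilarity just defined and the Euclidean/Manhattan distances can be written compactly in Python; the example objects and values below are hypothetical.

import numpy as np

def mismatch_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p, where m is the number of matching variables."""
    p = len(obj_i)
    m = sum(a == b for a, b in zip(obj_i, obj_j))
    return (p - m) / p

def euclidean(x, y):
    return float(np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2)))

def manhattan(x, y):
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y))))

# Hypothetical objects
print(mismatch_dissimilarity(['code-A', 'yes', 'fair'],
                             ['code-B', 'yes', 'fair']))   # 1/3 ~ 0.33
print(euclidean([1, 2], [4, 6]))                           # 5.0
print(manhattan([1, 2], [4, 6]))                           # 7.0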
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
The values of an ordinal variable can be mapped to ranks. For example, suppose
that an ordinal variable f has Mf states. These ordered states define the ranking 1,
….., Mf .
o From above table consider only the object-identifier and the continuous ordinal
variable, test-2, are available. There are 3 states for test-2, namely fair, good, and
excellent, that is Mf =3.
o For step 1, if we replace each value for test-2 by its rank, the four objects are
assigned the ranks 3, 1, 2, and 3, respectively.
o Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to
1.0.
o For step 3, we can use, say, the Euclidean distance (Equation (7.5)), which results in
the following dissimilarity matrix:
4. Ratio-scaled variables: A ratio-scaled variable makes a positive measurement on a
nonlinear scale, such as an exponential scale, approximately following the formula
Ae^(Bt) or Ae^(-Bt), where A and B are positive constants, and t typically represents time.
E.g., the growth of a bacteria population, the decay of a radioactive element.
o This time, from the above table consider only the object-identifier and the ratio-
scaled variable, test-3, are available.
o Logarithmic transformation of the log of test-3 results in the values 2.65, 1.34, 2.21,
and 3.08 for the objects 1 to 4, respectively.
o Using the Euclidean distance on the transformed values, we obtain the following
dissimilarity matrix:
5. Vector objects: To compare two vector objects, x and y, a cosine similarity measure is
often used,

s(x, y) = (x^t · y) / (||x|| ||y||)

where x^t is a transposition of vector x, ||x|| is the Euclidean norm of vector x, and s is
essentially the cosine of the angle between vectors x and y.
A grid-based method first quantizes the object space into a finite number of cells that
form a grid structure, and then performs clustering on the grid structure. STING is a
typical example of a grid-based method based on statistical information stored in grid
cells. WaveCluster and CLIQUE are two clustering algorithms that are both grid based
and density-based.
A model-based method hypothesizes a model for each of the clusters and finds the
best fit of the data to that model. Examples of model-based clustering include the EM
algorithm (which uses a mixture density model), conceptual clustering (such as
COBWEB), and neural network approaches (such as self-organizing feature maps).
One person’s noise could be another person’s signal. Outlier detection and analysis
are very useful for fraud detection, customized marketing, medical analysis, and many
other tasks. Computer-based outlier analysis methods typically follow either a statistical
distribution-based approach, a distance-based approach, a density-based local outlier
detection approach, or a deviation-based approach.
Partitioning Methods
Given D, a data set of n objects, and k, the number of clusters to form, a partitioning
algorithm organizes the objects into k partitions (k ≤ n), where each partition represents
a cluster. The commonly used partitioning methods are (i). k-means, (ii). k-medoids.
o k-means. where each cluster’s center is represented by the mean value of the
objects in the cluster. i.e Each cluster is represented by the center of the cluster.
o Algorithm
Input:
k: the number of clusters,
D: a data set containing n objects.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on
the mean value of the objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the objects for each
cluster;
(5) until no change;
[Figure: pair of 2-D scatter plots (axes 0 to 10) illustrating how k-means iteratively
reassigns objects to the nearest cluster mean and updates the cluster means.]
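As a hedged illustration of the k-means steps listed above, the following sketch uses scikit-learn's KMeans (assuming scikit-learn is available; the 2-D points are made up).

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D objects to be partitioned into k = 2 clusters
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print('cluster means :\n', km.cluster_centers_)   # step (4): updated means
print('assignments   :', km.labels_)              # step (3): nearest-mean labels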
Input:
k: the number of clusters,
D: a data set containing n objects.
Method:
(1) arbitrarily choose k objects in D as the initial representative objects or seeds;
(2) repeat
(3) assign each remaining object to the cluster with the nearest representative object;
(4) randomly select a nonrepresentative object, Orandom;
(5) compute the total cost, S, of swapping representative object, Oj, with Orandom;
(6) if S < 0 then swap Oj with Orandom to form the new set of k representative objects;
(7) until no change
PAM works efficiently for small data sets but does not scale well for large data sets.
A hierarchical clustering method works by grouping data objects into a tree of clusters.
Hierarchical clustering methods can be further classified as either agglomerative or
divisive,
depending on whether the hierarchical decomposition is formed in a bottom-up
(merging) or top-down (splitting) fashion.
Decompose data objects into several levels of nested partitionings (a tree of clusters),
called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired
level, then each connected component forms a cluster.
BIRCH (1996):
Birch: Balanced Iterative Reducing and Clustering using Hierarchies
Scales linearly: finds a good clustering with a single scan and improves the quality
with a few additional scans
Weakness: handles only numeric data, and sensitive to the order of the data record.
o C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e},
{b, c, d},
{b, c, e}, {b, d, e}, {c, d, e}
o C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
o The Jaccard co-efficient may lead to a wrong clustering result:
o C1: the coefficient ranges from 0.2 ({a, b, c}, {b, d, e}) to 0.5 ({a, b, c}, {a, b, d})
o C1 & C2: the coefficient could be as high as 0.5 ({a, b, c}, {a, b, f})
o Jaccard co-efficient-based similarity function:

sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
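The Jaccard figures quoted above (0.2 and 0.5) can be verified directly with a short sketch:

def jaccard(t1, t2):
    """sim(T1, T2) = |T1 intersect T2| / |T1 union T2|."""
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)

print(jaccard({'a', 'b', 'c'}, {'b', 'd', 'e'}))   # 0.2  (within C1)
print(jaccard({'a', 'b', 'c'}, {'a', 'b', 'd'}))   # 0.5  (within C1)
print(jaccard({'a', 'b', 'c'}, {'a', 'b', 'f'}))   # 0.5  (across C1 and C2)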
Major features:
o Discover clusters of arbitrary shape
o Handle noise
o One scan
o Need density parameters as termination condition
Methods (1). DBSCAN (2).OPTICS (3).DENCLUE
DBSCAN searches for clusters by checking the ε-neighborhood of each point in the
database. If the ε-neighborhood of a point p contains more than MinPts points, a new
cluster with p as a core object is created.
DBSCAN then iteratively collects directly density-reachable objects from these core
objects, which may involve the merge of a few density-reachable clusters. The
process terminates when no new point can be added to any cluster.
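A hedged scikit-learn sketch of the ε-neighborhood/MinPts procedure just described (scikit-learn is assumed to be available; eps and min_samples correspond to ε and MinPts, and the points are made up).

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense hypothetical blobs plus one isolated (noise) point
X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],
              [8.0, 8.0], [8.2, 8.1], [7.9, 8.2],
              [50.0, 50.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)   # eps = ε, min_samples = MinPts
print(db.labels_)   # cluster ids per point; -1 marks noise (the isolated point)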
OPTICS computes an augmented cluster ordering for automatic and interactive cluster
analysis. The cluster ordering can be used to extract basic clustering information, such
as cluster centers or arbitrary-shaped clusters, as well as to provide the intrinsic
clustering structure.
For example, in above Figure is the reachability plot for a simple two-dimensional data
set, which presents a general overview of how the data are structured and clustered.
The data objects are plotted in cluster order (horizontal axis) together with their
respective reachability-distance (vertical axis). The three Gaussian “bumps” in the plot
reflect three clusters in the data set.
3). DENCLUE (DENsity-based CLUstEring)
Clustering Based on Density Distribution Functions
(1) the influence of each data point can be formally modeled using a mathematical
function called an influence function, which describes the impact of a data point within
its neighborhood;
(2) the overall density of the data space can be modeled analytically as the sum of the
influence function applied to all data points.
Advantages
WaveCluster first summarizes the data by imposing a multidimensional grid structure onto
the data space. It then uses a wavelet transformation to transform the original feature
space, finding dense regions in the transformed space.
A wavelet transform is a signal processing technique that decomposes a signal into
different frequency subbands.
The wavelet model can be applied to d-dimensional signals by applying a one-
dimensional wavelet transforms d times.
In applying a wavelet transform, data are transformed so as to preserve the relative
distance between objects at different levels of resolution. This allows the natural
clusters in the data to become more distinguishable.
Clusters can then be identified by searching for dense regions in the new domain.
Advantages:
Model-based clustering methods attempt to optimize the fit between the given data and
some mathematical model. Such methods are often based on the assumption that the
data are generated by a mixture of underlying probability distributions.
Typical methods
o Statistical approach
EM (Expectation maximization), AutoClass
o Machine learning approach
COBWEB, CLASSIT
o Neural network approach
SOM (Self-Organizing Feature Map)
An extension to k-means
o Assign each object to a cluster according to a weight (prob. distribution)
o New means are computed based on weighted measures
General idea
o Starts with an initial estimate of the parameter vector
o Iteratively rescores the patterns against the mixture density produced by the
parameter vector
o The rescored patterns are then used to update the parameter estimates
o Patterns belong to the same cluster if they are placed by their scores in a
particular component
The algorithm converges quickly, but it may reach only a local optimum. Each iteration
consists of two steps:
o Expectation step: assign each object to a cluster with a probability (its expected
cluster membership)
o Maximization step: estimation of the model parameters that maximize the expected
likelihood
Conceptual clustering
o A form of clustering in machine learning
o Produces a classification scheme for a set of unlabeled objects
o Finds characteristic description for each concept (class)
COBWEB (Fisher’87)
o A popular and simple method of incremental conceptual learning
o Creates a hierarchical clustering in the form of a classification tree
o Each node refers to a concept and contains a probabilistic description of that
concept
Working method:
o For a given new object, COBWEB decides where to incorporate it into the classification
tree. To do this, COBWEB descends the tree along an appropriate path, updating counts
along the way, in search of the "best host" or node at which to classify the object.
o If the object does not really belong to any of the concepts represented in the tree, it
may be better to create a new node for it. The object is then placed in an
existing class, or a new class is created for it, based on the partition with the highest
category utility value.
Limitations of COBWEB
o The assumption that the attributes are independent of each other is often too
strong because correlation may exist
o Not suitable for clustering large database data – skewed tree and expensive
probability distributions
CLASSIT
o an extension of COBWEB for incremental clustering of continuous data
o suffers similar problems as COBWEB
Partition the data space and find the number of points that lie inside each cell of the
partition.
Identify the subspaces that contain clusters using the Apriori principle
Identify clusters
o Determine dense units in all subspaces of interests
o Determine connected dense units in all subspaces of interests.
Generate minimal description for the clusters
o Determine maximal regions that cover a cluster of connected dense units for
each cluster
o Determination of minimal cover for each cluster.
Fig .Dense units found with respect to age for the dimensions salary and vacation are
intersected in order to provide a candidate search space for dense units of higher
dimensionality.
Strength
o automatically finds subspaces of the highest dimensionality such that high
density clusters exist in those subspaces
o insensitive to the order of records in input and does not presume some
canonical data distribution
o scales linearly with the size of input and has good scalability as the number of
dimensions in the data increases
Weakness
o The accuracy of the clustering result may be degraded for the sake of the
simplicity of the method
Text documents are clustered based on the frequent terms they contain. A term
can be made up of a single word or several words. Terms are then extracted.
A stemming algorithm is then applied to reduce each term to its basic stem. In
this way, each document can be represented as a set of terms. Each set is
typically large. Collectively, a large set of documents will contain a very large set
of different terms.
Advantage: It automatically generates a description for the generated clusters in
terms of their frequent term sets.
o Figure.1 shows a fragment of microarray data containing only three genes (taken as
“objects” ) and ten attributes (columns a to j ).
o However, if two subsets of attributes, {b, c, h, j, e} and { f , d, a, g, i}, are selected
and plotted as in Figure. 2 (a) and (b) respectively,
o Figure. 2(a) forms a shift pattern, where the three curves are similar to each other
with respect to a shift operation along the y-axis.
o Figure.2(b) forms a scaling pattern, where the three curves are similar to each other
with respect to a scaling operation along the y-axis.
Fig: Raw data from a fragment of microarray data containing only 3 objects and 10
attributes
Clustering with obstacle objects using a partitioning approach requires that the distance
between each object and its corresponding cluster center be re-evaluated at each
iteration whenever the cluster center is changed.
e.g A city may have rivers, bridges, highways, lakes, and mountains. We do not want to
swim across a river to reach an ATM.
Fig(a) :First, a point, p, is visible from another point, q, in Region R, if the straight line
joining p and q does not intersect any obstacles.
The shortest path between two points, p and q, will be a subpath of VG’ as shown in
Figure (a).
We see that it begins with an edge from p to either v1, v2, or v3, goes through some
path in VG, and then ends with an edge from either v4 or v5 to q.
Fig.(b).To reduce the cost of distance computation between any two pairs of objects,
microclusters techniques can be used. This can be done by first triangulating the region
R into triangles, and then grouping nearby points in the same triangle into microclusters,
as shown in Figure (b).
After that, precomputation can be performed to build two kinds of join indices based on
the shortest paths:
o VV index: indices for any pair of obstacle vertices
o MV index: indices for any pair of micro-cluster and obstacle indices
e.g., A parcel delivery company with n customers would like to determine locations
for k service stations so as to minimize the traveling distance between customers
and service stations.
The company’s customers are considered as either high-value customers (requiring
frequent, regular services) or ordinary customers (requiring occasional services).
The manager has specified two constraints: each station should serve (1) at least
100 high-value customers and (2) at least 5,000 ordinary customers.
o Find an initial “solution” by partitioning the data set into k groups and satisfying
user-constraints
o Iteratively refine the solution by micro-clustering relocation (e.g., moving δ μ-
clusters from cluster Ci to Cj) and “deadlock” handling (break the microclusters
when necessary)
o Efficiency is improved by micro-clustering
Data objects which are totally different from or inconsistent with the remaining set of
data, are called outliers. Outliers can be caused by measurement or execution error
E.g The display of a person’s age as 999.
Outlier detection and analysis is an interesting data mining task, referred to as outlier
mining.
Applications:
o Fraud Detection (Credit card, telecommunications, criminal activity in e-
Commerce)
o Customized Marketing (high/low income buying habits)
o Medical Treatments (unusual responses to various drugs)
o Analysis of performance statistics (professional athletes)
o Weather Prediction
o Financial Applications (loan approval, stock tracking)
Working hypothesis
A working hypothesis, H, is a statement that the entire data set of n objects comes from
an initial distribution model, F, that is,

H : oi ∈ F, where i = 1, 2, …, n

A discordancy test verifies whether an object, oi, is significantly large (or small) in
relation to the distribution F.
Alternative hypothesis.
An alternative hypothesis, H, which states that oi comes from another distribution
model, G, is adopted
There are different kinds of alternative distributions.
o Inherent alternative distribution
o Mixture alternative distribution
o Slippage alternative distribution
An object, O, in a data set, D, is a distance-based (DB) outlier with parameters pct and
dmin, that is, a DB(pct;dmin)-outlier, if at least a fraction, pct, of the objects in D lie at a
distance greater than dmin from O.
Index-based algorithm
Given a data set, the index-based algorithm uses multidimensional indexing structures,
such as R-trees or k-d trees, to search for neighbours of each object o within radius
dmin around that object.
o Nested-loop algorithm
This algorithm avoids index structure construction and tries to minimize the number of
I/Os. It divides the memory buffer space into two halves and the data set into several
logical blocks. I/O efficiency can be achieved by choosing the order in which blocks are
loaded into each half.
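A naive (nested-loop style) check of the DB(pct, dmin) definition above can be written for small data sets; the points and the parameter values below are hypothetical.

import numpy as np

def db_outliers(X, pct, dmin):
    """Return indices of DB(pct, dmin) outliers: objects for which at least a
    fraction pct of the remaining objects lie farther away than dmin."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    outliers = []
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        far_fraction = np.sum(dists > dmin) / (n - 1)   # exclude the object itself
        if far_fraction >= pct:
            outliers.append(i)
    return outliers

# Hypothetical data: a tight cluster plus one far-away object
X = [[1, 1], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0], [10, 10]]
print(db_outliers(X, pct=0.9, dmin=3.0))   # -> [4]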
Techniques
o Sequential Exception Technique
o OLAP Data Cube Technique
Dissimilarity function: It is any function that, if given a set of objects, returns a low
value if the objects are similar to one another. The greater the dissimilarity among
the objects, the higher the value returned by the function.
Cardinality function: This is typically the count of the number of objects in a given
set.
Smoothing factor: This function is computed for each subset in the sequence. It
assesses how much the dissimilarity can be reduced by removing the subset from
the original set of objects.
Financial data collected in banks and financial institutions are often relatively
complete, reliable, and of high quality
Design and construction of data warehouses for multidimensional data analysis and
data mining
o View the debt and revenue changes by month, by region, by sector, and by
other factors
o Access statistical information such as max, min, total, average, trend, etc.
Loan payment prediction/consumer credit policy analysis
o feature selection and attribute relevance ranking
o Loan payment performance
o Consumer credit rating
Classification and clustering of customers for targeted marketing
o multidimensional segmentation by nearest-neighbor, classification, decision
trees, etc. to identify customer groups or associate a new customer to an
appropriate customer group
Detection of money laundering and other financial crimes
o integration of data from multiple DBs (e.g., bank transactions, federal/state crime
history DBs)
o Tools: data visualization, linkage analysis, classification, clustering tools,
outlier analysis, and sequential pattern analysis tools (find unusual access
sequences)
Retail industry: huge amounts of data on sales, customer shopping history, etc.
Applications of retail data mining
o Identify customer buying behaviors
o Discover customer shopping patterns and trends
o Improve the quality of customer service
o Achieve better customer retention and satisfaction
o Enhance goods consumption ratios
o Design more effective goods transportation and distribution policies
Examples
A rapidly expanding and highly competitive industry and a great demand for data
mining
o Understand the business involved
o Identify telecommunication patterns
o Catch fraudulent activities
o Make better use of resources
o Improve the quality of service
The following are a few scenarios for which data mining may improve
telecommunication services
DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C),
guanine (G), and thymine (T).
Gene: a sequence of hundreds of individual nucleotides arranged in a particular
order
Humans have around 30,000 genes
Tremendous number of ways that the nucleotides can be ordered and sequenced to
form distinct genes
Data mining may contribute to biological data analysis in the following aspects
Vast amounts of data have been collected from scientific domains (including
geosciences, astronomy, and meteorology) using sophisticated telescopes,
multispectral high-resolution remote satellite sensors, and global positioning
systems.
Large data sets are being generated due to fast numerical simulations in various
fields, such as climate and ecosystem modeling, chemical engineering, fluid
dynamics, and structural mechanics.
The security of our computer systems and data is at constant risk. The extensive
growth of the Internet and increasing availability of tools and tricks for interrupting
and attacking networks have prompted intrusion detection to become a critical
component of network administration.
An intrusion can be defined as any set of actions that threaten the integrity,
confidentiality, or availability of a network resource .
The following are areas in which data mining technology is applied or further developed
for intrusion detection:
UNIT-V
University Questions
PART A
1. What are the requirements of clustering?
2. What are the applications of spatial data bases?
3. What is text mining?
4. Distinguish between classification and clustering.
5. Define a Spatial database.
7. What is the objective function of K-means algorithm?
8. Mention the advantages of Hierarchical clustering.
9. What is an outlier? Give example.
10. What is audio data mining?
11. List two application of data mining.
PART-B
1. BIRCH and CLARANS are two interesting clustering algorithms that perform effective
clustering in large data sets.
(i) Outline how BIRCH performs clustering in large data sets. [10] (ii) Compare and
outline the major differences of the two scalable clustering algorithms BIRCH and
CLARANS. [6]
2. Write a short note on web mining taxonomy. Explain the different activities of text
mining.
3. Discuss and elaborate the current trends in data mining. [6+5+5]
4. Discuss spatial data bases and Text databases [16]
5. What is a multimedia database? Explain the methods of mining multimedia
database? [16]