DATA MINING
TEXT BOOKS:
1. Alex Berson and Stephen J. Smith, "Data Warehousing, Data Mining and OLAP", Tata McGraw-Hill Edition, Thirteenth Reprint 2008.
2. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Third Edition, Elsevier, 2012.
REFERENCES:
1. Pang-Ning Tan, Michael Steinbach and Vipin Kumar, "Introduction to Data Mining", Pearson Education, 2007.
2. K.P. Soman, Shyam Diwakar and V. Ajay, "Insight into Data Mining Theory and Practice", Eastern Economy Edition, Prentice Hall of India, 2006.
3. G. K. Gupta, "Introduction to Data Mining with Case Studies", Eastern Economy Edition, Prentice Hall of India, 2006.
4. Daniel T. Larose, "Data Mining Methods and Models", Wiley-Interscience, 2006.
UNIT I DATA WAREHOUSING
Data Warehousing Components – Building a Data Warehouse – Mapping the Data Warehouse to a Multiprocessor Architecture – DBMS Schemas for Decision Support – Data Extraction, Cleanup, and Transformation Tools – Metadata.
1.1 INTRODUCTION
1.1.1 Data Warehouse
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection
of data. It is a central repository of integrated data from one or more sources.
A data warehouse is used to:
Store current and historical data.
Create analytical reports for knowledge workers.
Support informed decision making in an organization.
Subject Oriented
A data warehouse is subject oriented because it provides information around a subject
rather than the organization's ongoing operations. These subjects can be product, customers,
suppliers, sales, revenue, etc. A data warehouse does not focus on the ongoing operations, rather
it focuses on modeling and analysis of data for decision making.
Integrated
A data warehouse is constructed by integrating data from heterogeneous sources such as
relational databases, flat files, etc., to enhance the effective analysis of data.
Time Variant
The data collected in a data warehouse is identified with a particular time period. The data in
a data warehouse provides information from a historical point of view.
Non-volatile
The historical data in a data warehouse is kept separate from the operational database, and
therefore frequent changes in the operational database do not affect the data in the data warehouse.
1.1.2 Online Transaction Processing (OLTP) vs Online Analytical Processing (OLAP)
OLAP
Online Analytical Processing (OLAP) deals with historical or archival data. OLAP
is a powerful technology for analyzing data, with capabilities for data discovery, data
reporting and complex analytical calculations. The main component of an OLAP
system is the data cube. A data cube is constructed by combining data warehouse structures such as
facts and dimensions; merging all the cubes creates a multidimensional data warehouse.
OLTP
Online Transaction Processing (OLTP) is a tool capable of supporting transaction-oriented
data over the internet. OLTP monitors the day-to-day transactions of an organization and
supports transaction-oriented applications, commonly in a 3-tier architecture. Data from OLTP
systems are collected over a period of time and stored in a very large database called a data
warehouse. Data warehouses are highly optimized for read (SELECT) operations. Transactional
data are extracted from multiple OLTP sources and pre-processed to make them compatible with
the data warehouse data format.
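As a rough sketch of this extract-cleanup-load flow (hypothetical table and column names, using an in-memory SQLite database), the snippet below standardizes two OLTP-style sources with different conventions into a single warehouse table; real deployments would use dedicated ETL tools and bulk loaders.

```python
# Minimal ETL sketch (hypothetical schema): pull transactional rows from two
# OLTP-style sources, clean/standardize them, and load them into one
# warehouse table optimized for read-only analysis.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two "OLTP" sources with slightly different conventions.
cur.execute("CREATE TABLE pos_sales (sold_on TEXT, product TEXT, amount REAL)")
cur.execute("CREATE TABLE web_sales (order_date TEXT, item TEXT, total_cents INTEGER)")
cur.executemany("INSERT INTO pos_sales VALUES (?, ?, ?)",
                [("2023-01-05", "printer", 120.0), ("2023-01-06", "PC", 900.0)])
cur.executemany("INSERT INTO web_sales VALUES (?, ?, ?)",
                [("05/01/2023", "Printer", 11000), ("07/01/2023", "pc", 85000)])

# Target warehouse table: one row per sale, uniform units and formats.
cur.execute("CREATE TABLE sales_fact (sale_date TEXT, product TEXT, amount REAL)")

def clean_product(name):
    return name.strip().lower()          # standardize spelling and case

# Extract + transform + load from each source.
for sold_on, product, amount in cur.execute("SELECT * FROM pos_sales").fetchall():
    cur.execute("INSERT INTO sales_fact VALUES (?, ?, ?)",
                (sold_on, clean_product(product), amount))

for order_date, item, cents in cur.execute("SELECT * FROM web_sales").fetchall():
    d, m, y = order_date.split("/")       # reformat DD/MM/YYYY -> ISO date
    cur.execute("INSERT INTO sales_fact VALUES (?, ?, ?)",
                (f"{y}-{m}-{d}", clean_product(item), cents / 100.0))

conn.commit()
print(cur.execute("SELECT * FROM sales_fact ORDER BY sale_date").fetchall())
```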
OLAP Example 1: If we collect the last 10 years of data about flight reservations, the data can
give us much meaningful information, such as trends in reservations. This may reveal useful
information such as the peak time of travel and what kinds of people travel in the various classes
(Economy/Business).
OLTP versus OLAP:
Space: OLTP stores operational data, typically small. OLAP stores large data sets of historical information; large storage is needed.
Schema: OLTP uses normalized schemas with many tables and relationships. OLAP uses star, snowflake and constellation schemas with fewer, non-normalized tables.
Data refresh: OLTP performs insert, update and delete operations, which are fast and give immediate results. OLAP refreshes data with huge data sets; this takes time and is sporadic.
Speed: OLTP is fast and requires some indexes on large tables. OLAP is slower, depending on the amount of data, and requires more indexes.
Data model: OLTP uses an entity-relationship model on databases. OLAP uses one- or multi-dimensional data.
Horizon: OLTP covers day-to-day data over weeks or months. OLAP covers long-term historical data.
Data heterogeneity: holding the same name for different attributes. (This occurs when data from
disparate data sources are presented to the user with a unified interface.)
1.2.3 Metadata
Metadata is used for building, maintaining, managing and using the data warehouse. Metadata
provides the user with easy access to the data, so a metadata interface needs to be created. Metadata
management is provided via a metadata repository and accompanying software that runs on the
workstation.
There are two types of metadata.
Technical metadata.
Business metadata.
Technical metadata contains data to be used by warehouse designers and administrators.
Business metadata contains information that gives users an easy-to-understand perspective of the
information stored in the data warehouse.
An important functional component of the metadata repository is the information directory;
the content of the information directory is the metadata.
Information directory and the metadata repository should ….
Be a gateway to data warehouse environment.
Support easy distribution and replication of its contents.
Be searchable by business-oriented keywords.
Support sharing of information objects.
Support a variety of scheduling options.
Support the distribution of query results.
Support and provide interfaces to other applications.
Support end user monitoring of the status of the data warehouse environment.
A data warehouse, unlike a data mart, deals with multiple subject areas and is typically
implemented and controlled by a central organizational unit such as the corporate IT group as
shown in Figure 1.2. Data marts are small slices of the data warehouse. Whereas data
warehouses have an enterprise-wide depth, the information in data marts pertains to a single
department.
Figure 1.2 Data Warehouse to Data Mart
Dependent data marts are simple to build because clean data has already been loaded into the
central data warehouse; here the ETL process is mostly a matter of identifying the right subset of
data relevant to the chosen data mart subject and moving a copy of it, perhaps in a
summarized form. Independent data marts deal with all aspects of the ETL process,
much like a central data warehouse, although the number of sources is likely to be fewer and the
amount of data associated with the data mart is smaller than in the warehouse, given its focus
on a single subject.
Application development tools can be used when the analytical needs of data warehouse users
increase. In this case organizations need to depend on developing applications based on
proven approaches. Some of the application development platforms are PowerBuilder from
Powersoft, Visual Basic from Microsoft, Forté from Forté Software, and Business Objects.
OLAP Tools
OLAP tools are based on multidimensional databases and allow a sophisticated user to
analyze the data using elaborate, multidimensional, complex views. The main applications of these
tools include product performance, profitability, and effectiveness of sales programs. These
tools assume that the data is organized in a multidimensional database.
Data mining Tools
A success factor for any business is to use information effectively. Knowledge discovery from the
available data is important to formulate business strategies. Data mining is the process of
extracting patterns to build predictive rather than retrospective models, and data mining tools are used
to find hidden relationships. Data mining tools are used to perform the following tasks:
Segmentation
Classification
Association
Preferencing
Data visualization
Data correction
To design a data warehouse, Ralph Kimball proposed a nine-step method.
Dimension Table
Fact Table
Transaction fact tables record facts about a specific event (e.g., sales events)
Snapshot fact tables record facts at a given point in time (e.g., account details at month
end)
Accumulating snapshot tables record aggregate facts at a given point in time (e.g., total
month-to-date sales for a product)
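As a concrete sketch (hypothetical names, in-memory SQLite), the snippet below builds a tiny star schema with a transaction fact table referencing product and date dimension tables, then answers a typical summarized query with a join and aggregation:

```python
# A minimal star schema sketch (hypothetical tables): a transaction fact table
# referencing two dimension tables, queried with a join-and-aggregate.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
-- Transaction fact table: one row per sales event, numeric measures only.
CREATE TABLE fact_sales  (product_key INTEGER, date_key INTEGER,
                          units_sold INTEGER, dollar_amount REAL);
INSERT INTO dim_product VALUES (1, 'Laser Printer', 'Printers'), (2, 'Desktop PC', 'Computers');
INSERT INTO dim_date    VALUES (20230105, '2023-01-05', '2023-01', 2023),
                               (20230210, '2023-02-10', '2023-02', 2023);
INSERT INTO fact_sales  VALUES (1, 20230105, 3, 360.0), (2, 20230105, 1, 900.0),
                               (1, 20230210, 5, 600.0);
""")

# Typical warehouse query: aggregate the facts, described by dimension attributes.
for row in cur.execute("""
    SELECT p.category, d.month, SUM(f.units_sold), SUM(f.dollar_amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_date d    ON f.date_key    = d.date_key
    GROUP BY p.category, d.month"""):
    print(row)
```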
Building a Data Warehouse - Dimension Table:
Dimension tables, as shown in Figure 1.3, have a relatively small number of records
compared to fact tables, but each record may have a very large number of attributes to describe
the fact data. Dimensions can define a wide variety of characteristics, but some of the most
common attributes defined by dimension tables include:
Dimension attributes should be:
Verbose (labels consisting of full words)
Descriptive
Complete (having no missing values)
Discretely valued (having only one value per dimension table row)
Quality assured (having no misspellings or impossible values)
If any dimension occurs in two data marts, they must be exactly the same dimension, or one
must be a mathematical subset of the other. Only in this way can two data marts share one or
more dimensions in the same application. When a dimension is used in more than one data
mart, the dimension is referred to as being conformed.
First, it is often increasingly difficult to source increasingly old data. The older the data,
the more likely there will be problems in reading and interpreting the old files or the old
tapes. Second, it is mandatory that the old versions of the important dimensions be used, not
the most current versions. This is known as the 'slowly changing dimension' problem, which
is described in more detail in the following step.
Two main measures of data warehouse processing performance improvement are:
Throughput: The number of tasks that can be completed within a given time interval as shown in
Figure 1.5(b).
Response time: The amount of time it takes to complete a single task from the time it is
submitted as shown below in 1.5(a).
Types of Speed up
There are three different types of speedup as shown in Figure 1.6 that may occur. They are
Linear speed up: Performance improvement growing linearly with additional resources
Superlinear speed up: Performance improvement growing super linearly with additional
resources
Sublinear speed up: Performance improvement growing sub linearly with additional resources
Scale up
Scale up is defined as the uniprocessor elapsed time on a small system divided by the multiprocessor
elapsed time on a larger system.
Types of Scale up
There are two different types of scale up, as shown in Figure 1.7, that may occur. They are:
Linear scale up: the ability to maintain the same level of performance when both the workload
and the resources are proportionally added.
Sublinear Scale up: The performance of the system decreases when both the workload and the
resources are proportionally added.
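Using the definitions above, speed up and scale up can be computed directly from elapsed times; the timings in the sketch below are made up purely for illustration.

```python
# Speed up = elapsed time on the small system / elapsed time on the large system.
# Scale up = small-system elapsed time on the small problem
#            / large-system elapsed time on the proportionally larger problem.
# All timings below are illustrative only.

def speedup(small_time, large_time):
    return small_time / large_time

def classify(ratio, resource_factor):
    # Compare the achieved ratio against the factor by which resources grew.
    if ratio > resource_factor:
        return "superlinear"
    if ratio == resource_factor:
        return "linear"
    return "sublinear"

t_1cpu, t_4cpu = 100.0, 30.0              # same workload, 1 vs 4 processors
s = speedup(t_1cpu, t_4cpu)
print(f"speed up = {s:.2f}x ({classify(s, 4)})")

t_small, t_large = 100.0, 110.0           # 4x workload on 4x resources
scale_up = t_small / t_large              # 1.0 means linear scale up
print(f"scale up = {scale_up:.2f} ({'linear' if scale_up >= 1 else 'sublinear'})")
```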
Vertical parallelism occurs among different tasks – all component query operations, i.e., scans,
joins, and sorts.
1.4.2.3 Data Partitioning
Data parallelism is parallelization across multiple processors in parallel computing
environments. It focuses on distributing the data across different nodes, which operate on
the data in parallel. It can be applied on regular data structures like arrays and matrices by
working on each element in parallel.
Data partitioning spreads the data across multiple disks either randomly or intelligently. Random
methods include random selection and round-robin placement. Intelligent partitioning methods
include the following (a short sketch follows the list):
Hash partitioning
Key range partitioning
Schema partitioning
User defined partitioning
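The following sketch (made-up customer rows) illustrates round-robin, hash, and key-range placement across a fixed number of partitions; schema and user-defined partitioning differ only in the placement rule used.

```python
# Assign rows to N partitions using round-robin, hash, and key-range placement.
N = 3
rows = [{"cust_id": cid, "region": r}
        for cid, r in [(101, "south"), (205, "north"), (318, "east"),
                       (442, "west"), (577, "south"), (689, "north")]]

# Round robin: the i-th row goes to partition i mod N (even spread).
round_robin = {p: [] for p in range(N)}
for i, row in enumerate(rows):
    round_robin[i % N].append(row["cust_id"])

# Hash partitioning: partition chosen by hashing the partitioning key.
hashed = {p: [] for p in range(N)}
for row in rows:
    hashed[row["cust_id"] % N].append(row["cust_id"])   # simple hash: key mod N

# Key-range partitioning: partition chosen by the range the key falls into.
ranges = [(0, 200), (200, 400), (400, 10**9)]            # boundaries are a design choice
ranged = {p: [] for p in range(N)}
for row in rows:
    for p, (lo, hi) in enumerate(ranges):
        if lo <= row["cust_id"] < hi:
            ranged[p].append(row["cust_id"])
            break

print("round robin:", round_robin)
print("hash       :", hashed)
print("key range  :", ranged)
```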
1.4.3 Parallel Database Architectures
Parallel computers are no longer a monopoly of supercomputers. Parallel database system seeks
to improve performance through parallelization of various operations, such as loading data,
building indexes and evaluating queries. Although data may be stored in a distributed fashion,
the distribution is governed solely by performance considerations. Parallel databases improve
processing and input/output speeds by using multiple CPUs and disks in parallel.
Skew
In skew, as shown below in Figure 1.11, there is unevenness of workload, and load balancing
is one of the critical factors in achieving linear speed up.
Shared-memory architecture
Shared-disk architecture
Shared-nothing architecture
Shared-something architecture
Shared-Nothing Architecture
In a shared-nothing architecture each processor has its own local main memory and disks, and load
balancing becomes difficult. Figure 1.13 represents a shared-nothing architecture.
Figure 1.13 A Shared Nothing Architecture
Shared-Something Architecture
It uses a mixture of shared-memory and shared-nothing architectures. Each node is a shared-
memory architecture connected to an interconnection network in a shared-nothing architecture.
Figure 1.14 represents Cluster of SMP architectures.
Interconnection Networks
Interconnection networks include bus, mesh and hypercube topologies, as shown below in Figure 1.15.
Figure 1.15(a) Bus interconnection Figure 1.15(b) Mesh interconnection network
Interquery parallelism
Intraquery parallelism
Interoperation parallelism
Intraoperation parallelism
Pipeline parallelism
Independent parallelism
Mixed parallelism
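As a rough illustration of pipeline parallelism among query operations, the sketch below streams rows from a scan stage through a filter stage into an aggregation stage, each running in its own thread, with queues standing in for the inter-operator buffers a parallel DBMS would use.

```python
# Pipeline parallelism sketch: scan -> filter -> aggregate, each stage in its
# own thread, streaming rows through bounded queues instead of materializing
# intermediate results.
import threading
import queue

SENTINEL = object()   # marks the end of the stream

def scan(out_q):
    for row in range(1, 1001):          # pretend these are table rows
        out_q.put(row)
    out_q.put(SENTINEL)

def filter_stage(in_q, out_q):
    while True:
        row = in_q.get()
        if row is SENTINEL:
            out_q.put(SENTINEL)
            return
        if row % 2 == 0:                # keep only the even "rows"
            out_q.put(row)

def aggregate(in_q, result):
    total = 0
    while True:
        row = in_q.get()
        if row is SENTINEL:
            result.append(total)
            return
        total += row

q1, q2, result = queue.Queue(maxsize=100), queue.Queue(maxsize=100), []
stages = [threading.Thread(target=scan, args=(q1,)),
          threading.Thread(target=filter_stage, args=(q1, q2)),
          threading.Thread(target=aggregate, args=(q2, result))]
for t in stages:
    t.start()
for t in stages:
    t.join()
print("sum of even rows:", result[0])   # 2 + 4 + ... + 1000 = 250500
```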
IBM – Used in DB2 Parallel Edition (DB2 PE), a Database based on DB2/6000 Server
Architecture
They now contain extended functionalities for data profiling, data cleansing, Enterprise
Application Integration (EAI), Big Data processing, data governance and master data
management.
Tool Requirements
Prism solutions
Prism Warehouse Manager provides a solution for data warehousing by mapping source data to a target
database management system. It generates code to extract and
integrate data, create and manage metadata, and create a subject-oriented historical database. It
extracts data from multiple sources – DB2, IMS, VSAM, RMS and sequential files.
SAS institute
SAS data access engines serve as extraction tools to combine common variables and transform
data representation forms for consistency. SAS supports decision reporting and graphing, so it acts
as the front end.
Workstation based: In this approach the user must transfer the PDL file from the mainframe to a location
accessible by PASSPORT.
PASSPORT offers
A metadata directory at the core of the process, with robust data conversion, migration, analysis and
auditing facilities. The PASSPORT Workbench is a GUI workbench that enables project development
on a workstation and also supports the various personnel who design, implement or use the warehouse.
The Metacenter
The MetaCenter is developed by Carleton Corporation and designed to put users in control of the
data warehouse. The heart of the MetaCenter is the metadata dictionary. The MetaCenter, in
conjunction with PASSPORT, provides a number of capabilities:
Data extraction and transformation
Event management and notification
Data mart subscription
Control center mover
Validity Corporation
Validity Corporation's Integrity data reengineering tool is used to investigate, standardize,
transform and integrate data from multiple operational systems and external sources. Its main
focus is on data quality, in particular on avoiding the GIGO (garbage in, garbage out)
principle.
Benefits of the Integrity tool: it builds accurate consolidated views of customers, suppliers, products
and other corporate entities, and maintains the highest quality of data.
Pros and Cons: In SSIS, the transformation is processed in memory, so the integration
process is much faster on SQL Server. However, SSIS is compatible only with SQL Server.
Pros and Cons: It fits every kind of data integration process, from small file
transformations to big data migration and analysis. Moreover, its highly scalable architecture
has created a huge customer base.
Pros and Cons: This tool has clear and easy integration with the production process and other
business process components. You can also find exceptionally good auditing and data
capturing processes with this ETL tool.
Pros and Cons: Data profiling and data validation are very attractive features in this ETL tool.
The advanced version of the data validation also serves as a firewall for your data network.
The only downside is that it is more suitable for small and mid-sized enterprises.
Pros and Cons: This tool works cross-platform, so its user circle is not restricted to users of
one operating system. The non-availability of a debugging facility is one of the reasons why
big enterprises do not opt for CloverETL.
Pros and Cons: Compared to the other ETL tools, this tool has a slow performance rate. The
other major drawback is the absence of a debugging facility.
AB Initio
AB Initio is an enterprise software company whose products are very user friendly for data
processing. Customers can use these tools for data integration and data warehousing, and also in
support of retail and banking.
Pros and Cons: This is considered one of the most efficient and fastest data integration tools.
1.6 Meta Data
Metadata is simply defined as a set of data that describes and gives information about
other data, i.e., data about data.
The term metadata is often used in the context of Web pages, where it describes page content for
a search engine. For example, the index of a book serves as metadata for the contents of the
book.
Metadata acts as a directory. This directory helps the decision support system to locate the
contents of a data warehouse.
Business Metadata
Business metadata is data that adds business context to other data. It provides information
authored by business people and/or used by business people. It is in contrast to technical
metadata, which is data used in the storage and structure of the data in a database or system. A
simple example of business metadata is a glossary entry: hover functionality in an application or
web form can enable a glossary definition to be shown when the cursor is on a field or term.
Technical Metadata
Technical metadata describes the information required to access the data, such as where the
data resides or the structure of the data in its native environment. Technical metadata represents
information that describes how to access the data in its original native data storage. It includes
database system names, table and column names and sizes, data types and allowed values.
Technical metadata also includes structural information such as primary and foreign key
attributes and indices.
Using our example of an address book database, the following represent the technical
metadata we know about the ZIP code column:
Named ZIPCode
Nine characters long
A string
Located in the StreetAddress table
Uses SQL Query Language
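Such an entry can be stored as a small machine-readable record; the sketch below (hypothetical field names) captures the ZIP code column's technical metadata as a Python dataclass, much as a metadata repository might.

```python
# A technical-metadata record for one column, mirroring the ZIP code example.
from dataclasses import dataclass, asdict

@dataclass
class ColumnMetadata:
    name: str              # column name as stored in the database
    table: str             # table the column belongs to
    data_type: str         # native data type
    length: int            # declared size
    access_language: str   # language used to access the data

zip_code = ColumnMetadata(name="ZIPCode",
                          table="StreetAddress",
                          data_type="string",
                          length=9,
                          access_language="SQL")
print(asdict(zip_code))
```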
Operational Metadata -
Operational metadata are metadata about operational data. They include the currency of data and data
lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data
means the history of the data as it is migrated and of the transformations applied to it.
(c) information on the implicit semantics of data, along with any other kind of data that aids the
end-user in exploiting the information of the warehouse;
(d) information on the infrastructure and physical characteristics of components and the sources
of the data warehouse; and
(e) information, including security, authentication, and usage statistics, that aids the administrator
in tuning the operation of the data warehouse as appropriate. Figure 1.23 shown below represents a
metadata element for the customer entity, and Figure 1.24 shows who needs metadata.
Metadata Repository
A metadata repository is a database created to store metadata; the metadata itself is housed in
and managed by the metadata repository. Metadata repository management software is used to map
the source data to the target database, generate code for data transformation, integrate and
transform the data, and control the movement of data to the data warehouse. Metadata provides
decision-support-oriented pointers to warehouse data and provides the link between the data
warehouse and decision support systems. The data warehouse architecture should ensure that there
is a mechanism to populate the metadata repository, and all access paths to the data warehouse
should have metadata as an entry point.
PART A
PART-B
1. Enumerate the building blocks of data warehouse. Explain the importance of metadata in a data
warehouse environment.
2. Explain various methods of data cleaning in detail.
3. Diagrammatically illustrate and discuss the data warehousing architecture, briefly explaining
the components of a data warehouse.
4. (i) Distinguish between Data warehousing and data mining.
(ii)Describe in detail about data extraction, clean-up
5. Write short notes on (i)Transformation (ii)Metadata
6. List and discuss the steps involved in mapping the data warehouse to a multiprocessor
architecture.
7. Discuss in detail about Bitmapped Indexing
8. Explain in detail about different Vendor Solutions.
9. Explain the various groups of access tools.
10. Explain indexing
UNIT II BUSINESS ANALYSIS
Reporting and Query Tools and Applications – Tool Categories – The Need for Applications –
Cognos Impromptu – Online Analytical Processing (OLAP) – Need – Multidimensional Data Model –
OLAP Guidelines – Multidimensional versus Multirelational OLAP – Categories of Tools – OLAP
Tools and the Internet.
The data warehouse is accessed using an end-user query and reporting tool from Business
Objects. The principal purpose of data warehousing is to provide information to business users
for strategic decision making. These users interact with the data warehouse using front-end
tools, or by getting the required information through the information delivery system.
Reporting tools
Managed Query tools
Executive information systems
On-line analytical processing
Data mining
Report writers:
Report writers are inexpensive desktop tools designed for end users. Generally they have
graphical interfaces and built-in charting functions. They can pull a group of data from a variety
of data sources and integrate them in a single report. Leading report writers include Crystal
Reports, Actuate and Platinum Technology Inc.'s InfoReports. Vendors are trying to increase the
scalability of report writers by supporting three-tiered architectures on Windows NT and UNIX
servers. They are also beginning to offer object-oriented interfaces for designing and
manipulating reports and modules for performing ad hoc queries and OLAP analysis.
Users and related activities
The first four types of access are covered by a combined category of tools called
query and reporting tools. Three distinct types of reporting are identified:
1. Creation and viewing of standard reports – routine delivery of reports based on
predetermined measures.
2. Definition and creation of ad hoc reports – allows managers and business users to
quickly create their own reports and get quick answers to business questions.
3. Data exploration – users can easily "surf" through data without a preset path to
quickly uncover business trends or problems.
The above needs may require applications that often take the form of custom-developed screens
and reports, which retrieve frequently used data and format it in a predefined standardized way.
A catalog contains:
Folders - Meaningful groups of information representing columns from one or more tables
Columns - Individual data elements that can appear in one or more folders
Calculations - Expressions used to compute required values from existing data
Conditions - Used to filter information so that only a certain type of information is
displayed
Prompts - Pre-defined selection criteria prompts that users can include in reports they
create
Other components, such as metadata, a logical database name, join information and user
classes
Impromptu reporting begins with the information catalog, a LAN based repository
(Storage area) of business knowledge and data access rules. The catalog insulates users from
such technical aspects of the database as SQL syntax, table joins and hidden table and field
names.
Use of catalogs
view, run, and print reports
export reports to other applications
disconnect from and connect to the database
create reports
change the contents of the catalog
add user classes
Interactive reporting: Impromptu unifies querying and reporting in a single interface. Users can
perform both these tasks by interacting with live data in one integrated module.
Frames: Impromptu offers an interesting frame-based reporting style. Frames are building blocks
that may be used to produce reports that are formatted with fonts, borders, colors, shading,
etc. Frames, or combinations of frames, simplify building even complex reports. The data formats
itself according to the type of frame selected by the user.
List frames are used to display detailed information.
Form frames offer layout and design flexibility.
Cross-tab frames are used to show the totals of summarized data at selected intersections.
Chart frames make it easy for users to see their business data in 2-D and 3-D
displays using line, bar, ribbon, area and pie charts.
Text frames allow users to add descriptive text to reports and display binary
large objects (BLOBs) such as product descriptions or contracts.
Picture frames incorporate bitmaps into reports or specific records, perfect for
visually enhancing reports.
OLE frames make it possible for users to insert any OLE object into a report.
2.3.5. Impromptu Request Server
Impromptu introduced the new request server, which allows clients to off-load query
processing to the server. A PC user can now schedule a request to run on the server, and an
Impromptu request server will execute the request, generating the result on the server. When
done, the scheduler notifies the user, who can then access, view or print at will from the PC. The
Impromptu request server runs on HP-UX 9.X, IBM AIX 4.X and Sun Solaris 2.4. It supports
data maintained in ORACLE 7.X and SYBASE System 10/11.
2.3.6. Supported databases
Impromptu provides native database support for ORACLE, Microsoft SQL Server,
SYBASE SQL Server, Omni SQL Gateway, SYBASE Net Gateway, MDI DB2 Gateway,
Informix, CA-Ingres, Gupta SQLBase, Borland InterBase, Btrieve, dBASE, Paradox, and
ODBC for accessing any database with an ODBC driver.
2.3.9. PowerBuilder
PowerBuilder supports object-oriented applications, including encapsulation, polymorphism,
inheritance and GUI objects. Once an object is created and tested, it can be reused by other
applications. The strength of PowerBuilder is in developing Windows applications for client/server
architectures. PowerBuilder offers a fourth-generation language, an object-oriented graphical
development environment, and the ability to interface with a wide variety of database
management systems.
DataWindows Painter
These are dynamic objects that provide access to databases and other data sources such
as ASCII files. PowerBuilder applications use DataWindows to connect to multiple databases
and files, as well as to import and export data in a variety of formats such as dBase, Excel, Lotus
and tab-delimited text. DataWindows supports execution of stored procedures and
allows developers to select a number of presentation styles from a list of tabular, grid, label,
and free-form styles. It also allows a user-specified number of rows to be displayed in a display line.
QueryPainter
This is used to generate SQL statements that can be stored in PowerBuilder
libraries. Thus, using the Application Painter, Window Painter, and DataWindows Painter facilities,
a simple client/server application can be constructed literally in minutes. A rich set of SQL
functions is supported, including CONNECT/DISCONNECT, DECLARE, OPEN and CLOSE
cursor, FETCH, and COMMIT/ROLLBACK. PowerBuilder supplies several other painters:
Database Painter
This painter allows developers to pick tables from the list box and examine and edit join
conditions and predicates, key fields, extended attribute, display formats and other database
attributes.
Structure Painter- This painter allows creation and modification of data structures and groups
of related data elements.
Preference Painter – This is a configuration tool that is used to examine and modify
configuration parameters for the PowerBuilder development environment.
Menu Painter – This painter creates menus for the individual windows and the entire
application.
Function Painter – This is a development tool that assists developers in creating function calls
and parameters using combo boxes.
Library Painter – This painter manages the library in which the application components reside.
It also supports check-in and check-out of library objects for developers.
User object Painter – This painter allows developers to create custom controls. These custom
controls can be treated just like standard PowerBuilder controls.
Help Painter – This is a built-in help system, similar to the MS Windows Help facility.
2.3.13. Forté
In a three-tiered client/server computing architecture, an application's functionality is
partitioned into three distinct pieces: presentation logic with its GUI, application business logic,
and data access functions. The presentation logic is placed on a client, while the application logic
resides on an application server, and the data access logic and the database reside on a database or a
data warehouse server.
Application partitioning:
Forté allows developers to build a logical application that is independent of the
underlying environment. Developers build an application as if it were to run entirely on a single
machine; Forté automatically splits the application apart to run across the clients and servers that
constitute the deployment environment. It supports tunable application partitioning.
Shared-application services:
With Forté, developers build a high-end application as a collection of application
components. The components can include client functionality such as data presentation and other
desktop processing. Shared-application services form the basis for a three-tiered application
architecture in which clients request actions from application services that, in turn, access one or
more of the underlying data sources. Each tier can be developed and maintained independently of
the others.
Business events:
Business events automate the notification of significant business occurrences so that
appropriate actions can be taken immediately by users. Forté detects the events, whether they
originate on a user's desktop or in an application service, and sends notification to all the
application components that have expressed interest in that event. Forté supports three functional
components:
Application Development Facility (ADF) - Distributed object computing framework, to define
user interfaces and application logic. It includes GUI designer for building user screens, a
proprietary 4GL called Transactional object-oriented language (TOOL).
System Generation Facility (SGF) - This assists developers in partitioning the application,
generating executables for distribution. Forté's most powerful feature is its ability to automate
partitioning of the application into client and server components. SGF automatically puts
processes on the appropriate device on the basis of the application's logic and platform inventory.
Distributed Execution Facility (DEF) - This provides tools for managing applications at
runtime, including system administration support, a distributed object manager to handle
communications between applications partitions, and a performance monitor.
Web and Java integration - Release 3.0 provides integration with Java, desktop, and
mainframe platforms: integration with Java, ActiveX and ActiveX server support; the ability to
call Forté servers from OLE; support for calling Forté application servers from C++
modules; and an option to generate and compile C++ code for client modules. A 4GL Profiler
provides detailed data on an application's performance.
Portability and supported platforms - Forté provides transparent portability across the most
common client/server platforms for both development and deployment. Forté masks the
differences while preserving the native look and feel of each environment. Any set of supported
platforms can be used for deployment. Server/host platforms include Data General AViiON,
Digital Alpha OpenVMS, UNIX, HP 9000, IBM RS/6000, Sun SPARC, and Windows NT.
Desktop GUI support includes Macintosh, Motif, and Windows.
2.3.14 Information Builders - The products from Information Builders are Cactus and FOCUS
Fusion.
Cactus
It is a new second-generation, enterprise-class client/server development environment. Cactus
lets developers create, test and deploy business applications spanning the Internet. It is a three-
tiered development environment and enables creation of applications of any size and scope. It
builds highly reusable components for distributed enterprise-class applications through a visual,
object-based development environment. Cactus provides access to a wealth of ActiveX, VBX,
and OLE controls.
Web-enabled access: Cactus offers full application development for the Web with no prior
knowledge of HTML, Java or complex 3GLs. Developers can build traditional PC-based
front ends or Web applications for industry standards, all from one toolbox, and can focus
on the business problem rather than the underlying technology.
Application Manager – an integrated application repository that manages the data access,
business logic and presentation components created during development.
Partition Manager– a component that allows developers to drag locally developed procedures
and drop them on different Cactus servers anywhere in the enterprise.
Object browser– offers developers direct access to any portion of a multi-tiered application.
Cactus OCX– an OLE Custom Control that allows any cactus procedure to be called by a third
party application.
Focus Fusion:
FOCUS Fusion, a tool from Information Builders, is a multidimensional database technology for OLAP
and data warehousing. Its features include:
Fast query and reporting - Its advanced indexing, parallel query and rollup facilities provide high
performance for reports, queries and analyses.
Integrated copy management facilities, which schedule automatic data refresh from any source
into Fusion. Open access via industry-standard protocols, such as ANSI SQL, ODBC, and HTTP
via EDA/SQL, so that Fusion works with hundreds of desktop tools including World Wide Web
browsers.
2.3. OLAP
OLAP stands for Online Analytical Processing. It uses database tables (fact and
dimension tables) to enable multidimensional viewing, analysis and querying of large amounts
of data. For example, OLAP technology could provide management with fast answers to complex
queries on their operational data, or enable them to analyze their company's historical data for trends
and patterns. Online Analytical Processing (OLAP) applications and tools are those that are designed
to ask complex queries of large multidimensional collections of data. For this reason, OLAP is
closely associated with data warehousing.
The multidimensional data model views data as a cube, as shown in figure 2.2. The table at
the left contains detailed sales data by product, market and time. The cube on the right associates
the sales number (units sold) with the dimensions product type, market and time, with the unit
variables organized as cells in an array.
This cube can be expanded to include another array, price, which can be associated with
all or only some dimensions. The cube supports matrix arithmetic that allows it to present
the dollar sales array simply by performing a single matrix operation on all cells of the
array (dollar sales = units * price). The response time of a multidimensional query depends on
how many cells have to be added on the fly. The caveat here is that, as the number of dimensions
increases, the number of cube cells increases exponentially. On the other hand, the majority of
multidimensional queries deal with summarized, high-level data. Therefore, the solution to
building an efficient multidimensional database is to pre-aggregate all logical subtotals and totals
along all dimensions. This aggregation is especially valuable since typical dimensions are
hierarchical in nature; i.e., the time dimension may contain hierarchies for years, quarters, months,
weeks and days, and GEOGRAPHY may contain country, state, city, etc.
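A minimal way to picture this is a sparse cube keyed by (product, market, month): dollar sales are derived as units * price, and monthly cells are pre-aggregated up the time hierarchy so that later queries read stored totals instead of adding cells on the fly. The names and figures below are invented for illustration.

```python
# Sparse cube sketch: store only non-empty cells, compute dollar sales as
# units * price, and pre-aggregate monthly cells up to quarter totals.
from collections import defaultdict

# (product, market, month) -> units sold; empty cells are simply absent.
units = {("PC", "East", "2023-01"): 10,
         ("PC", "East", "2023-02"): 12,
         ("Printer", "West", "2023-03"): 7}
price = {"PC": 900.0, "Printer": 120.0}      # one price applies across markets/months

# Derived measure: dollar sales per non-empty cell (units * price).
dollars = {cell: n * price[cell[0]] for cell, n in units.items()}

# Pre-aggregate along the time hierarchy: month -> quarter.
def quarter(month):
    y, m = month.split("-")
    return f"{y}-Q{(int(m) - 1) // 3 + 1}"

by_quarter = defaultdict(float)
for (product, market, month), amount in dollars.items():
    by_quarter[(product, market, quarter(month))] += amount

print(dict(by_quarter))   # e.g. ('PC', 'East', '2023-Q1') -> 19800.0
```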
Another way to reduce the size of the cube is to properly handle sparse data. Often, not
every cell has a meaning across all dimensions (many marketing databases may have more than
95 percent of all cells empty or containing 0). Another kind of sparse data is created when many
cells contain duplicate data (i.e., if the cube contains a PRICE dimension, the same price may
apply to all markets and all quarters for the year). The ability of a multidimensional database to
skip empty or repetitive cells can greatly reduce the size of the cube and the amount of processing.
Dimensional hierarchy, sparse data management, and pre aggregation are the keys, since
they can significantly reduce the size of the database and the need to calculate values. Such a
design obviates the need for multitable joins and provides quick and direct access to the arrays
of answers, thus significantly speeding up execution of the multidimensional queries.
In this cube we can observe that each side of the cube represents one of the elements of
the question: the x-axis represents time, the y-axis represents the products and the z-axis
represents the different centers. The cells of the cube represent the number of products sold, or
can represent the price of the items. Figure 2.3 also gives a different understanding of the
drill-down operation. The dimensions represented need not be directly related to one another.
As the number of dimensions increases, the size of the cube will also increase exponentially,
and the response time of a query against the cube depends on the size of the cube.
Aggregation (roll-up)
  dimension reduction: e.g., total sales by city
  summarization over aggregate hierarchy: e.g., total sales by city and year -> total sales by region and by year
Selection (slice) defines a subcube
  e.g., sales where city = Palo Alto and date = 1/15/96
Navigation to detailed data (drill-down)
  e.g., (sales - expense) by city, top 3% of cities by average income
Visualization operations (e.g., pivot or dice)
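These operations can be mimicked on a toy data set; the sketch below (made-up sales rows, using pandas) performs a roll-up to region and year, a slice on one city, a drill-down to city and product detail, and a pivot for viewing.

```python
# Roll-up, slice, drill-down and pivot on a toy sales table using pandas.
import pandas as pd

sales = pd.DataFrame({
    "city":    ["Palo Alto", "Palo Alto", "San Jose", "San Jose"],
    "region":  ["West", "West", "West", "West"],
    "product": ["PC", "Printer", "PC", "Printer"],
    "year":    [1996, 1996, 1996, 1997],
    "amount":  [120.0, 40.0, 200.0, 55.0],
})

# Roll-up: summarize city-level detail to region-and-year totals.
rollup = sales.groupby(["region", "year"])["amount"].sum()

# Slice: restrict the cube to one member of a dimension (city = Palo Alto).
palo_alto = sales[sales["city"] == "Palo Alto"]

# Drill-down: move from region/year back to finer city/product/year detail.
drilldown = sales.groupby(["city", "product", "year"])["amount"].sum()

# Pivot: re-orient the result with products as columns for viewing.
pivot = sales.pivot_table(index="city", columns="product",
                          values="amount", aggfunc="sum")

print(rollup, palo_alto, drilldown, pivot, sep="\n\n")
```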
Dr. E. F. Codd, the "father of the relational model", created a list of rules to deal with OLAP
systems. Users should prioritize these rules according to their needs to match their business
requirements.
These rules are:
1) Multidimensional conceptual view: The OLAP tool should provide an appropriate
multidimensional business model that suits the business problems and requirements.
2) Transparency: The OLAP system's technology, the underlying databases and computing
architecture, and the heterogeneity of input data sources should be transparent to users,
to preserve their productivity and proficiency with familiar front-end environments and
tools.
3) Accessibility: The OLAP tool should access only the data required for the analysis.
Additionally, the system should be able to access data from all heterogeneous
enterprise data sources required for the analysis.
4) Consistent reporting performance: As the number of dimensions and the size of the
database increase, users should not perceive any degradation in performance.
5) Client/server architecture: The OLAP tool should use a client/server architecture to
ensure better performance, adaptability, interoperability, and flexibility.
6) Generic dimensionality: Every data dimension should be equivalent in both its structure
and operational capabilities.
7) Dynamic sparse matrix handling: The OLAP tool should be able to manage the sparse
matrix and so maintain the level of performance.
8) Multi-user support: The OLAP tool should allow several users to work together
concurrently.
9) Unrestricted cross-dimensional operations: The OLAP tool should be able to recognize
dimensional hierarchies and automatically perform operations across the dimensions of
the cube.
10) Intuitive data manipulation: Consolidation path re-orientation, drilling down across
columns or rows, zooming out, and other manipulations inherent in the consolidation
path outline should be accomplished via direct action upon the cells of the analytical
model, and should require neither the use of a menu nor multiple trips across the user
interface.
11) Flexible reporting: This is the ability of the tool to present rows and columns in a
manner suitable for analysis.
12) Unlimited dimensions and aggregation levels: This depends on the kind of business,
where multiple dimensions and defined hierarchies can be made. The OLAP system
should not impose any artificial restrictions on the number of dimensions or
aggregation levels.
2.6.1. MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary formats.
That is, the data structures use array-based technology and, in most cases, provide improved storage
techniques to minimize the disk space requirements through sparse data management. This
architecture enables excellent performance when the data is utilized as designed, and predictable
application response time for applications addressing a narrow breadth of data for a specific DSS
requirement. In addition, some products treat time as a special dimension (e.g., Pilot Software's
Analysis Server), enhancing their ability to perform time series analysis. Other products provide
strong analytical capabilities (e.g., Oracle's Express Server) built into the database.
Applications requiring iterative and comprehensive time series analysis of trends are well
suited for MOLAP technology (e.g., financial analysis and budgeting). Examples include Arbor
Software's Essbase, Oracle's Express Server, Pilot Software's Lightship Server and Kenan
Technology's Multiway.
Advantages:
Excellent performance: MOLAP cubes are built for fast data retrieval, and are
optimal for slicing and dicing operations.
Can perform complex calculations: All calculations have been pre-generated when
the cube is created. Hence, complex calculations are not only doable, but they
return quickly.
Disadvantages:
Limited in the amount of data it can handle: Because all calculations are performed
when the cube is built, it is not possible to include a large amount of data in the
cube itself. This is not to say that the data in the cube cannot be derived from a
large amount of data. Indeed, this is possible. But in this case, only summary-level
information will be included in the cube itself.
Requires additional investment: Cube technology is often proprietary and does not
already exist in the organization. Therefore, to adopt MOLAP technology, chances
are that additional investments in human and capital resources are needed.
Disadvantages:
Performance can be slow: Because each ROLAP report is essentially a SQL
query (or multiple SQL queries) against the relational database, the query time can be
long if the underlying data size is large.
Limited by SQL functionalities: Because ROLAP technology mainly relies on
generating SQL statements to query the relational database, and SQL statements
do not fit all needs (for example, it is difficult to perform complex calculations
using SQL), ROLAP technologies are traditionally limited by what SQL
can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box
complex functions as well as the ability to allow users to define their
own functions.
This style of OLAP, which is beginning to see increased activity, provides users with the
ability to perform limited analysis, either directly against RDBMS products, or by
leveraging an intermediate MOLAP server (shown in figure 2.6). Some products have developed
features to provide "datacube" and "slice and dice" analysis capabilities. This is achieved by first
developing a query to select data from the DBMS, which then delivers the requested data to the
desktop, where it is placed into a datacube. The data cube can be stored and maintained locally
on the desktop to reduce the overhead required to create the structure each time the query is
executed. Once the data is in the datacube, users can perform multidimensional analysis. The
tools can also work with MOLAP servers, and the data from the relational DBMS can be delivered to
the MOLAP server, and from there to the desktop.
The simplicity of the installation and administration of such products makes them
particularly attractive to organizations looking to provide seasoned users with more sophisticated
analysis capabilities, without the significant cost and maintenance of more complex products.
With all the ease of installation and administration that accompanies the desktop OLAP
products, most of these tools require the data cube to be built and maintained on the desktop,
with metadata definitions that assist users in retrieving the correct set of data that makes up the
data cube. The need for each user to build a custom data cube, the lack of data consistency among
users, and the relatively small amount of data that can be efficiently maintained remain significant
drawbacks.
Examples: Cognos Software's PowerPlay, Andyne Software's Pablo, Dimensional Insight's
CrossTarget, and Speedware's Media.
OLAP tools provide a way to view the corporate data. The tools aggregate data along
common business subjects or dimensions and then let the users navigate through the hierarchies
and dimensions. Some tools, such as Arbor Software Corp.'s Essbase, pre-aggregate data in a
special multidimensional database. Other tools work directly against relational data and
aggregate data on the fly. Some tools process OLAP data on the desktop instead of the server;
desktop OLAP tools include Cognos' PowerPlay, Brio Technology and Andyne's Pablo. Many
of the differences between OLAP tools are fading. Vendors are rearchitecting their products to
give users greater control over the tradeoff between flexibility and performance that is inherent in
OLAP tools, and many vendors are rewriting pieces of their products in Java.
Database vendors eventually might be the largest OLAP providers; leading database
vendors are incorporating OLAP functionality in their database kernels. Examples:
Cognos PowerPlay, IBI FOCUS Fusion, Pilot Software.
PowerPlay from Cognos is a mature and popular software tool for multidimensional
analysis of corporate data. It can be characterized as an MQE tool that can leverage a corporate
investment in relational database technology to provide multidimensional access to
enterprise data, while at the same time providing robustness, scalability, and administrative control. It
is an open OLAP solution that can interoperate with a wide variety of third-party software tools,
databases and applications. Its features include:
Support for enterprise data sets of 20 million records, 100,000 categories, and 100
measures.
A drill-through capability for queries from Cognos Impromptu.
Powerful 3-D charting capabilities with background and rotation control for advanced
users.
Faster and easier ranking of data.
Unlimited undo levels and customizable toolbars.
A "home" button that automatically resets the dimension line to the top level.
Full support for OLE2 Automation, as both a client and a server.
Linked displays that give users multiple views of the same data in a report.
Complete integration with relational database security and data management features.
FOCUS Fusion is a modular tool that supports flexible configurations for diverse needs, and
includes the following components:
Fusion/Dbserver
Fusion/Administrator
Fusion/PDQ
EDA/Link
EDA/WebLink
EDA Gateways
Enterprise Copy Manager for Fusion
The Internet/WWW and data warehousing are tightly bound together. The reason for this
trend is simple: the compelling advantages of using the Web for access are magnified even further
in a data warehouse. Indeed:
The Internet is a virtually free resource which provides universal connectivity within
and between companies.
The Web allows companies to store and manage both data and applications on servers that
can be centrally managed, maintained and updated.
For these and other reasons, the Web is a perfect medium for decision support. Let's
look at the general features of Web-enabled data access.
First-generation Web sites – web sites used a static distribution model, in which the
client can access the decision support report through static HTML pages via web browsers. In
this model, the decision support reports were stored as HTML documents and delivered to users
on request. Clearly, this model has some serious deficiencies, including inability to provide web
clients with interactive analytical capabilities such as drill-down.
Second-generation Web sites – web sites support interactive database queries by utilizing
a multitiered architecture in which a web client submits a query in the form of an HTML-encoded
request to a web server, which in turn transforms the request for structured data into a CGI
(HTML gateway) script. The gateway submits the SQL queries to the database, receives the
results, translates them into HTML, and sends the pages to the requester, as shown in figure 2.7.
Requests for the unstructured data can be sent directly to the unstructured data store.
Figure 2.7: Web processing Model
Third-generation Web sites – web sites replace HTML gateways with web based
application servers. These servers can download Java Applets or ActiveX applications that
execute on clients or on Web-based application servers. Vendors' approaches for deploying tools on
the Web include:
HTML publishing
Helper applications
Plug-ins
Server-centric components
Java and ActiveX applications
Essbase is one of the most ambitious of the early Web products. It includes not only
OLAP manipulations such as drill up, down, and across; pivot; slice and dice; and fixed and
dynamic reporting, but also data entry, including full multiuser concurrent write capabilities – a
feature that differentiates it from others. Essbase does not have a client package whose sales might
suffer from sales of its Web gateway product, which makes sense from a business perspective: the
Web product does not replace the administrative and development modules, only user access for
query and update.
MicroStrategy's flagship product, DSS Agent, was originally a Windows-only tool, but
MicroStrategy has smoothly made the transition, first with an NT-based server product, and now
as one of the first OLAP tools to have a Web-access product. DSS Agent works in concert with the
complement of MicroStrategy's product suite – the DSS Server relational OLAP server, the DSS
Architect data modeling tool, and the DSS Executive design tool for building executive information
systems.
Brio Technology
Brio shaped a suite of new products called brio.web.warehouse. This suite implements
several of the approaches listed above for deploying decision support OLAP applications on the
web. The key to Brio's strategy is a new server component called brio.query.server. The server
works in conjunction with Brio Enterprise and Brio's web clients – brio.quickview and
brio.insight – and can off-load processing from the clients, thus enabling users to access Brio
reports via Web browsers. On the client side, Brio uses plug-ins to give users viewing and report
manipulation capabilities.
UNIT-II QUESTION BANK
PART A
PART-B
Categorical data represents characteristics. Therefore it can represent things like a person's
gender, language, etc. Categorical data can also take on numerical values (example: 1 for female
and 0 for male). Note that those numbers don't have mathematical meaning.
Nominal values represent discrete units and are used to label variables. A nominal value has
no quantitative value; it is just a label. Nominal data has no order, and if the order
of its values is changed, the meaning does not change. Two examples of nominal features are
shown in the diagram in Figure 3.2.
Figure 3.2 Nominal Features
The left feature, which describes a person's gender, would be called dichotomous, which is a
type of nominal scale that contains only two categories.
Numerical data is data that is measurable, such as time, height, weight, amount, and so on.
Numerical data can be identified by checking whether it makes sense to average the values or to
order the data in either ascending or descending order.
Discrete data is data in which the values are distinct and separate. In other words, discrete
data can be defined as data that can only take on certain values. This type of data can't be
measured but it can be counted. It basically represents information that can be categorized
into a classification. An example is the number of heads in 100 coin flips. We can check whether
we are dealing with discrete data by asking the following two questions: can you count it, and
can it be divided up into smaller and smaller parts?
Continuous data represents measurements and therefore its values can't be counted but
they can be measured. An example would be the height of a person, which you can describe
by using intervals on the real number line.
Interval values represent ordered units that have the same difference. Therefore we speak of
interval data when we have a variable that contains numeric values that are ordered and
where we know the exact differences between the values. An example would be a feature that
contains temperature of a given place like you can see below in Figure 3.4.
Ratio values are also ordered units that have the same difference. Ratio values are the same as
interval values, with the difference that they do have an absolute zero. Good examples are
height, weight, length etc as shown in Figure 3.5.
Figure 3.5 Ratio Data
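To make the distinction between these scales concrete, the short sketch below (made-up values, using pandas) encodes a nominal, an ordinal, an interval and a ratio attribute and notes which summaries are meaningful for each.

```python
# Categorical vs numerical attributes: which summaries make sense for each scale.
import pandas as pd

people = pd.DataFrame({
    "gender":      ["female", "male", "female"],                  # nominal
    "shirt_size":  pd.Categorical(["S", "L", "M"],
                                  categories=["S", "M", "L"],
                                  ordered=True),                   # ordinal
    "temperature": [36.6, 37.1, 36.9],                             # interval (°C)
    "height_cm":   [162.0, 180.0, 171.0],                          # ratio (true zero)
})

print(people["gender"].mode()[0])          # nominal: only counts/mode are meaningful
print(people["shirt_size"].max())          # ordinal: ordering (min/max/median) is meaningful
print(people["temperature"].mean())        # interval: differences and means are meaningful
print(people["height_cm"].max() / people["height_cm"].min())   # ratio: ratios are meaningful
```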
3.3 Data Mining Functionalities
Data Mining is the process of extracting information from huge sets of data. Data mining
functionalities are used to specify the kind of patterns to be found in data mining tasks. In
general, data mining tasks can be classified into two categories:
Descriptive mining tasks characterize the general properties of the data in the database.
Descriptive analytics looks at data and analyzes past events for insight as to how to
approach the future. Descriptive analytics looks at past performance and understands that
performance by mining historical data to look for the reasons behind past success or failure.
Almost all management reporting such as sales, marketing, operations, and finance, uses this type
of post- mortem analysis. For example, descriptive analytics examines historical electricity usage
data to help plan power needs and allow electric companies to set optimal prices.
Predictive mining tasks perform inference on the current data in order to make
predictions. Going a step further, prescriptive analytics automatically synthesizes big data, mathematical
sciences, business rules, and machine learning to make predictions and then suggests decision
options to take advantage of the predictions. Prescriptive analytics goes beyond predicting future
outcomes by also suggesting actions to benefit from the predictions and showing the decision
maker the implications of each decision option. Prescriptive analytics not only anticipates what
will happen and when it will happen, but also why it will happen.
Data can be associated with classes or concepts. For example, in an electronic shop
named All Electronics store, classes of items for sale include computers and printers, and
concepts of customers include big Spenders and budget Spenders. Such descriptions of a class or
a concept are called class/concept descriptions. These descriptions can be derived via
Data characterization
Data discrimination
Both data characterization and discrimination
Data characterization
It is a summarization of the general characteristics or features of a target class of data. For
example, to study the characteristics of software products whose sales increased by 10% in the
last year, the data related to such products can be collected by executing an SQL query. There are
several methods for effective data summarization, and their outputs can be presented in various forms.
Example: A data mining system should be able to produce a description summarizing the
characteristics of customers who spend more than $1,000 a year at All Electronics store.
Data discrimination
It is a comparison of the general features of target class data objects with the general features of
objects from one or a set of contrasting classes. The target and contrasting classes can be
specified by the user, and the corresponding data objects retrieved through database queries.
Example: The user may like to compare the general features of software products whose sales
increased by 10% in the last year with those whose sales decreased by at least 30% during the
same period.
A data mining system should be able to compare two groups of AllElectronics customers,
such as those who shop for computer products regularly (more than two times a month) versus
those who rarely shop for such products (i.e., less than three times a year). The resulting
description provides a general comparative profile of the customers, such as 80% of the
customers who frequently purchase computer products are between 20 and 40 years old and have
a university education, whereas 60% of the customers who infrequently buy such products are
either seniors or youths, and have no university degree. Drilling down on a dimension, such as
occupation, or adding new dimensions, such as income level, may help in finding even more
discriminative features between the two classes.
Frequent patterns, as the name suggests, are patterns that occur frequently in data. Frequent patterns are itemsets, subsequences, or substructures that appear in a data set with frequency no less than a user-specified threshold. A frequent itemset typically refers to a set of items, such as milk and bread, that frequently appear together in a transactional data set.
Confidence is the percentage of transactions in T containing Wine that also contain Cheese; in other words, the probability of having Cheese given that Wine is already in the basket (e.g., 65% of all those who bought Wine also bought Cheese).
Confidence (A => B) = P(B|A)
3.3.3 Classification
Classification models predict categorical class labels. Classification is the process of finding
a best model that describes and distinguishes data classes or concepts, for the purpose of being
able to use the model to predict the class of objects whose class label is unknown. The class label
is usually the target variable in classification, which makes it special from other categorical
attributes. Derived model is based on the analysis of a set of training data. Some of the
commonly used classification algorithms are listed as follows.
Decision trees
Neural networks
Bayesian classification
IF-THEN rules
Data classification is a two-step process, as shown for the loan application data of Figure 3.6 (a).
In the first step, a classifier is built describing a predetermined set of data classes or concepts.
This is the learning step (or training phase), where a classification algorithm builds the classifier
by analyzing or ―learning from‖ a training set made up of database tuples and their associated
class labels. A tuple, X, is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes, respectively, A1, A2, ..., An. Each tuple, X, is assumed to belong to a predefined class as determined by another database attribute called the class label attribute. The class label attribute is discrete-valued and unordered. It is categorical in that each value serves as a category or class.
In the second step (Figure 3.6(b)), the model is used for classification. First, the predictive
accuracy of the classifier is estimated. If we were to use the training set to measure the accuracy
of the classifier, this estimate would likely be optimistic, because the classifier tends to overfit
the data (i.e., during learning it may incorporate some particular anomalies of the training data
that are not present in the general data set overall). Therefore, a test set is used, made up of test
tuples and their associated class labels. These tuples are randomly selected from the general data
set. They are independent of the training tuples, meaning that they are not used to construct the
classifier.
3.3.4 Prediction
Prediction is different from classification: classification predicts categorical class labels, whereas prediction models continuous-valued functions. A typical prediction task involves the following steps:
Minimal generalization
Attribute relevance analysis
Generalized linear model construction
Prediction
Relevance analysis determines the major factors that influence the prediction, using measures such as uncertainty measurement, entropy analysis and expert judgment. Multi-level prediction is supported through drill-down and roll-up analysis.
Unlike classification and prediction, which analyze class-labeled data objects, clustering analyzes data objects without consulting a known class label. The goal is to find groups of objects such that the objects in a group are similar to one another and different from the objects in other groups. Cluster analysis is used in many applications, such as market research, pattern recognition, data analysis, and image processing.
A database may contain data objects that do not comply with the general behavior or
model of the data. These data objects are outliers. Most data mining methods discard outliers as
noise or exceptions. However, in some applications such as fraud detection, the rare events can
be more interesting than the more regularly occurring ones. The analysis of outlier data is
referred to as outlier mining. Outliers may be detected using statistical tests that assume a
distribution or probability model for the data, or using distance measures.
An outlier is a data point that is significantly different (abnormal or irregular) or deviates from
the remaining data as shown below in Figure 3.7.
Figure 3.7 Outlier
Each purple dot represents a data point in a data set. From the graph, the two data points are
considered outliers since they are very far away from the rest of the data points.
Example:
Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of
extremely large amounts for a given account number in comparison to regular charges incurred
by the same account. Outlier values may also be detected with respect to the location and type of
purchase, or the purchase frequency.
Data evolution analysis describes and models regularities or trends for objects whose
behavior changes over time. Although this may include characterization, discrimination,
association and correlation analysis, classification, prediction, or clustering of time related data,
distinct features of such an analysis include time-series data analysis, sequence or periodicity
pattern matching, and similarity-based data analysis.
Example:
Evolution analysis helps predict the value of a user-specified goal attribute based on the values of other attributes. For instance, a banking institution might want to predict whether a customer's credit would be "good" or "bad" based on their age, income and current savings.
A pattern is interesting if it is easily understood by humans, valid on new or test data with
some degree of certainty, potentially useful and novel.
―Can a data mining system generate all of the interesting patterns?‖ This refers to the
completeness of a data mining algorithm. It is often unrealistic and inefficient for data mining
systems to generate all of the possible patterns. Instead, user-provided constraints and
interestingness measures should be used to focus the search. For some mining tasks, such as
association, this is often sufficient to ensure the completeness of the algorithm.
―Can a data mining system generate only interesting patterns?‖ This is an optimization
problem in data mining. It is highly desirable for data mining systems to generate only
interesting patterns. This would be much more efficient for users and data mining systems,
because neither would have to search through the patterns generated in order to identify the truly
interesting ones. Progress has been made in this direction; however, such optimization remains a
challenging issue in data mining.
Data mining is defined as a process used to extract usable data from a larger set of raw data. It implies analyzing data patterns in large batches of data using one or more software tools. Data mining has applications in multiple fields, like science and research. Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. The KDD process consists of the following steps:
1. Data cleaning - Noise and inconsistent data are removed.
2. Data integration - Data from multiple sources are combined.
3. Data selection - Data relevant to the analysis task are retrieved from the database.
4. Data transformation - Data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
5. Data mining - An essential process where intelligent methods are applied in order to extract data patterns.
6. Pattern evaluation - To identify the truly interesting patterns representing knowledge based on some interestingness measures.
7. Knowledge presentation - Visualization and knowledge representation techniques are used to present the mined knowledge to the user.
Data cleaning
The data we have collected are not clean and may contain errors, missing values, noisy or
inconsistent data. So we need to apply different techniques to get rid of such anomalies.
Data integration
First of all the data are collected and integrated from all the different sources.
Data selection
We may not need all the data we have collected in the first step. So in this step we select only those data which we think are useful for data mining.
Data transformation
The data even after cleaning are not ready for mining as we need to transform them into
forms appropriate for mining. The techniques used to accomplish this are smoothing,
aggregation, normalization etc.
Data mining
Now we are ready to apply data mining techniques on the data to discover the interesting
patterns. Techniques like clustering and association analysis are among the many different
techniques used for data mining.
Pattern evaluation
This step involves visualization, transformation, removing redundant patterns etc from the
patterns we generated.
Knowledge presentation
This step helps the user make use of the knowledge acquired to take better decisions.
The architecture of a typical data mining system may have the following major components; its diagrammatic representation is shown below in Figure 3.9.
Data sources
Database or data warehouse server
Knowledge base
Data mining engine
Pattern evaluation module
User interface
Figure 3.9 Data Mining System
Data sources
The actual sources of data include databases, data warehouses, and the World Wide Web (WWW). Sometimes, data may reside even in plain text files or spreadsheets; the World Wide Web, or the Internet, is another big source of data. There are different sources of data that are used in the data mining process, and the data from multiple sources are integrated into a common source known as a Data Warehouse.
Relational Databases
Data Warehouses
Transactional Databases
Advanced Data and Information Systems and
Advanced Applications
A relational database for All Electronics is shown in Figure 3.10. The All Electronics company is described by the following relation tables: customer, item, employee, and branch. A Relational
database is defined as the collection of data organized in tables with rows and columns. Physical
schema in Relational databases is a schema which defines the structure of tables. Logical schema
in Relational databases is a schema which defines the relationship among tables. Standard API of
relational database is SQL. Application: Data Mining, ROLAP model, etc.
Figure 3.10 Relational Database
3.6.2 Data Warehouse
A data warehouse is defined as the collection of data integrated from multiple sources that supports queries and decision making. There are three types of data warehouse: Enterprise data warehouse, Data Mart and Virtual Warehouse. Two approaches can be used to update data in a Data Warehouse: the Query-driven Approach and the Update-driven Approach.
Application: Business decision making, Data mining, etc as shown in Figure 3.11.
The aggregate value stored in each cell of the cube is the sales amount (in thousands). For example, the total sales for the first quarter, Q1, for items relating to security systems in Vancouver is $400,000, as stored in cell (Vancouver, Q1, security). Additional cubes may be used to store aggregate sums over each dimension, corresponding to the aggregate values obtained using different SQL group-bys (e.g., the total sales amount per city and quarter, or per city and item, or per quarter and item, or per each individual dimension).
Object-Relational Databases
Sequence database
Temporal database
Time-series database
Spatial databases
Text databases
Multimedia databases
A heterogeneous database
Legacy database
Data Streams
The World Wide Web
Advanced Data and Information Systems and Advanced Applications
The object-relational data model inherits the essential concepts of object-oriented databases,
where, in general terms, each entity is considered as an object. Following the All Electronics
example, objects can be individual employees, customers, or items. Data and code relating to
an object are encapsulated into a single unit.
Data Streams: Many applications involve the generation and analysis of a new kind of data, called stream data, where data flow in and out of an observation platform (or window)
dynamically. Such data streams have the following unique features: huge or possibly infinite
volume, dynamically changing, flowing in and out in a fixed order, allowing only one or a
small number of scans, and demanding fast (often real-time) response time. Mining data
streams involves the efficient discovery of general patterns and dynamic changes within stream
data.
World Wide Web and its associated distributed information services, such as Yahoo!, Google,
America Online, and AltaVista, provide rich, worldwide, on-line information services, where
data objects are linked together to facilitate interactive access. Capturing user access patterns in
such distributed information environments is called Web usage mining. Automated Web page clustering and classification help arrange Web pages in a multidimensional manner based on their contents. Web community analysis helps identify hidden Web social networks and
communities and observe their evolution.
3.7 Data Mining Task Primitives
A data mining task can be specified in the form of a data mining query, which is input to
the data mining system. A data mining query is defined in terms of data mining task primitives
as shown in Figure 3.14. These primitives allow the user to interactively communicate with the
data mining system during discovery in order to direct the mining process, or examine the
findings from different angles or depths.
3.8 Integration of a Data Mining System with a Database or Data Warehouse System
A good system architecture will enable the data mining system to make the best use of its software environment. A critical question in the design of a data mining system is ―How to integrate or couple the DM (Data Mining) system with a database (DB) system and/or a data warehouse (DW) system?‖ There are four possible schemes: no coupling, loose coupling, semitight coupling, and tight coupling. No coupling means that a DM system will not utilize any function of a DB or DW system. Loose coupling means that a DM system will use some facilities of a DB or DW system. Semitight coupling means that, besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives can be provided in the DB/DW system. Tight coupling means that a DM system is smoothly integrated into the DB/DW system.
The Figure 3.15 represents an Integration of Data mining system with Data Warehouse in
which Data sources are loaded in to the data Warehouse and then Data Mining is performed.
Figure 3.15 Integration of a Data Mining System with a Database or Data Warehouse System
3.8.1 No coupling
No coupling means that a Data mining system will not utilize any function of a data
base or Data WareHouse system. It may fetch data from a particular source (such as a file
system), process data using some data mining algorithms, and then store the mining results in
another file.
In the architecture, data mining system does not utilize any functionality of a database or data
warehouse system. A no-coupling data mining system retrieves data from a particular data
source such as file system, processes data using major data mining algorithms and stores results
into the file system. The no-coupling data mining architecture does not take any advantages of
database or data warehouse that is already very efficient in organizing, storing, accessing and
retrieving data. The no-coupling architecture is considered a poor architecture for data mining
system, however, it is used for simple data mining processes.
3.8.2 Loose coupling
Loose coupling means that a Data Mining system will use some facilities of a Data
Base or Data Warehouse system, fetching data from a data repository managed by these
systems, performing data mining, and then storing the mining results either in a file or in a
designated place in a database or data warehouse.
In the architecture, data mining system uses the database or data warehouse for data
retrieval. In loose coupling data mining architecture, data mining system retrieves data from the
database or data warehouse, processes data using data mining algorithms and stores the result
in those systems. This architecture is mainly for memory-based data mining system that does
not require high scalability and high performance.
3.8.3 Semitight coupling
Semitight coupling means that, besides linking a Data Mining system to a Data Base/Data
WareHouse system, efficient implementations of a few essential data mining primitives can be
provided in the DB/DW system. These primitives can include sorting, indexing, aggregation,
histogram analysis, multiway join, and precomputation of some essential statistical measures,
such as sum, count, max, min, standard deviation, and so on.
In semi-tight coupling data mining architecture, besides linking to database or data warehouse
system, data mining system uses several features of database or data warehouse systems to
perform some data mining tasks including sorting, indexing, aggregation…etc. In this
architecture, some intermediate result can be stored in database or data warehouse system for
better performance.
3.8.4 Tight coupling
Tight coupling means that a Data Mining system is smoothly integrated into the Data
Base/Data Warehouse system. The data mining subsystem is treated as one functional
component of an information system. Data mining queries and functions are optimized based
on mining query analysis, data structures, indexing schemes, and query processing methods of
a Data Base or Data Warehouse system. In tight coupling data mining architecture, database or
data warehouse is treated as an information retrieval component of data mining system using
integration. All the features of database or data warehouse are used to perform data mining
tasks. This architecture provides system scalability, high performance, and integrated
information.
Pattern evaluation
The patterns discovered may not be interesting because they either represent common knowledge or lack novelty.
There are many kinds of data stored in databases and data warehouses. It is not possible for one system to mine all these kinds of data, so different data mining systems should be constructed for different kinds of data.
Data preprocessing is a data mining technique that involves transforming raw data into
an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in
certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven
method of resolving such issues.
Mean (arithmetic average): x̄ = (1/n) Σ(i=1..n) xi, and for a population of N values, μ = (Σ x) / N.
Weighted arithmetic mean: x̄ = Σ(i=1..n) wi xi / Σ(i=1..n) wi.
Median (for grouped data): median = L1 + ((n/2 − (Σ f)l) / f_median) × c,
where L1 is the lower boundary of the median interval, (Σ f)l is the sum of the frequencies of all intervals below it, f_median is the frequency of the median interval, and c is the width of the median interval.
Mode is the value that occurs most frequently in the data. There are three types of mode: unimodal, bimodal and trimodal. Empirical formula: mean − mode ≈ 3 × (mean − median).
Variance (sample): s² = (1/(n−1)) Σ(i=1..n) (xi − x̄)² = (1/(n−1)) [ Σ xi² − (1/n)(Σ xi)² ].
From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard
deviation)
From μ–2σ to μ+2σ: contains about 95% of the measurements
From μ–3σ to μ+3σ: contains about 99.7% of the measurements
Boxplot Analysis
The five-number summary of a distribution consists of Minimum, Q1, Median, Q3 and Maximum. In a boxplot the data is represented with a box: the ends of the box are at the first and third quartiles, so the height of the box is the IQR (interquartile range); the median is marked by a line within the box; and two whiskers outside the box extend to the Minimum and Maximum, as shown in Figure 3.18. The visualization of data dispersion using boxplot analysis is shown in Figure 3.19.
Missing Data
Data is not always available; for example, many tuples may have no recorded value for several attributes, such as customer income in sales data. Missing data may be due to equipment malfunction, data that was inconsistent with other recorded data and therefore deleted, data not entered due to misunderstanding, certain data not being considered important at the time of entry, or failure to register the history or changes of the data. Missing data may need to be inferred.
Missing values can be filled in automatically with a global constant (e.g., ―unknown‖, which may however form a new class), the attribute mean, the attribute mean for all samples belonging to the same class (smarter), or the most probable value determined by inference-based methods such as the Bayesian formula or a decision tree.
Noisy Data
Noise is random error or variance in a measured variable. Incorrect attribute values may be due to faulty data collection instruments, data entry problems, data transmission problems, technology limitations, or inconsistency in naming conventions. Other data problems that require data cleaning include duplicate records, incomplete data and inconsistent data.
Binning by equal-frequency (equi-depth) partitioning divides the range into N intervals, each containing approximately the same number of samples. It gives good data scaling, but managing categorical attributes can be tricky.
Cluster Analysis
Data sets are clustered based on similar characteristics, as shown in Figure 3.22.
Entity Identification Problem
There are two main issues to be considered during data integration:
Schema integration
Object matching
The entity identification problem is the problem of matching real-world entities from different data sources. Metadata plays a main role in overcoming the issues that arise during data integration. Special attention to the structure of the data is needed to overcome problems due to functional dependencies and referential constraints.
Tuple Duplication
In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level.
Example (min-max normalization): Let the income attribute range from $12,000 to $98,000 and be normalized to [0.0, 1.0]. Then $73,000 is mapped to (73,000 − 12,000) / (98,000 − 12,000) = 0.709.
Data reduction
Obtain a reduced representation of the data set that is much smaller in volume but yet
produce the same (or almost the same) analytical results.
Best single features under the feature independence assumption: choose by significance tests.
Best step-wise feature selection:
The best single-feature is picked first
Then next best feature condition to the first, ...
Step-wise feature elimination:
Repeatedly eliminate the worst feature
Each input data vector is a linear combination of the k principal component vectors. Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance. Using the strongest principal components, it is possible to reconstruct a good approximation of the original data, as shown in Figure 3.24.
Limitations
Works for numeric data only
Used when the number of dimensions is large
PART-A
1. Define data.
2. State why the data preprocessing an important issue for data warehousing and data mining.
3. What is the need for discretization in data mining?
4. What are the various forms of data preprocessing?
5. What is concept Hierarchy? Give an example.
6. What are the various forms of data preprocessing?
7. Mention the various tasks to be accomplished as part of data pre-processing.
8. Define Data Mining.
9. List out any four data mining tools.
10. What do data mining functionalities include?
11. Define patterns.
12. Define cluster Analysis
13. What is Outlier Analysis?
14. What makes the pattern interesting?
15. Difference between OLAP and Data mining
16. What do you mean by high performance data mining?
17. What are the Various data mining techniques?
18. What do you mean by predictive data mining?
19. What do you mean by descriptive data mining?
20. What are the steps involved in the data mining process?
21. List the methods available to fill the missing values.
22. Locate the outliers in a sample box plot and explain.
23. List out the major research issues in data mining.
24. The noisy data of the price attribute in a data set is as follows:
4, 8, 15, 21, 21, 24, 25, 28, 34.
How can the noise be removed from the above data? Give the data values after data smoothing is done.
PART-B
1. Explain the various primitives for specifying Data mining Task.
2. Describe the various descriptive statistical measures for data mining.
3. Discuss about different types of data and functionalities.
4. Describe in detail about Interestingness of patterns.
5. Explain in detail about data mining task primitives.
6. Discuss about different Issues of data mining.
7. Explain in detail about data pre-processing.
8. How data mining system are classified? Discuss each classification with an example.
9. How data mining system can be integrated with a data warehouse? Discuss with an example.
10. Explain data mining applications for Biomedical and DNA data analysis
11. Explain data mining applications for financial data analysis.
12. Explain data mining applications for retail industry.
13. Explain data mining applications for Telecommunication industry.
14. Discuss about data integration and data transformation steps in data pre-processing.
15. List and discuss the steps for integrating a data mining system with a data warehouse.
16. Explain the process of measuring the dispersion of data.
UNIT 4
ASSOCIATION RULE MINING AND CLASSIFICATION
Mining Frequent Patterns, Associations and Correlations – Mining Methods – Mining various
Kinds of Association Rules – Correlation Analysis – Constraint Based Association Mining –
Classification and Prediction - Basic concepts - Decision Tree Induction - Bayesian
Classification – Rule Based Classification – Classification by Back propagation – Support
Vector Machines – Associative Classification – Lazy Learners – Other Classification Methods
– prediction.
Frequent pattern: A pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set is called a frequent pattern. For example, a set of items, such as milk and bread, that appear frequently together in a transaction data set is a frequent itemset. Frequent itemset mining is an interesting branch of data mining; the closely related sequence mining task looks at the order of actions or events, for example the order in which we get dressed: shirt first? pants first? socks as the second item, or a second shirt in wintertime?
Basic Concepts
Market basket analysis asks: ―Which groups or sets of items are customers likely to purchase on a given trip to the store?‖ For instance, market basket analysis may help us to
design different store layouts. In one strategy, items that are frequently purchased together can
be placed in proximity to further encourage the combined sale of such items. In an alternative
strategy, placing hardware and software at opposite ends of the store may entice customers
who purchase such items to pick up other items along the way. If we think of the universe as
the set of items available at the store, then each item has a Boolean variable representing the
presence or absence of that item. Each basket can then be represented by a Boolean vector of
values assigned to these variables. The Boolean vectors can be analyzed for buying patterns
that reflect items that are frequently associated or purchased together. These patterns can be
represented in the form of association rules as given below.
support(A=>B) = P(A U B)
confidence(A=>B) = P(B|A)
confidence(A=>B) = P(B|A) = support(A U B) / support(A) = support_count(A U B) / support_count(A)
Support
―The support is the percentage of transactions that demonstrate the rule.‖
Example: Database with transactions ( customer_# : item_a1, item_a2, … )
1: 1, 3, 5.
2: 1, 8, 14, 17, 12.
3: 4, 6, 8, 12, 9, 104.
4: 2, 1, 8.
support {8,12} = 2 (or 50%: 2 of 4 customers)
support {1,5} = 1 (or 25%: 1 of 4 customers)
support {1} = 3 (or 75%: 3 of 4 customers)
Confidence
The confidence is the conditional probability that, given X present in a transaction, Y will also be present.
Confidence measure, by definition:
Confidence(X=>Y) equals support(X,Y) / support(X)
Example: Database with transactions ( customer_# : item_a1, item_a2, … )
1: 3, 5, 8.
2: 2, 6, 8.
3: 1, 4, 7, 10.
4: 3, 8, 10.
5: 2, 5, 8.
6: 1, 5, 6.
7: 4, 5, 6, 8.
8: 2, 3, 4.
9: 1, 5, 7, 8.
10: 3, 8, 9, 10.
Conf ( {5} => {8} ) ?
supp({5}) = 5 , supp({8}) = 7 , supp({5,8}) = 4,
then conf( {5} => {8} ) = 4/5 = 0.8 or 80%
Conf ( {5} => {8} ) = 80%. Now consider Conf ( {8} => {5} ):
supp({5}) = 5 , supp({8}) = 7 , supp({5,8}) = 4,
then conf( {8} => {5} ) = 4/7 = 0.57 or 57%
4.1.3 Association Rules
Every association rule has a support and a confidence. The task is to find all the rules X => Y with minimum support and confidence, where
support, s, is the probability that a transaction contains X ∪ Y
confidence, c, is the conditional probability that a transaction containing X also contains Y
support(A=>B) = P(A U B)
confidence(A=>B) = P(B|A)
The above equation shows that the confidence of rule A => B can be easily derived from the support counts of A and A ∪ B. Once the support counts of A, B, and A ∪ B are found, it is straightforward to derive the corresponding association rules.
Association rule mining can be viewed as a two-step process:
Find all frequent itemsets. (By Min Support count value)
Generate strong association rules from the frequent item sets.
(By Min Support count and confidence value)
Additional interesting measures can be given by correlation analysis.
|D| = 9, min_sup = 2
In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1, as shown below in Figures 4.5 to 4.8.
Figure 4.5 Candidate set 1
C3 = L2 ⋈ L2, where L2 = {{I1,I2},{I1,I3},{I1,I5},{I2,I3},{I2,I4},{I2,I5}}
C3 = {{I1,I2,I3},{I1,I2,I5},{I1,I3,I5},{I2,I3,I4},{I2,I3,I5},{I2,I4,I5}}
The 2-item subsets of {I1,I2,I3} are {I1,I2}, {I1,I3} and {I2,I3}. All 2-item subsets of {I1,I2,I3} are members of L2. Therefore, keep {I1,I2,I3} in C3.
The 2-item subsets of {I1,I2,I5} are {I1,I2}, {I1,I5} and {I2,I5}. All 2-item subsets of {I1,I2,I5} are members of L2. Therefore, keep {I1,I2,I5} in C3.
The 2-item subsets of {I1,I3,I5} are {I1,I3}, {I1,I5} and {I3,I5}. {I3,I5} is not a member of L2, so it is not frequent. Therefore, remove {I1,I3,I5} from C3.
The 2-item subsets of {I2,I3,I4} are {I2,I3}, {I2,I4} and {I3,I4}. {I3,I4} is not a member of L2, so it is not frequent. Therefore, remove {I2,I3,I4} from C3.
The 2-item subsets of {I2,I3,I5} are {I2,I3}, {I2,I5} and {I3,I5}. {I3,I5} is not a member of L2, so it is not frequent. Therefore, remove {I2,I3,I5} from C3.
The 2-item subsets of {I2,I4,I5} are {I2,I4}, {I2,I5} and {I4,I5}. {I4,I5} is not a member of L2, so it is not frequent. Therefore, remove {I2,I4,I5} from C3.
Therefore, C3 = {{I1,I2,I3}, {I1,I2,I5}} after pruning.
PSEUDO-CODE
The Apriori Algorithm (Pseudo-Code)
Ck : candidate itemsets of size k
Lk : frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
Apriori can be improved by several general ideas: reduce the number of passes over the transaction database, shrink the number of candidates, and facilitate the support counting of candidates.
Hash-based technique
A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck,
for k> 1.Hash the items into the different buckets of a hash table structure as shown in Figure
4.11, and increase the corresponding bucket counts.
Transaction reduction
A transaction that does not contain any frequent k-itemsets cannot contain any frequent
(k+1)-itemsets.Therefore, such a transaction can be marked or removed from further
consideration
Partitioning
A partitioning technique can be used that requires just two database scans to mine the
frequent itemsets (Figure 4.12). It consists of two phases. In Phase I, the algorithm subdivides
the transactions of D into n nonoverlapping partitions. If the minimum support threshold for
transactions in D is min sup, then the minimum support count for a partition is min sup × the
number of transactions in that partition. For each partition all frequent itemsets within the
partition are found. These are referred to as local frequent itemsets. The procedure employs a
special data structure that, for each itemset, records the TIDs of the transactions containing the
items in the itemset. This allows it to find all of the local frequent k-itemsets, for k =1,2,..., in
just one scan of the database.
4.2.2 Mining Frequent Itemsets without Candidate Generation (FP-Growth Algorithm)
The FP-growth algorithm mines frequent patterns without candidate generation. It compresses a large database into a compact Frequent-Pattern tree (FP-tree) structure, which is highly condensed but complete for frequent pattern mining, thereby avoiding costly repeated database scans. Figure 4.13 shows the list of items purchased in each transaction; the FP-tree is constructed from this example.
Figure 4.13 Transaction Database
Mining FP tree
The FP-tree is mined as follows. Start from each frequent length-1 pattern (as an initial
suffix pattern), construct its conditional pattern base (a ―sub database,‖ which consists of
the set of prefix paths in the FP-tree co-occurring with the suffix pattern), then construct
its (conditional) FP-tree, and perform mining recursively on such a tree. The pattern
growth is achieved by the concatenation of the suffix pattern with the frequent patterns
generated from a conditional FP-tree.
Mining of the FP-tree is summarized in Figure 4.15 and detailed as follows. We first
consider I5, which is the last item in L, rather than the first. The reason for starting at the
end of the list will become apparent as we explain the FP-tree mining process. I5 occurs
in two branches of the FP-tree of Figure 4.14. (The occurrences of I5 can easily be found
by following its chain of node-links.) The paths formed by these branches are <I2, I1,
I5: 1> and <I2, I1, I3, I5: 1>. Therefore, considering I5 as a suffix, its corresponding two
prefix paths are <I2, I1: 1> and <I2, I1, I3: 1>, which form its conditional pattern base. Its
conditional FP-tree contains only a single path, <I2: 2, I1: 2>; I3 is not included because
its support count of 1 is less than the minimum support count. The single path generates
all the combinations of frequent patterns: {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}.
For I4, its two prefix paths form the conditional pattern base, {{I2 I1: 1}, {I2: 1}},
which generates a single-node conditional FP-tree, <I2: 2>, and derives one frequent pattern,
<I2, I1: 2>. Notice that although I5 follows I4 in the first branch, there is no need to include I5
in the analysis here because any frequent pattern involving I5 is analyzed in the examination of
I5. Similar to the above analysis, I3‘s conditional pattern base is {{I2, I1: 2}, {I2: 2}, {I1: 2}}.
Its conditional FP-tree has two branches, {I2: 4, I1: 2} and {I1: 2}, as shown in Figure 4.16,
which generates the set of patterns, {{I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}}. Finally, I1‘s
conditional pattern base is {{I2: 4}}, whose FP-tree contains only one node, {I2: 4}, which
generates one frequent pattern, {I2, I1: 4}.
The FP-growth method transforms the problem of finding long frequent patterns to
searching for shorter ones recursively and then concatenating the suffix. It uses the least
frequent items as a suffix, offering good selectivity. The method substantially reduces the
search costs. When the database is large, it is sometimes unrealistic to construct a main
memory based FP-tree. An interesting alternative is to first partition the database into a set of
projected databases, and then construct an FP-tree and mine it in each projected database.
Such a process can be recursively applied to any projected database if its FP-tree still
cannot fit in main memory.
A study on the performance of the FP-growth method shows that it is efficient and
scalable for mining both long and short frequent patterns, and is about an order of magnitude
faster than the Apriori algorithm. It is also faster than a Tree-Projection algorithm,
which recursively projects a database into a tree of projected databases.
Divide-and-conquer: decompose both the mining task and the database according to the frequent patterns obtained so far; this leads to a focused search of smaller databases.
No candidate generation, no candidate test
Compressed database: FP-tree structure
No repeated scan of entire database
Basic ops: counting local freq items and building sub FP-tree, no pattern search and
matching
Many association rules so generated are still not interesting to the users. This is especially
true when mining at low support thresholds or mining for long patterns. This has been one of
the major bottlenecks for successful application of association rule mining. Whether or not a
rule is interesting can be assessed either subjectively or objectively.
buys(X, "computer games") => buys (X, "videos") [support = 40%, confidence =60%]
From Association Analysis to Correlation Analysis
A => B [support, confidence, correlation].
Measures for correlation analysis are:
Calculating the lift value
Calculating the χ2 value
Lift is a simple correlation measure. The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise the two itemsets are dependent and correlated.
lift(A,B) = P(A ∪ B) / (P(A)P(B))
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
The algorithm is called with three parameters: D, attribute list, and Attribute selection
method. (Information gain, gini index, gain ratio). Tree starts as a single node, N, representing
the training tuples in D. If the tuples in D are all of the same class, then node N becomes a leaf
and is labeled with that class. Otherwise, the algorithm calls Attribute selection method to
determine the splitting criterion.
The splitting criterion tells us which attribute to test at node N. Let A be the splitting attribute; A has v distinct values, {a1, a2, ..., av}, based on the training data. If A is discrete-valued, the outcomes of the test at node N correspond directly to the known values of A. If A is continuous-valued, the test at node N has two possible outcomes, corresponding to the conditions A ≤ split_point and A > split_point, respectively. If A is discrete-valued and a binary tree must be produced, the test at node N is of the form ―A ∈ SA?‖, as shown in Figure 4.23.
Figure 4.23 Decision Tree Strategy
The recursive partitioning stops only when any one of the following terminating conditions is
true:
All of the tuples in partition D (represented at node N) belong to the same class
There are no remaining attributes on which the tuples may be further partitioned.
There are no tuples for a given branch, that is, a partition Dj is empty
Let D have m distinct classes, Ci (for i = 1, ..., m), and let Ci,D be the set of tuples of class Ci in D. The expected information (entropy) needed to classify a tuple in D is
Info(D) = -Σ(i=1..m) pi log2(pi), where pi = |Ci,D| / |D|.
Now, suppose we were to partition the tuples in D on some attribute A having v distinct values, {a1, a2, ..., av}, as observed from the training data. If A is discrete-valued, these values correspond directly to the v outcomes of a test on A. Attribute A can be used to split D into v partitions or subsets, {D1, D2, ..., Dv}, where Dj contains those tuples in D that have outcome aj of A. These partitions would correspond to the branches grown from node N. Ideally, we would
like this partitioning to produce an exact classification.
InfoA(D) = Σ(j=1..v) (|Dj| / |D|) × Info(Dj). The information gain of attribute A is Gain(A) = Info(D) − InfoA(D).
In this example shown in figure 4.24, each attribute is discrete-valued. The class label
attribute, buys computer, has two distinct values (namely, yes, no); Therefore, there are two
distinct classes (that is, m = 2). Let class C1 correspond to yes and class C2 correspond to no. There are nine tuples of class yes and five tuples of class no. A (root) node N is created for
the tuples in D. To find the splitting criterion for these tuples, we must compute the
information gain of each attribute
Info(D) = -(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits.
Next, we need to compute the expected information requirement for each attribute.Let‘s
start with the attribute age. We need to look at the distribution of yes and no tuples for each
category of age.
For the age category youth, there are two yes tuples and three no tuples.
For the category middle aged, there are four yes tuples and zero no tuples.
For the category senior, there are three yes tuples and two no tuples
Info_age(D) = (5/14) × (-(2/5) log2(2/5) − (3/5) log2(3/5)) + (4/14) × (-(4/4) log2(4/4) − (0/4) log2(0/4)) + (5/14) × (-(3/5) log2(3/5) − (2/5) log2(2/5)) = 0.694 bits
Hence Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits.
Gain ratio
Consider an attribute that acts as a unique identifier, such as product ID. A split on
product ID would result in a large number of partitions, each one containing just one tuple.
Because each partition is pure, the information required to classify data set D based on
this partitioning would be Infoproduct ID(D) = 0.
SplitInfoA(D) = -Σ(j=1..v) (|Dj| / |D|) × log2(|Dj| / |D|)
This value represents the potential information generated by splitting the training data
set, D, into v partitions, corresponding to the v outcomes of a test on attribute A.
GainRatio(A) = Gain(A) / SplitInfoA(D)
4.7 Bayesian Classification
Bayesian classification is based on Bayes theorem. Bayes‘ theorem describes the
probability of an event, based on prior knowledge of conditions that might be related to the
event.Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is
independent of the values of the other attributes. This assumption is called class conditional
independence.
Bayes’ Theorem
Let X be a data tuple, considered as ―evidence‖. Let H be a hypothesis, such as that tuple X belongs to a class C. For classification problems we want to find P(H|X), the posterior probability of H conditioned on X; P(X|H) is the posterior probability of X conditioned on H, while P(H) and P(X) are prior probabilities. Bayes' theorem states
P(H|X) = P(X|H) P(H) / P(X)
2. Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.
Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes' theorem,
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = ... = P(Cm), and we would therefore maximize P(X|Ci); otherwise we maximize P(X|Ci)P(Ci).
4. With the class conditional independence assumption,
P(X|Ci) = Π(k=1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)
(a) If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
(b) If Ak is continuous-valued, it is typically assumed to have a Gaussian distribution with mean μ and standard deviation σ.
5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i.
The data tuples are described by the attributes age, income, student, and credit rating. The class
label attribute, buys computer, has two distinct values (namely, {yes, no}). Let C1 correspond to the class buys computer = yes and C2 correspond to buys computer = no. The tuple we wish to classify is X = (age = youth, income = medium, student = yes, credit rating = fair).
P(X|buys computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044.
Similarly, P(X|buys computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019. Using the prior probabilities P(buys computer = yes) = 9/14 = 0.643 and P(buys computer = no) = 5/14 = 0.357, we obtain P(X|C1)P(C1) = 0.044 × 0.643 = 0.028 and P(X|C2)P(C2) = 0.019 × 0.357 = 0.007.
Therefore, the naïve Bayesian classifier predicts buys computer = yes for tuple X.
4.8 Rule Based Classification
Using IF-THEN Rules for Classification. Learned model is represented as a set of IF-
THEN rules. IF condition THEN conclusion. An example is rule R1 as given below.
R1: IF age = youth AND student = yes THEN buys computer = yes.
The ―IF‖-part (or left-hand side)of a rule is known as the rule antecedent or
precondition.The ―THEN‖-part (or right-hand side) is the rule consequent.
A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a class-labeled data set, D,
Coverage(R) = ncovers / |D|
Accuracy(R) = ncorrect / ncovers
where ncovers is the number of tuples covered by R, ncorrect is the number of tuples correctly classified by R, and |D| is the number of tuples in D.
If a rule is satisfied by X, the rule is said to be triggered. For example, suppose X satisfies R1; the rule is triggered and, if R1 is the only rule satisfied, the rule fires by returning the class prediction buys computer = yes for X. If more than one rule is triggered, we have a problem: what if the triggered rules each specify a different class? And what if no rule is satisfied by X? When more than one rule is triggered, we need a conflict resolution strategy to figure out which rule gets to fire and assign its class prediction to X. Strategies include:
Size ordering
Rule ordering
Class based ordering
Rule-based ordering
To extract rules from a decision tree, one rule is created for each path from the root to a
leaf node. Each splitting criterion along a given path is logically ANDed to form the rule
antecedent (―IF‖ part). The leaf node holds the class prediction, forming the rule consequent
(―THEN‖ part) shown in figure 4.29
Information gain is the expected information needed to classify a tuple in a data set D. Here, D is the set of tuples covered by the rule condition and pi is the probability of class Ci in D. The lower the entropy, the better the condition is. Entropy prefers conditions that cover a large number of tuples of a single class and few tuples of other classes.
4.9 Classification by Back Propagation
Back propagation is a neural network learning algorithm. Neural network is a set of
connected input/output unit in which each connection has a weight associated with it. During
the learning phase, the network learns by adjusting the weights so as to be able to predict the
correct class label of the input tuples. Neural network has poor interpretability.
Network Topology
Before training can begin, the user must decide on the network topology by specifying
the number of units in the input layer, the number of hidden layers (if more than one), the
number of units in each hidden layer, and the number of units in the output layer.
To compute the error of a hidden layer unit j, the weighted sum of the errors of the units connected to unit j in the next layer is considered. The error of a hidden layer unit j is
Errj = Oj (1 − Oj) Σk Errk wjk
where wjk is the weight of the connection from unit j to a unit k in the next higher layer, Oj is the output of unit j, and Errk is the error of unit k. The weights and biases are updated to reflect the propagated errors. Weights are updated by the following equations, where Δwij is the change in weight wij and l is the learning rate:
Δwij = (l) Errj Oi
wij = wij + Δwij
4.10 Support Vector Machines
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a
separating hyper plane. In other words, given labeled training data (supervised learning), the
algorithm outputs an optimal hyperplane which categorizes new examples. Support vector
machines are supervised learning models with associated learning algorithms that analyze data
used for classification and regression analysis. Given a set of training examples, each marked
for belonging to one of two categories, an SVM training algorithm builds a model that assigns
new examples into one category or the other. An SVM model is a representation of the
examples as points in space, mapped so that the examples of the separate categories are
divided by a clear gap that is as wide as possible represented in figure 4.33.
A line is bad if it passes too close to the points because it will be noise sensitive and it
will not generalize correctly. Therefore, our goal should be to find the line passing as far as
possible from all points. Then, the operation of the SVM algorithm is based on finding the
hyperplane that gives the largest minimum distance to the training examples. Twice, this
distance receives the important name of margin within SVM‘s theory. Therefore, the optimal
separating hyperplane maximizes the margin of the training data as mentioned in figure 4.34
Figure 4.34: finding maximum margin
W · X + b = 0
where W = {w1, w2, ..., wn} is a weight vector and b is a scalar (bias), and X = (x1, x2) holds the values of the attributes A1 and A2. The separating hyperplane can then be written as w0 + w1x1 + w2x2 = 0; points lying above the hyperplane satisfy w0 + w1x1 + w2x2 > 0, and points lying below it satisfy w0 + w1x1 + w2x2 < 0.
4.11 Associative Classification
Let p be an attribute-value pair of the form (Ai, v), where Ai is an attribute taking the value v. A data tuple X = (x1, x2, ..., xn) satisfies an item, p = (Ai, v), if and only if xi = v, where xi is the value of the ith attribute of X. Association rules can have any number of items in the rule antecedent, while association rules for classification have a single class label in the rule consequent; they are of the form
p1 ∧ p2 ∧ ... ∧ pl => Aclass = C
For a given rule, R, the percentage of tuples in D satisfying the rule antecedent that also have
the class label C is called the confidence of R. Methods of associative classification differ
primarily in the approach used for frequent itemset mining and in how the derived rules are
analyzed and used for classification.
Three methods used in associative classification are:
CBA – Classification-Based Association
CMAR – Classification based on Multiple Association Rules
CPAR – Classification based on Predictive Association Rules
CBA
CBA uses iterative approach similar to apriori. Multiple passes are made over the data
and number of passes made is equal to the length of the longest rule. Complete set of rules
satisfying minimum confidence and support thresholds are found and then analyzed for
inclusion in the classifier.CBA construct the classifier (Decision list)., where the rules are
organized according to decreasing precedence based on their confidence and support.
CMAR
CMAR employs another tree structure to store and retrieve rules efficiently and to
prune rules based on confidence, correlation and data base coverage. Rule pruning strategies
are triggered whenever a rule is inserted into the tree. For example, given two rules, R1 and
R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then R2 is pruned. The rationale is that highly specialized rules with low confidence can be pruned if a more generalized version with higher confidence exists. CMAR also prunes rules for which the rule antecedent and class are not positively correlated, based on a χ2 test of statistical significance.
CPAR
CPAR employs a different multiple rule strategy than CMAR. If more than one rule
satisfies a new tuple, X, the rules are divided into groups according to class, similar to CMAR.
However, CPAR uses the best k rules of each group to predict the class label of X, based on
expected accuracy. By considering the best k rules rather than all of the rules of a group, it
avoids the influence of lower ranked rules. The accuracy of CPAR on numerous data sets was
shown to be close to that of CMAR. However, since CPAR generates far fewer rules than
CMAR, it shows much better efficiency with large sets of training data.
4.12 Lazy Learners
Eager Learners – will construct a classification model before receiving new tuples to
classify. Lazy Learners - simply store the training tuple and waits until it is given a test tuple.
Unlike eager learning methods, lazy learners do less work when a training tuple is presented
and more work when making a classification or prediction; hence, lazy learners can be computationally expensive when classifying. Examples of lazy learners are k-nearest-neighbor classifiers and case-based reasoning classifiers.
dist(X1, X2) = sqrt( Σ(i=1..n) (x1i − x2i)² )
Min-max normalization, for example, can be used to transform a value v of a numeric attribute
A to v` in the range [0, 1] by computing
v' = (v − minA) / (maxA − minA)
4.13 Prediction
Numeric prediction is the task of predicting continuous (or ordered) values for a given input; here it is a continuous value, rather than a categorical label, that is predicted. Regression analysis can be used to model the relationship between one or more independent (predictor) variables and a dependent (response) variable.
4.13.1 Linear Regression
Straight-line regression analysis involves a response variable, y, and a single predictor variable, x. It is the simplest form of regression, and models y as a linear function of x:
y = b + wx, where b and w are regression coefficients.
The above expression can be rewritten using weight values as
y = w0 + w1x
The training set contains |D| data points of the form (x1, y1), (x2, y2), ..., (x|D|, y|D|). The regression coefficients can be estimated by the method of least squares:
w1 = Σ(i=1..|D|) (xi − x̄)(yi − ȳ) / Σ(i=1..|D|) (xi − x̄)², and w0 = ȳ − w1 x̄.
Example
Straight-line regression using the method of least squares: Figure 4.36 shows a set of paired data where x is the number of years of work experience of a college graduate and y is the corresponding salary.
Figure 4.36 Training Dataset
Fitting these |D| data points by the method of least squares gives
y = 23.6 + 3.5x.
Multiple linear regression is an extension of straight-line regression involving more than one predictor variable, for example
y = w0 + w1x1 + w2x2
We can model data that does not show a linear dependence by nonlinear regression. For
example, what if a given response variable and predictor variable have a relationship that may
be modeled by a polynomial function? Polynomial regression is often of interest when there is
just one predictor variable. It can be modeled by adding polynomial terms to the basic linear
model.
By applying transformations to the variables, we can convert the nonlinear model into a
linear one that can then be solved by the method of least squares.
y = w0 + w1x + w2x² + w3x³
Defining the new variables x1 = x, x2 = x², x3 = x³, the model becomes the linear form
y = w0 + w1x1 + w2x2 + w3x3
TID   Items
2     Bread, Jam
3     Bread
4     Bread, Jam
5     Bread, Milk
7     Bread, Jam
PART-B
1. Decision tree induction is a popular classification method. Taking one typical decision tree
induction algorithm , briefly outline the method of decision tree classification.
2. Consider the following training dataset and the original decision tree induction algorithm (ID3).
Risk is the class label attribute. The Height values have been already discredited into disjoint
ranges. Calculate the information gain if Gender is chosen as the test attribute. Calculate the
information gain if Height is chosen as the test attribute. Draw the final decision tree (without
any pruning) for the training dataset. Generate all the “IF-THEN rules from the decision tree.
Gender Height Risk
F (1.5, 1.6) Low
M (1.9, 2.0) High
F (1.8, 1.9) Medium
F (1.8, 1.9) Medium
F (1.6, 1.7) Low
M (1.8, 1.9) Medium
F (1.5, 1.6) Low
M (1.6, 1.7) Low
M (2.0, ∞) High
M (2.0, ∞) High
F (1.7, 1.8) Medium
M (1.9, 2.0) Medium
F (1.8, 1.9) Medium
F (1.7, 1.8) Medium
F (1.7, 1.8) Medium
(ii) Find all the association rules that involve only B, C, H (on either the left or right hand side of the rule). The minimum confidence is 70%.
3. Describe the multi-dimensional association rule, giving a suitable example.
4. Explain the algorithm for constructing a decision tree from training samples
5. Explain Bayes theorem.
6. Develop an algorithm for classification using Bayesian classification. Illustrate
the algorithm with a relevant example.
7. Discuss the approaches for mining multi-level association rules from the
transactional databases. Give relevant example.
8. Write and explain the algorithm for mining frequent item sets without candidate
generation. Give relevant example.
9. How attribute is oriented induction implemented? Explain in detail.
10. Discuss in detail about Bayesian classification.
11. Write and explain the algorithm for mining frequent item sets without
candidate generation with an example.
12. A database given below has nine transactions. Let min_sup = 30 %
TID List of Items IDs
1 a, b, e
2 b,d
3 b, c
4 a, b, d
5 a, c
6 b, c
7 a, c
8 a, b, c ,e
9 a, b, c
Apply apriori algorithm to find all frequent item sets.
13. With an example explain various attribute selection measures in classification.
14. Discuss the steps involved in the working of following classifiers.
(i) Bayesian classifier.
(ii) Back Propagation algorithm.
15. Apply the Apriori algorithm for discovering frequent item sets to the following dataset.
Use 0.3 for the minimum support value. Illustrate each step of Apriori algorithm.
16. Construct a decision tree classifier by applying ID3 algorithm to the following dataset.
17. Predict a class label for X using naïve Bayesian classification algorithm.
X = { Color = Red, Type = SUV, Origin = Domestic }
Use the following training data set.
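Referring back to question 12 above, the following Python sketch illustrates the Apriori
level-wise search on that nine-transaction database. It is only a simplified sketch: candidate
generation joins all pairs of frequent (k−1)-itemsets rather than using the full prefix-join and
prune steps, and min_sup = 30% of 9 transactions is taken as a support count of at least 3.

transactions = [
    {"a", "b", "e"}, {"b", "d"}, {"b", "c"}, {"a", "b", "d"}, {"a", "c"},
    {"b", "c"}, {"a", "c"}, {"a", "b", "c", "e"}, {"a", "b", "c"},
]
min_count = 3  # 30% of 9 transactions, rounded up

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# L1: frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_count}]

# Level-wise search: join frequent (k-1)-itemsets, keep candidates meeting min_count
k = 2
while frequent[-1]:
    prev = frequent[-1]
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    frequent.append({c for c in candidates if support(c) >= min_count})
    k += 1

for level, sets in enumerate(frequent[:-1], start=1):
    print(f"L{level}:", [sorted(s) for s in sets])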
UNIT V
5. CLUSTER ANALYSIS
5.1 Introduction
A cluster is a collection of data objects. Objects in a cluster are similar to one another
within the same cluster and dissimilar to the objects in other clusters. Cluster analysis is the
process of finding similarities between data according to the characteristics found in the data
and grouping similar data objects into clusters.
A cluster of data objects can be treated collectively as one group and so may be considered
as a form of data compression. Although classification is an effective means for distinguishing
groups or classes of objects, it requires the often costly collection and labeling of a large set of
training tuples or patterns, which the classifier uses to model each group. It is often more
desirable to proceed in the reverse direction: First partition the set of data into groups based on
data similarity (e.g., using clustering), and then assign labels to the relatively small number of
groups. Additional advantages of such a clustering-based process are that it is adaptable to
changes and helps single out useful features that distinguish different groups. By automated
clustering, we can identify dense and sparse regions in object space and, therefore, discover
overall distribution patterns and interesting correlations among data attributes. Clustering can
also be used for outlier detection, where outliers (values that are ―far away‖ from any cluster)
may be more interesting than common cases.
Applications of outlier detection include the detection of credit card fraud and the monitoring
of criminal activities in
electronic commerce. For example, exceptional cases in credit card transactions, such as very
expensive and frequent purchases, may be of interest as possible fraudulent activity.
5.1.1 Applications of Clustering
The process of clustering has various applications as listed below:
Pattern Recognition
Spatial Data Analysis
Create thematic maps in GIS by clustering feature spaces
Detect spatial clusters or for other spatial mining tasks
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar access patterns.
5.1.2 Examples of Clustering Applications
Few examples of clustering are listed below:
Marketing: Help marketers discover distinct groups in their customer bases, and then
use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average
claim cost
City-planning: Identifying groups of houses according to their house type, value, and
geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continent
faults.
Different types of data often occur in cluster analysis, and these data need to be
preprocessed before cluster analysis can be applied.
Clustering algorithms typically operate on either of the following two data structures.
Data matrix
This represents n objects, such as persons, with p variables (also called measurements or
attributes), such as age, height, weight, gender, and so on. The structure is in the form of a
relational table, or n-by-p matrix (n objects × p variables).
Dissimilarity matrix
This stores a collection of proximities that are available for all pairs of n objects. It is often
represented by an n-by-n table:
0
d(2,1) 0
d(3,1) d(3,2) 0
... ... ...
d(n,1) d(n,2) ... 0
where d(i, j) is the measured difference or dissimilarity between objects i and j. In general, d(i,
j) is a nonnegative number that is close to 0 when objects i and j are highly similar or ―near‖
each other, and becomes larger the more they differ.
To standardize measurements for an interval-scaled variable f, one common choice is the mean
absolute deviation,
sf = (1/n) ( |x1f − mf| + |x2f − mf| + . . . + |xnf − mf| ),
where mf is the mean value of f, and the standardized measurement (z-score)
zif = (xif − mf) / sf.
The mean absolute deviation, sf, is more robust to outliers than the standard deviation, σf.
When computing the mean absolute deviation, the deviations from the mean are not squared;
hence, the effect of outliers is somewhat reduced. There are more robust measures of
dispersion, such as the median absolute deviation.
After standardization, or without standardization in certain applications, the
dissimilarity (or similarity) between the objects described by interval-scaled variables is
typically computed based on the distance between each pair of objects. The most popular
distance measure is Euclidean distance, which is defined as
d(i, j) = sqrt( (xi1 − xj1)^2 + (xi2 − xj2)^2 + . . . + (xip − xjp)^2 ).
Another well-known metric is Manhattan (or city block) distance, defined as
d(i, j) = |xi1 − xj1| + |xi2 − xj2| + . . . + |xip − xjp|.
Both the Euclidean distance and Manhattan distance satisfy the following mathematic
requirements of a distance function:
1. d(i, j) ≥ 0: Distance is a nonnegative number.
2. d(i, i) = 0: The distance of an object to itself is 0.
3. d(i, j) = d( j, i): Distance is a symmetric function.
4. d(i, j) ≤ d(i, h) + d(h, j): Going directly from object i to object j in space costs no more
than making a detour over any other object h (triangle inequality).
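A small NumPy sketch of the two distance computations (the two objects below are hypothetical):

import numpy as np

# Two hypothetical objects described by p = 3 interval-scaled variables
xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((xi - xj) ** 2))   # square root of summed squared differences
manhattan = np.sum(np.abs(xi - xj))           # sum of absolute differences

print(euclidean)   # 5.0 for this pair
print(manhattan)   # 7.0 for this pair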
A binary variable is asymmetric if the outcomes of the states are not equally important, such as
the positive and negative outcomes of a disease test. Given two asymmetric binary variables,
the agreement of two 1s (a positive match) is then considered more significant than that of two
0s (a negative match). Therefore, such binary variables are often considered ―monary‖ (as if
having one state). The dissimilarity based on such variables is called asymmetric binary
dissimilarity, where the number of negative matches, t, is considered unimportant and thus is
ignored in the computation:
d(i, j) = (r + s) / (q + r + s),
where q is the number of variables that equal 1 for both objects, r is the number that equal 1
for object i but 0 for object j, and s is the number that equal 0 for object i but 1 for object j.
For categorical (nominal) variables, dissimilarity can be computed in a similar spirit as
d(i, j) = (p − m) / p,
where m is the number of matches (i.e., the number of variables for which i and j are in the
same state), and p is the total number of variables. Weights can be assigned to increase the
effect of m or to assign greater weight to the matches in variables having a larger number of
states.
A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an
exponential scale, approximately following a formula such as
Ae^(Bt) or Ae^(−Bt),
where A and B are positive constants, and t typically represents time. Common examples
include the growth of a bacteria population or the decay of a radioactive element.
There are three methods to handle ratio scaled variables:
Treat ratio-scaled variables like interval-scaled variables.
Apply a logarithmic transformation to a ratio-scaled variable f having value xif for object
i by using the formula yif = log(xif); yif can then be treated as an interval-valued variable.
Treat xif as continuous ordinal data and treat their ranks as interval-valued.
For vector objects, a commonly used similarity measure is the cosine measure,
s(x, y) = (x^t · y) / ( ||x|| ||y|| ),
where x^t is a transposition of vector x, ||x|| is the Euclidean norm of vector x, ||y|| is the
Euclidean norm of vector y, and s is essentially the cosine of the angle between vectors x and
y. This value is invariant to rotation and dilation, but it is not invariant to translation and
general linear transformation.
The k-means algorithm typically uses the square-error criterion,
E = Σ (i = 1 to k) Σ (p ∈ Ci) |p − mi|^2,
where E is the sum of the square error for all objects in the data set; p is the point in space
representing a given object; and mi is the mean of cluster Ci (both p and mi are
multidimensional). In other words, for each object in each cluster, the distance from the object
to its cluster center is squared, and the distances are summed. This criterion tries to make the
resulting k clusters as compact and as separate as possible. The k-means procedure is
summarized as follows.
Algorithm:
k-means: The k-means algorithm for partitioning, where each cluster‘s center is represented
by the mean value of the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
1) arbitrarily choose k objects from D as the initial cluster centers;
2) repeat
3) (re)assign each object to the cluster to which the object is the most similar, based on the
mean value of the objects in the cluster;
4) update the cluster means, i.e., calculate the mean value of the objects for each cluster;
5) until no change;
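A minimal NumPy sketch of this procedure (the synthetic two-blob data and k = 2 are
illustrative assumptions, not part of the text):

import numpy as np

def k_means(D, k, max_iter=100, seed=0):
    """Plain k-means: assign objects to the nearest mean, then recompute the means."""
    rng = np.random.default_rng(seed)
    centers = D[rng.choice(len(D), size=k, replace=False)]  # step 1: arbitrary initial centers
    for _ in range(max_iter):
        # step 3: (re)assign each object to the closest cluster center
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: update the cluster means
        new_centers = np.array([D[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # step 5: until no change
            break
        centers = new_centers
    return labels, centers

D = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5.0])
labels, centers = k_means(D, k=2)
print(centers)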
The k-medoids method instead minimizes the sum of the absolute error,
E = Σ (j = 1 to k) Σ (p ∈ Cj) |p − oj|,
where E is the sum of the absolute error for all objects in the data set; p is the point in space
representing a given object in cluster Cj; and oj is the representative object of Cj.
The initial representative objects (or seeds) are chosen arbitrarily. The iterative
process of replacing representative objects by non representative objects continues as long as
the quality of the resulting clustering is improved. This quality is estimated using a cost
function that measures the average dissimilarity between an object and the representative
object of its cluster. To determine whether a nonrepresentative object, Orandom, is a good
replacement for a current representative object, oj, the following four cases are examined for
each of the nonrepresentative objects, p, as illustrated in Figure 5.2.
Figure 5.2 Four cases of the cost function for k-medoids clustering.
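To make the cost comparison concrete, the following small sketch (hypothetical NumPy arrays;
this is only the absolute-error cost E used to evaluate a swap, not the full PAM algorithm)
compares the cost of the current medoid set with the cost after swapping one medoid for a
nonrepresentative object:

import numpy as np

def total_cost(D, medoids):
    """Sum of distances from every object to its nearest medoid (absolute error E)."""
    dists = np.linalg.norm(D[:, None, :] - D[medoids][None, :, :], axis=2)
    return dists.min(axis=1).sum()

D = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
current = [0, 2]        # indices of current representative objects
candidate = [0, 3]      # swap medoid 2 for the nonrepresentative object 3

# The swap is accepted only if it lowers the total cost E
print(total_cost(D, current), total_cost(D, candidate))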
Agglomerative hierarchical clustering: This bottom-up strategy starts by placing each object
in its own cluster and then merges these atomic clusters into larger and larger clusters, until all
of the objects are in a single cluster or until certain termination conditions are satisfied. Most
hierarchical clustering methods belong to this category. They differ only in their definition of
intercluster similarity.
Divisive hierarchical clustering: This top-down strategy does the reverse of agglomerative
hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into
smaller and smaller pieces, until each object forms a cluster on its own or until it satisfies
certain termination conditions, such as a desired number of clusters is obtained or the diameter
of each cluster is within a certain threshold.
Figure 5.3 Agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e}
In DIANA, all of the objects are used to form one initial cluster. The cluster is split
according to some principle, such as the maximum Euclidean distance between the closest
neighboring objects in the cluster. The cluster splitting process repeats until, eventually, each
new cluster contains only a single object. In either agglomerative or divisive hierarchical
clustering, the user can specify the desired number of clusters as a termination condition.
A tree structure called a dendrogram is commonly used to represent the process of
hierarchical clustering. It shows how objects are grouped together step by step. Figure 5.4
shows a dendrogram for the five objects presented in Figure 5.3, where l = 0 shows the five
objects as singleton clusters at level 0. At l = 1, objects a and b are grouped together to form
the first cluster, and they stay together at all subsequent levels. We can also use a vertical axis
to show the similarity scale between clusters. For example, when the similarity of two groups
of objects, {a, b} and {c, d, e} is roughly 0.16, they are merged together to form a single
cluster.
Figure 5.4 Dendrogram
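Assuming SciPy and Matplotlib are available, an agglomerative (single-linkage) clustering of a
few hypothetical 2-D points and its dendrogram can be sketched as follows; the point values
and labels are illustrative only:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Five hypothetical objects, loosely playing the role of {a, b, c, d, e}
points = np.array([[0.0, 0.0], [0.3, 0.1], [3.0, 3.0], [3.2, 3.1], [3.4, 2.8]])

Z = linkage(points, method="single")      # bottom-up merges by minimum distance
dendrogram(Z, labels=["a", "b", "c", "d", "e"])
plt.ylabel("merge distance")
plt.show()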
5.5.3 Distance Measures between clusters
Four widely used measures for distance between clusters are as follows, where |p − p'| is the
distance between two objects or points, p and p'; mi is the mean for cluster Ci; and ni is the
number of objects in Ci:
Minimum distance: dmin(Ci, Cj) = min |p − p'|, over p in Ci and p' in Cj
Maximum distance: dmax(Ci, Cj) = max |p − p'|, over p in Ci and p' in Cj
Mean distance: dmean(Ci, Cj) = |mi − mj|
Average distance: davg(Ci, Cj) = (1 / (ni nj)) Σ (p ∈ Ci) Σ (p' ∈ Cj) |p − p'|
When an algorithm uses the minimum distance, d min(Ci, Cj), to measure the distance
between clusters, it is sometimes called a nearest-neighbor clustering algorithm. Moreover, if
the clustering process is terminated when the distance between nearest clusters exceeds an
arbitrary threshold, it is called a single-linkage algorithm.
When an algorithm uses the maximum distance, dmax(Ci, Cj), to measure the distance
between clusters, it is sometimes called a farthest-neighbor clustering algorithm. If the
clustering process is terminated when the maximum distance between nearest clusters exceeds
an arbitrary threshold, it is called a complete-linkage algorithm.
The use of mean or average distance is a compromise between the minimum and
maximum distances and overcomes the outlier sensitivity problem. Whereas the mean distance
is the simplest to compute, the average distance is advantageous in that it can handle categoric
as well as numeric data, since the mean vector for categoric data can be difficult or
impossible to define.
5.6.1 DBSCAN
DBSCAN is a density based clustering algorithm. The algorithm grows regions with
sufficiently high density into clusters and discovers clusters of arbitrary shape in spatial
databases with noise. It defines a cluster as a maximal set of density-connected points. The
basic ideas in working of density-based clustering are as follows.
Let x and y be objects or points in F^d, a d-dimensional input space. The influence function
of data object y on x is a function f_B^y : F^d → R0+, which is defined in terms of a basic
influence function fB:
f_B^y(x) = fB(x, y).
This reflects the impact of y on x. In principle, the influence function can be an
arbitrary function that can be determined by the distance between two objects in a
neighborhood.
The distance function, d(x, y), should be reflexive and symmetric, such as the Euclidean
distance function. It can be used to compute a square wave influence function,
fSquare(x, y) = 0 if d(x, y) > σ, and 1 otherwise,
for some threshold σ.
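As a sketch of density-based clustering in practice, scikit-learn's DBSCAN implementation can
be applied to hypothetical data; the eps and min_samples values below are illustrative choices,
not prescribed by the text:

import numpy as np
from sklearn.cluster import DBSCAN

# Two hypothetical dense blobs plus one far-away noise point
X = np.vstack([np.random.randn(30, 2) * 0.2,
               np.random.randn(30, 2) * 0.2 + 3.0,
               [[10.0, 10.0]]])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids; -1 marks points treated as noise/outliers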
WaveCluster Working:
A wavelet transform is a signal processing technique that decomposes a signal into
different frequency subbands. The wavelet model can be applied to d-dimensional signals by
applying a one-dimensional wavelet transform d times. In applying a wavelet transform, data
are transformed so as to preserve the relative distance between objects at different levels of
resolution. This allows the natural clusters in the data to become more distinguishable. Clusters
can then be identified by searching for dense regions in the new domain.
Figure 5.9 Multiresolution of a feature space at (a) scale 1 (high resolution); (b) scale 2
(medium resolution); and (c) scale 3 (low resolution).
Wavelet-based clustering is very fast, with a computational complexity of O(n), where n is the
number of objects in the database. The algorithm implementation can be made parallel.
WaveCluster is a grid-based and density-based algorithm. It conforms with many of the
requirements of a good clustering algorithm: It handles large data sets efficiently, discovers
clusters with arbitrary shape, successfully handles outliers, is insensitive to the order of input,
and does not require the specification of input parameters such as the number of clusters or a
neighborhood radius.
Fig 5.10 Each cluster can be represented by a probability distribution, centered at a mean, and
with a standard deviation.
1. Make an initial guess of the parameter vector: This involves randomly selecting k
objects to represent the cluster means or centers (as in k-means partitioning), as well as
making guesses for the additional parameters.
2. Iteratively refine the parameters (or clusters) based on the following two steps:
(a) Expectation Step: Assign each object xi to cluster Ck with the probability
P(xi ∈ Ck) = p(Ck | xi) = p(Ck) p(xi | Ck) / p(xi).
(b) Maximization Step: Use the probability estimates from the expectation step to re-estimate
(or refine) the model parameters. This step is the ―maximization‖ of the likelihood of the
distributions given the data.
The EM algorithm is simple and easy to implement. In practice, it converges fast but may not
reach the global optima. Convergence is guaranteed for certain forms of optimization
functions. The computational complexity is linear in d (the number of input features), n (the
number of objects), and t (the number of iterations).
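Assuming scikit-learn is available, the EM procedure for a mixture of Gaussians can be
sketched as follows; the synthetic one-dimensional data and k = 2 components are illustrative
assumptions:

import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 1-D data drawn from two hypothetical Gaussian clusters
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 1.0, 200),
                    rng.normal(5.0, 0.5, 200)]).reshape(-1, 1)

# GaussianMixture runs EM: expectation (soft assignment) then maximization (re-estimate parameters)
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.means_.ravel())      # estimated cluster means
print(gm.predict(X[:5]))      # hard cluster labels for the first few objects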
5.8.2 Conceptual Clustering
Conceptual clustering is a form of clustering in machine learning that, given a set of
unlabeled objects, produces a classification scheme over the objects. Unlike conventional
clustering, which primarily identifies groups of like objects, conceptual clustering goes one
step further by also finding characteristic descriptions for each group, where each group
represents a concept or class. Hence, conceptual clustering is a two-step process: clustering is
performed first, followed by characterization. Here, clustering quality is not solely a function
of the individual objects. Rather, it incorporates factors such as the generality and simplicity of
the derived concept descriptions. Most methods of conceptual clustering adopt a statistical
approach that uses probability measurements in determining the concepts or clusters.
Probabilistic descriptions
are typically used to represent each derived concept. COBWEB is a popular and simple method
of incremental conceptual clustering. Its input objects are described by categorical attribute-
value pairs. COBWEB creates a hierarchical clustering in the form of a classification tree.
Classification Tree and Decision Tree:
Following figure 5.11 shows a classification tree for a set of animal data.
A classification tree differs from a decision tree. Each node in a classification tree
refers to a concept and contains a probabilistic description of that concept, which summarizes
the objects classified under the node. The probabilistic description includes the probability of
the concept and conditional probabilities of the form P(Ai = vij | Ck), where Ai = vij is an
attribute-value pair (that is, the ith attribute takes its jth possible value) and Ck is the concept
class. (Counts are accumulated and stored at each node for computation of the probabilities.)
This is unlike decision trees, which label branches rather than nodes and use logical rather than
probabilistic descriptors. The sibling nodes at a given level of a classification tree are said to
form a partition. To classify an object using a classification tree, a partial matching function is
employed to descend the tree along a path of ―best‖ matching nodes.
COBWEB uses a heuristic evaluation measure called category utility to guide construction of
the tree. Category utility (CU) is defined as
CU = (1/n) Σ (k = 1 to n) P(Ck) [ Σi Σj P(Ai = vij | Ck)^2 − Σi Σj P(Ai = vij)^2 ],
where n is the number of nodes, concepts, or ―categories‖ forming a partition, {C1, C2, . . .,
Cn}, at the given level of the tree. Category utility rewards intraclass similarity and interclass
dissimilarity, where:
Intraclass similarity is the probability P(Ai = vij | Ck). The larger this value is, the greater the
proportion of class members that share this attribute-value pair and the more predictable the
pair is of class members.
Interclass dissimilarity is the probability P(Ck | Ai = vij). The larger this value is, the fewer the
objects in contrasting classes that share this attribute-value pair and the more predictive the
pair is of the class.
COBWEB Working:
COBWEB incrementally incorporates objects into a classification tree.
―Given a new object, how does COBWEB decide where to incorporate it into the classification
tree?‖ COBWEB descends the tree along an appropriate path, updating counts along the way,
in search of the ―best host‖ or node at which to classify the object. This decision is based on
temporarily placing the object in each node and computing the category utility of the resulting
partition. The placement that results in the highest category utility should be a good host for the
object.
COBWEB computes the category utility of the partition that would result if a new
node were to be created for the object. This is compared to the above computation based on the
existing nodes. The object is then placed in an existing class, or a new class is created for it,
based on the partition with the highest category utility value. Notice that COBWEB has the
ability to automatically adjust the number of classes in a partition. It does not need to rely on
the user to provide such an input parameter.
The two operators mentioned above are highly sensitive to the input order of the
objects. COBWEB has two additional operators that help make it less sensitive to input order.
These are merging and splitting. When an object is incorporated, the two best hosts are
considered for merging into a single class. Furthermore, COBWEB considers splitting the
children of the best host among the existing categories. These decisions are based on category
utility. The merging and splitting operators allow COBWEB to perform a bidirectional
search—for example, a merge can undo a previous split.
Limitations of COBWEB
First, it is based on the assumption that probability distributions on separate
attributes are statistically independent of one another. This assumption is, however, not always
true because correlation between attributes often exists. Moreover, the probability distribution
representation of clusters makes it quite expensive to update and store the clusters. This is
especially so when the attributes have a large number of values because the time and space
complexities depend not only on the number of attributes, but also on the number of values for
each attribute. Furthermore, the classification tree is not height-balanced for skewed input data,
which may cause the time and space complexity to degrade dramatically.
Subspace clustering is an extension to attribute subset selection that has shown its
strength at high-dimensional clustering. It is based on the observation that different subspaces
may contain different, meaningful clusters. Subspace clustering searches for groups of clusters
within different subspaces of the same data set. The problem becomes how to find such
subspace clusters effectively and efficiently.
In the first step, CLIQUE partitions the d-dimensional data space into non
overlapping rectangular units, identifying the dense units among these. This is done (in 1-D)
for each dimension. For example, Figure 5.12 shows dense rectangular units found with respect
to age for the dimensions salary and (number of weeks of) vacation. The subspaces
representing these dense units are intersected to form a candidate search space in which dense
units of higher dimensionality may exist.
Figure 5.12 Dense units found with respect to age for the dimensions salary and vacation are
intersected in order to provide a candidate search space for dense units of higher
dimensionality.
CLIQUE confines its search for dense units of higher dimensionality to the
intersection of the dense units in the subspaces because the identification of the candidate
search space is based on the Apriori property used in association rule mining. In general, the
property employs prior knowledge of items in the search space so that portions of the space can
be pruned. The property, adapted for CLIQUE, states the following: If a k-dimensional unit is
dense, then so are its projections in (k−1)-dimensional space. That is, given a k-dimensional
candidate dense unit, if we check its (k−1)-dimensional projection units and find any that are
not dense, then we know that the k-dimensional unit cannot be dense either. Therefore, we can generate
potential or candidate dense units in k-dimensional space from the dense units found in (k −1)-
dimensional space. In general, the resulting space searched is much smaller than the original
space. The dense units are then
examined in order to determine the clusters.
In the second step, CLIQUE generates a minimal description for each cluster as
follows. For each cluster, it determines the maximal region that covers the cluster of connected
dense units. It then determines a minimal cover (logic description) for each cluster. CLIQUE
automatically finds subspaces of the highest dimensionality such that high-density clusters
exist in those subspaces. It is insensitive to the order of input objects and does not presume any
canonical data distribution. It scales linearly with the size of input and has good scalability as
the number of dimensions in the data is increased. However, obtaining meaningful clustering
results is dependent on proper tuning of the grid size (which is a stable structure here) and the
density threshold. This is particularly difficult because the grid size and density threshold are
used across all combinations of dimensions in the data set. Thus, the accuracy of the clustering
results may be degraded at the expense of the simplicity of the method.
5.9.2 PROCLUS: A Dimension-Reduction Subspace Clustering Method
PROCLUS (PROjected CLUStering) is a typical dimension-reduction subspace
clustering method. That is, instead of starting from single-dimensional spaces, it starts by
finding an initial approximation of the clusters in the high-dimensional attribute space. Each
dimension is then assigned a weight for each cluster, and the updated weights are used in the
next iteration to regenerate the clusters. This leads to the exploration of dense regions in all
subspaces of some desired dimensionality and avoids the generation of a large number of
overlapped clusters in projected dimensions of lower dimensionality.
PROCLUS finds the best set of medoids by a hill-climbing process similar to that
used in CLARANS, but generalized to deal with projected clustering. It adopts a distance
measure called Manhattan segmental distance, which is the Manhattan distance on a set of
relevant dimensions. The PROCLUS algorithm consists of three phases: initialization,
iteration, and cluster refinement. In the initialization phase, it uses a greedy algorithm to select
a set of initial medoids that are far apart from each other so as to ensure that each cluster is
represented by at least one object in the selected set. More concretely, it first chooses a random
sample of data points proportional to the number of clusters we wish to generate, and then
applies the greedy algorithm to obtain an even smaller final subset for the next phase. The
iteration phase selects a random set of k medoids from this reduced set (of medoids), and
replaces ―bad‖ medoids with randomly chosen new medoids if the clustering is improved. For
each medoid, a set of dimensions is chosen whose average distances are small compared to
statistical expectation. The total number of dimensions associated to medoids must be k×l,
where l is an input parameter that selects the average dimensionality of cluster subspaces. The
refinement phase computes new dimensions for each medoid based on the clusters found,
reassigns points to medoids, and removes outliers.
Experiments on PROCLUS show that the method is efficient and scalable at finding
high-dimensional clusters. Unlike CLIQUE, which outputs many overlapped clusters,
PROCLUS finds nonoverlapped partitions of points. The discovered clusters may help better
understand the high-dimensional data and facilitate subsequent analyses.
The following section explains few methods of how efficient constraint-based clustering
methods can be developed for large data sets.
The alternative distribution is very important in determining the power of the test, that is, the
probability that the working hypothesis is rejected when oi is really an outlier. There are
different kinds of alternative distributions.
Inherent alternative distribution: In this case, the working hypothesis that all of the objects
come from distribution F is rejected in favor of the alternative hypothesis that all of the objects
arise from another distribution, G:
F and G may be different distributions or differ only in parameters of the same distribution.
There are constraints on the form of the G distribution in that it must have potential to produce
outliers. For example, it may have a different mean or dispersion, or a longer tail.
Mixture alternative distribution: The mixture alternative states that discordant values are not
outliers in the F population, but contaminants from some other population, G. In this case, the
alternative hypothesis is
Slippage alternative distribution: This alternative states that all of the objects (apart from
some prescribed small number) arise independently from the initial model, F, with its given
parameters, whereas the remaining objects are independent observations from a modified
version of F in which the parameters have been shifted.
The local outlier factor (LOF) of p captures the degree to which we call p an outlier. It
is defined as
LOF(p) = ( Σ (o ∈ N_MinPts(p)) lrd(o) / lrd(p) ) / |N_MinPts(p)|,
where lrd denotes local reachability density and N_MinPts(p) is the set of p's MinPts-nearest
neighbors.
It is the average of the ratio of the local reachability density of p and those of p‘s
MinPts-nearest neighbors. It is easy to see that the lower p‘s local reachability density
is, and the higher the local reachability density of p‘s MinPts-nearest neighbors are, the
higher LOF(p) is.
From this definition, if an object p is not a local outlier, LOF(p) is close to 1. The more that p
is qualified to be a local outlier, the higher LOF(p) is. Therefore, we can determine
whether a point p is a local outlier based on the computation of LOF(p). Experiments
based on both synthetic and real-world large data sets have demonstrated the power of
LOF at identifying local outliers.
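Assuming scikit-learn is available, LOF scores can be sketched as follows; n_neighbors plays
the role of MinPts, and the data below are hypothetical:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# A hypothetical dense cluster plus one isolated point
X = np.vstack([np.random.randn(50, 2), [[8.0, 8.0]]])

lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)                 # -1 flags likely local outliers
scores = -lof.negative_outlier_factor_      # LOF(p); values well above 1 indicate outliers

print(labels[-1], round(scores[-1], 2))     # the isolated point should stand out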
Euclidean distance. Suppose initially we assign A1, B1 and C1 as the center of each cluster,
respectively. Apply the k-means algorithm to show the three cluster centers after the first
round of execution and the final three clusters.
12. Explain hierarchical and density based clustering methods with example.
13. Write the types of data in cluster analysis and explain.
14. What is an outlier? Explain outlier analysis with example.
15. Explain with an example density based outlier detection.
16. Discuss the following clustering algorithms with example 1. K-Means 2. K- Medoids
17. Explain the working of PAM algorithm.
18. Explain how data mining is used for intrusion detection.
19. Write the difference between CLARA and CLARANS.