Assignment Solution 074
2.
Descriptive Tasks: These tasks characterize the general properties of
the data stored in a database. They are used to find patterns in the
data, such as clusters, correlations, trends and anomalies.
Predictive Tasks: Predictive data mining tasks predict the value of
one attribute on the basis of the values of other attributes. The
attribute being predicted is known as the target or dependent variable,
and the attributes used for making the prediction are known as
independent variables.
Prediction: A predictive model estimates a future or unknown outcome
rather than describing present behavior. The predicted attribute can be
numeric or categorical. Prediction involves finding the set of
attributes relevant to the attribute of interest and predicting its
value distribution based on data similar to the selected object(s).
For example, one may predict the kind of disease based on the symptoms
of a patient.
Classification: Classification builds a model from data with
predefined classes; the model is then used to classify new instances
whose class is not known. The instances used to create the model are
known as training data. The model may take the form of a decision tree
or a set of classification rules, which can then be applied to identify
the class of future data. For example, one may classify an employee’s
potential salary on the basis of the salary classification of similar
employees in the company.
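As a minimal sketch of the idea of classification rules, the Python snippet below learns one rule per attribute value (the majority class seen in the training data) and applies it to new instances. The job titles, salary bands and function names are hypothetical and chosen only for illustration; real classifiers such as decision trees use many attributes.

```python
from collections import Counter, defaultdict

def learn_rules(training_data):
    """Map each attribute value to the majority class seen in training."""
    counts = defaultdict(Counter)
    for value, label in training_data:
        counts[value][label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def classify(rules, value, default=None):
    """Classify a new instance; fall back to `default` for unseen values."""
    return rules.get(value, default)

# Hypothetical training data: (job title, salary band)
training = [
    ("engineer", "high"),
    ("engineer", "high"),
    ("engineer", "medium"),
    ("clerk", "low"),
    ("clerk", "low"),
    ("analyst", "medium"),
]

rules = learn_rules(training)
print(classify(rules, "engineer"))                   # high
print(classify(rules, "intern", default="unknown"))  # unknown
```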
Clustering: Clustering is the process of partitioning a set of objects
into groups called clusters, so that the objects in a cluster are more
similar (in some sense or another) to each other than to those in other
clusters. Clustering is used in many fields, including machine
learning, pattern recognition, bioinformatics, image analysis and
information retrieval.
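The text does not name a particular clustering algorithm. As one common example, the sketch below runs k-means on one-dimensional data; the data values, the initial centers and the function name are assumptions made for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """k-means on 1-D data, starting from the given initial centers."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

data = [1, 2, 3, 10, 11, 12]
centers, clusters = kmeans_1d(data, centers=[1, 12])
print(centers)   # [2.0, 11.0]
print(clusters)  # [[1, 2, 3], [10, 11, 12]]
```

Objects within each resulting cluster are closer to their own center than to the other one, which is the similarity criterion described above.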
Mining Frequent Patterns, Associations and Correlations: A frequent
pattern is a pattern (a set of items, subsequence, substructure, etc.)
that appears frequently in data. A frequent itemset is a set of items
that often occur together in a transaction data set, for example table
and chair. A subsequence is an ordered pattern, such as first buying a
computer system, then a UPS, and thereafter a printer; if it appears
frequently in a shopping history database, it is called a frequent
sequential pattern. A substructure refers to structural forms such as
subgraphs or subtrees. If a substructure appears frequently, it is
called a frequent structural pattern. Discovering frequent patterns
plays an important role in mining associations and correlations,
clustering and other data mining tasks.
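To make the notion of a frequent itemset concrete, the sketch below counts how often each pair of items occurs together and keeps the pairs whose support (fraction of transactions containing them) reaches a threshold. It is a brute-force count rather than an optimized algorithm such as Apriori, and the transactions and the 0.5 threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, size, min_support):
    """Return {itemset: support} for itemsets of the given size."""
    counts = Counter()
    for t in transactions:
        # Sort so that each itemset has one canonical tuple form.
        for combo in combinations(sorted(set(t)), size):
            counts[combo] += 1
    n = len(transactions)
    return {items: c / n for items, c in counts.items() if c / n >= min_support}

# Hypothetical transaction data set
transactions = [
    {"table", "chair"},
    {"table", "chair", "lamp"},
    {"table", "lamp"},
    {"chair"},
]

pairs = frequent_itemsets(transactions, size=2, min_support=0.5)
print(pairs)
```

Here the pairs (chair, table) and (lamp, table) each appear in two of the four transactions, so both have support 0.5 and are reported as frequent.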
Outlier Analysis: An outlier is an object in the database which is
significantly different from the rest of the data, and outlier analysis
is the task of finding such objects. “An outlier is an observation
which deviates so much from the other observations as to arouse
suspicions that it was generated by a different mechanism.”
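One simple way to flag such observations, used here only as an illustration and not prescribed by the text, is the z-score: a value is treated as an outlier when it lies more than a chosen number of standard deviations from the mean. The data and the threshold of 2.0 are assumptions.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []           # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

readings = [10, 12, 11, 13, 12, 11, 50]
print(find_outliers(readings))  # [50]
```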
3.
Characterization: It is a summarization of general features of objects
in a target class, and produces what is called characteristic rules.
Discrimination: It is a comparison of the general features of objects
in a target class with the general features of objects in one or more
contrasting classes, and produces what are called discriminant rules.
Association and Correlation Analysis: Association rules are if-then
statements that help to show the probability of relationships
between data items within large data sets in various
types of databases. Association rule mining has a
number of applications and is widely used to help discover
sales correlations in transactional data or in medical data sets.
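The strength of an if-then rule is usually measured by its support and confidence. The sketch below computes both for a single rule; the transactions, item names and function name are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    # Transactions containing every item of the rule.
    both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    # Transactions containing the "if" part only.
    ante = sum(1 for t in transactions if antecedent <= set(t))
    support = both / n
    confidence = both / ante if ante else 0.0
    return support, confidence

# Hypothetical transactional data
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]

support, confidence = rule_metrics(baskets, {"bread"}, {"butter"})
print(support, confidence)
```

For the rule "if bread then butter", two of the four baskets contain both items (support 0.5), and two of the three baskets with bread also contain butter (confidence about 0.67).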
Classification: It is a data mining function that assigns items in a
collection to target categories or classes. The goal of classification is
to accurately predict the target class for each case in the data.
Prediction: Prediction in data mining estimates an unknown data value
purely from the description of other, related data values. It is not
necessarily related to future events, but the predicted values are
unknown.
Clustering: It is the process of partitioning the data (or objects) into
classes (clusters) so that the data in one cluster are more similar to
each other than to those in other clusters.
Evolution Analysis: It describes and models regularities or trends for
objects whose behavior changes over time.
Unit-2
1. Problem Statement
Your client is a financial distribution company. Over the last 10 years, they have
created an offline distribution channel across the country. They sell financial
products to consumers by hiring agents in their network. These agents are
freelancers and get a commission when they make a product sale.
Data
Variable                 Definition
ID                       Unique Application ID
Manager_Business2        Amount of business sourced by the manager in the last 3 months,
                         excluding business from their Category A advisor
Manager_Num_Products2    Number of products sold by the manager in the last 3 months,
                         excluding business from their Category A advisor
2. Data Cleaning:
(a) Methods to handle Missing Values:
Deleting Rows
Replacing with Mean/Median/Mode
Assigning a Unique Category
Predicting the Missing Value
Using algorithms that support missing values
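Two of the methods above, deleting rows and replacing with the mean, median or mode, can be sketched as follows. Missing values are represented by None, and the sample values and function names are hypothetical.

```python
from statistics import mean, median, mode

def drop_missing(values):
    """Deleting rows: keep only the entries that have a value."""
    return [v for v in values if v is not None]

def impute(values, strategy="mean"):
    """Replace missing entries with the mean, median or mode of the rest."""
    known = drop_missing(values)
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](known)
    return [fill if v is None else v for v in values]

ages = [20, None, 30, 30, None, 60]
print(drop_missing(ages))      # [20, 30, 30, 60]
print(impute(ages, "mean"))    # missing entries become 35
print(impute(ages, "median"))  # missing entries become 30
print(impute(ages, "mode"))    # missing entries become 30
```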
(b) D={12,14,3,23,16,7,8,4,11,10,20,5}
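Sorted data: 3, 4, 5, 7, 8, 10, 11, 12, 14, 16, 20, 23
Partition into equal-depth bins of depth 4:
- Bin 1: 3, 4, 5, 7
- Bin 2: 8, 10, 11, 12
- Bin 3: 14, 16, 20, 23
[i] Smoothing by bin means:
- Bin 1: 4.75, 4.75, 4.75, 4.75
- Bin 2: 10.25, 10.25, 10.25, 10.25
- Bin 3: 18.25, 18.25, 18.25, 18.25
[ii] Smoothing by bin boundaries (ties assigned to the lower boundary):
- Bin 1: 3, 3, 3, 7
- Bin 2: 8, 8, 12, 12
- Bin 3: 14, 14, 23, 23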
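The binning procedure can be sketched in Python for the data set D. The sketch assumes equal-depth bins of four values over the sorted data and, for smoothing by boundaries, that a value equally close to both boundaries goes to the lower one; both choices are assumptions, and the function names are illustrative.

```python
def equal_depth_bins(values, depth):
    """Sort the data and split it into consecutive bins of `depth` values."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        # Assumption: ties go to the lower boundary.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

D = [12, 14, 3, 23, 16, 7, 8, 4, 11, 10, 20, 5]
bins = equal_depth_bins(D, depth=4)
print(bins)                        # [[3, 4, 5, 7], [8, 10, 11, 12], [14, 16, 20, 23]]
print(smooth_by_means(bins))       # bin means 4.75, 10.25, 18.25
print(smooth_by_boundaries(bins))  # [[3, 3, 3, 7], [8, 8, 12, 12], [14, 14, 23, 23]]
```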
(c) Handling Noisy Data:
Binning Method
Clustering
Regression
Combined computer and human inspection
3. Data Integration:
(a)
Basis for Comparison     Schema     Instance
(b)
Schema Integration: It integrates metadata from different sources. It
involves the entity identification problem, i.e. matching up equivalent
real-world entities across sources.
Redundancy: An attribute may be redundant if it can be derived or obtained
from another attribute or set of attributes. Inconsistencies in attribute
naming may also cause redundancy.
Detection and Resolution of Data Value Conflicts: Attribute values from
different sources may differ for the same real-world entity.
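For numeric attributes, one common way to spot the kind of redundancy described above is correlation analysis: if two attributes are almost perfectly correlated, one can likely be derived from the other. The text does not prescribe this technique; the sketch below is an illustration, and the attribute names, values and the 0.95 cutoff are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical attributes from two sources: annual salary is 12 x monthly.
monthly_salary = [1000, 2000, 3000, 4000]
annual_salary = [12000, 24000, 36000, 48000]

r = pearson(monthly_salary, annual_salary)
# Assumption: treat |r| above 0.95 as a sign of a redundant attribute.
print(abs(r) > 0.95)  # True
```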
2.