Assignment Solution 074

Unit-1

1. Data mining is the process of discovering patterns in large data sets,
involving methods at the intersection of machine learning, statistics, and
database systems.

The steps involved in data mining, when viewed as a process of knowledge
discovery, are as follows:
• Data cleaning, a process that removes or transforms noise and inconsistent
data.
• Data integration, where multiple data sources may be combined.
• Data selection, where data relevant to the analysis task are retrieved from
the database.
• Data transformation, where data are transformed or consolidated into forms
appropriate for mining.
• Data mining, an essential process where intelligent and efficient methods
are applied in order to extract patterns.
• Pattern evaluation, a process that identifies the truly interesting patterns
representing knowledge based on some interestingness measures.
• Knowledge presentation, where visualization and knowledge
representation techniques are used to present the mined knowledge to the
user.
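As a concrete illustration of the first few steps, here is a minimal pandas sketch that walks a toy data set through cleaning, integration, selection, and transformation; the tables and column names are invented for illustration and are not part of the assignment data.

import pandas as pd

# Toy source tables (invented values, for illustration only)
sales = pd.DataFrame({"cust_id": [1, 2, 2, 3], "amount": [100.0, None, 250.0, 80.0]})
custs = pd.DataFrame({"cust_id": [1, 2, 3], "city": ["Pune", "Delhi", "Mumbai"]})

# Data cleaning: impute the missing amount with the attribute mean
sales["amount"] = sales["amount"].fillna(sales["amount"].mean())

# Data integration: combine the two sources on the shared key
data = sales.merge(custs, on="cust_id")

# Data selection: keep only the columns relevant to the analysis task
data = data[["city", "amount"]]

# Data transformation: consolidate into a form appropriate for mining
print(data.groupby("city")["amount"].sum())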

2.
 Descriptive Tasks: These tasks present the general properties of the data
stored in a database. Descriptive tasks are used to find patterns in the
data, e.g. clusters, correlations, trends, and anomalies.
 Predictive Tasks: Predictive data mining tasks predict the value of one
attribute, known as the target or dependent variable, on the basis of the
values of other attributes, known as independent variables.
 Prediction: A predictive model determines a future outcome rather than
describing present behavior. The predicted attribute can be numeric or
categorical. Prediction involves discovering the set of attributes relevant
to the attribute of interest and predicting its value distribution from data
similar to the selected object(s); for example, one may predict the kind of
disease based on the symptoms of a patient.
 Classification: Classification builds models from data with predefined
classes; the model is then used to classify new instances whose class is
not known. The instances used to build the model are known as training
data. A decision tree or a set of classification rules is produced by this
mechanism and can be applied to label future data; for example, one may
classify an employee's potential salary on the basis of the salary
classification of similar employees in the company. (A runnable sketch of
classification, clustering, and outlier analysis follows this list.)
 Clustering: Clustering is the process of partitioning a set of objects or
data points into groups, called clusters, such that objects in the same
cluster are more similar (in some sense or another) to each other than to
those in other clusters. Clustering is used in many fields, including
machine learning, pattern recognition, bioinformatics, image analysis, and
information retrieval.
 Mining Frequent Patterns, Associations and Correlations: A frequent
pattern is a pattern (a set of items, a subsequence, a substructure, etc.)
that appears frequently in data. A frequent itemset is a set of items that
often occur together in a transaction data set, for example table and
chair. A frequent subsequence, such as first buying a computer system,
then a UPS, and thereafter a printer, that appears often in a shopping
history database is called a frequent sequential pattern. A substructure
refers to particular structural forms such as subgraphs or subtrees; if a
substructure occurs frequently, it is called a frequent structural pattern.
Discovering such frequent patterns plays an important role in mining
associations, correlations, clustering, and other data mining tasks. (A
frequent-itemset sketch follows this list.)
 Outlier Analysis: An outlier is an object in a database that differs
significantly from the rest of the data. "An outlier is an observation
which deviates so much from the other observations as to arouse suspicions
that it was generated by a different mechanism."
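The classification, clustering, and outlier-analysis points above can be illustrated with a short scikit-learn sketch; the Iris data set and the 3-standard-deviation outlier rule are my own choices, not part of the assignment.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Classification: learn a decision tree from training data with known
# classes, then classify held-out instances whose class is withheld
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("classification accuracy:", tree.score(X_test, y_test))

# Clustering: partition the same objects into groups without using labels
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))

# Outlier analysis: flag observations more than 3 standard deviations
# from an attribute's mean (a simple z-score rule)
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
print("outlier rows:", np.where((z > 3).any(axis=1))[0])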
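For frequent patterns, the sketch below directly counts 2-itemsets over a handful of invented transactions; it is a minimal illustration of frequent-itemset mining, not the full Apriori algorithm.

from collections import Counter
from itertools import combinations

# Toy market-basket transactions (invented for illustration)
transactions = [
    {"computer", "ups", "printer"},
    {"computer", "ups"},
    {"table", "chair"},
    {"computer", "ups", "printer", "chair"},
    {"table", "chair", "printer"},
]
min_support = 2  # frequent = appears in at least 2 transactions

# Count every 2-item subset across all transactions
counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        counts[pair] += 1

frequent_pairs = {p: c for p, c in counts.items() if c >= min_support}
print(frequent_pairs)  # e.g. ('computer', 'ups') occurs in 3 transactions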
3.
 Characterization: It is a summarization of the general features of objects
in a target class, and produces what are called characteristic rules.
 Discrimination: It is a comparison of the general features of the target
class objects with the general features of objects from one or more
contrasting classes, and produces what are called discriminant rules.
 Association and Correlation Analysis: Association rules are if-then
statements that express the probability of relationships between data
items within large data sets. Association rule mining has a number of
applications and is widely used to discover sales correlations in
transactional data and patterns in medical data sets.
 Classification: It is a data mining function that assigns items in a
collection to target categories or classes. The goal of classification is
to accurately predict the target class for each case in the data. 
 Prediction: Prediction identifies the value of a data item purely from
the values of other, related data. It is not necessarily about future
events; rather, the value of the variable being predicted is unknown.
 Clustering: It is the process of partitioning the data (or objects) into
groups such that the data in one cluster are more similar to each other
than to those in other clusters.
 Evolution Analysis: It describes and models regularities or trends for
objects whose behavior changes over time.

Unit-2

1. Problem Statement

Your client is a financial distribution company. Over the last 10 years, they have
created an offline distribution channel across the country. They sell financial
products to consumers by hiring agents in their network. These agents are
freelancers and get a commission when they make a product sale.

Overview of your client's onboarding process


The managers at your client are primarily responsible for recruiting agents. Once a
manager has identified a potential applicant, he explains the business
opportunity to the applicant. Once the applicant consents, an application is
made to your client to become an agent. Over the next 3 months, this potential
agent has to undergo 7 days of training at your client's branch (covering sales
processes and various products) and clear a subsequent examination in order to
become an agent.

The problem - who are the best agents?


As is evident from the above process, your client makes a significant
investment in identifying, training, and recruiting these agents. However,
some agents do not bring in the expected business. Your client is looking for
help from data scientists like you to provide insights from their past
recruitment data. They want to predict the target variable for each potential
agent, which would help them identify the right agents to hire.
Key point: The evaluation metric to be used is ROC-AUC.

Data 

Variable Definition

ID Unique Application ID

Office_PIN PINCODE of Your client's Offices

Applicant_City_PIN PINCODE of Applicant Address

Applicant_Gender Applicant's Gender

Applicant_Marital_Status Applicant's Marital Status

Applicant_Occupation Applicant's Occupation


Applicant_Qualification Applicant's Educational Qualification

Manager_Joining_Designation Manager's Joining Designation

Manager_Current_Designation Manager's Designation at the time of application sourcing

Manager_Grade Manager's Grade

Manager_Status Current Employment Status (Probation/Confirmation)

Manager_Gender Manager's Gender

Manager_Num_Application Number of Applications sourced in the last 3 months by the Manager

Manager_Num_Coded Number of agents recruited by the manager in the last 3 months

Manager_Business Amount of business sourced by the manager in the last 3 months

Manager_Num_Products Number of products sold by the manager in the last 3 months

Manager_Business2 Amount of business sourced by the manager in the last 3 months excluding business from their Category A advisor

Manager_Num_Products2 Number of products sold by the manager in the last 3 months excluding business from their Category A advisor

Business_Sourced (Target) Business sourced by the applicant within 3 months of recruitment [1/0]
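Given the problem statement and the data dictionary above, the following is a minimal modeling sketch scored with the stated ROC-AUC metric. The file name train.csv, the random-forest model, and the preprocessing choices are my assumptions; the column names follow the data dictionary, though the exact header spelling in the real file may differ.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Assumed file name; columns are taken from the data dictionary above
df = pd.read_csv("train.csv")

y = df["Business_Sourced"]
X = df.drop(columns=["ID", "Business_Sourced"])

# One-hot encode the categorical applicant/manager attributes and
# mean-impute any remaining missing numeric values
X = pd.get_dummies(X)
X = X.fillna(X.mean())

X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Evaluate with the stated metric: ROC-AUC on predicted probabilities
val_scores = model.predict_proba(X_val)[:, 1]
print("ROC-AUC:", roc_auc_score(y_val, val_scores))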

2. Data Cleaning:
(a) Methods to handle Missing Values (a short pandas sketch follows the list):
 Deleting Rows
 Replacing with Mean/Median/Mode
 Assigning a Unique Category
 Predicting the Missing Value
 Using algorithms that support the missing values
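A minimal pandas sketch of the first three methods, using two attributes from the data dictionary with invented values:

import pandas as pd

df = pd.DataFrame({
    "Applicant_Occupation": ["Salaried", None, "Business", None],
    "Manager_Grade": [3.0, None, 2.0, 4.0],
})

# Deleting rows that contain any missing value
dropped = df.dropna()

# Replacing a numeric attribute with its mean (median/mode work the same way)
df["Manager_Grade"] = df["Manager_Grade"].fillna(df["Manager_Grade"].mean())

# Assigning a unique category to missing categorical values
df["Applicant_Occupation"] = df["Applicant_Occupation"].fillna("Unknown")
print(df)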
(b) D = {12, 14, 3, 23, 16, 7, 8, 4, 11, 10, 20, 5}
Sorted data: 3, 4, 5, 7, 8, 10, 11, 12, 14, 16, 20, 23
Partition into equal-frequency bins of depth 4:
- Bin 1: 3, 4, 5, 7
- Bin 2: 8, 10, 11, 12
- Bin 3: 14, 16, 20, 23
[i] Smoothing by bin means (each value is replaced by its bin's mean):
- Bin 1: 4.75, 4.75, 4.75, 4.75
- Bin 2: 10.25, 10.25, 10.25, 10.25
- Bin 3: 18.25, 18.25, 18.25, 18.25
[ii] Smoothing by bin boundaries (each value is replaced by the closer bin
boundary; the ties 5 and 10 are assigned to the lower boundary):
- Bin 1: 3, 3, 3, 7
- Bin 2: 8, 8, 12, 12
- Bin 3: 14, 14, 23, 23
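A small NumPy sketch that reproduces the smoothing above (the depth of 4 follows the partition shown):

import numpy as np

D = np.array([12, 14, 3, 23, 16, 7, 8, 4, 11, 10, 20, 5])
bins = np.sort(D).reshape(3, 4)  # three equal-frequency bins of depth 4

# Smoothing by bin means: every value becomes its bin's mean
by_means = np.repeat(bins.mean(axis=1, keepdims=True), 4, axis=1)
print(by_means)

# Smoothing by bin boundaries: every value moves to the nearer boundary
lo, hi = bins[:, :1], bins[:, -1:]
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)  # ties go to the lower boundary
print(by_bounds)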
(c) Handling Noisy Data:
 Binning Method
 Clustering
 Regression
 Combined computer and human inspection
3. Data Integration:
(a)
BASIS FOR COMPARISON   SCHEMA                         INSTANCE

Basic                  Description of the database.   Snapshot of the database at a specific moment.

Change occurrence      Rare.                          Frequent.

Initial state          Empty.                         Always has some data.

(b)
 Schema Integration: It integrates metadata from different sources and
addresses the entity identification problem, i.e. matching attributes that
refer to the same real-world entity across sources.
 Redundancy: An attribute may be redundant if it can be derived or obtained
from another attribute or set of attributes. Inconsistencies in attribute
naming may also cause redundancy. (A correlation-based redundancy check is
sketched below.)
 Detection and Resolution of Data Value Conflicts: Attribute values from
different sources may differ for the same real-world entity, for example
because of differences in representation, scaling, or encoding.
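Redundancy between numeric attributes can be detected with a correlation measure; a short sketch using invented values for two of the manager attributes from Unit-2:

import pandas as pd

# Two attributes that plausibly overlap (values invented)
df = pd.DataFrame({
    "Manager_Business":     [120, 300, 80, 450, 210],
    "Manager_Num_Products": [3,   8,   2,  11,  5],
})

# A coefficient near +1 or -1 suggests one attribute is largely
# derivable from the other, i.e. redundant after integration
r = df["Manager_Business"].corr(df["Manager_Num_Products"])
print(f"Pearson correlation: {r:.3f}")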

4. Data Reduction and Transformation:


(a) Hierarchy Creation
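The answer names the technique only; as an illustration, a concept hierarchy for a numeric attribute can be generated by binning it into higher-level ranges. The grade values and band labels below are assumptions, not part of the assignment data.

import pandas as pd

grades = pd.Series([1, 2, 2, 3, 4, 5, 6, 7])  # invented Manager_Grade values

# Climb the hierarchy: raw grade -> coarser seniority band
bands = pd.cut(grades, bins=[0, 2, 5, 7], labels=["Junior", "Middle", "Senior"])
print(pd.DataFrame({"grade": grades, "band": bands}))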
Unit-3
1.

=== Run information ===

Evaluator:        weka.attributeSelection.CfsSubsetEval -P 1 -E 1
Search:           weka.attributeSelection.BestFirst -D 1 -N 5
Relation:         supermarket
Instances:        4627
Attributes:       217
                  [list of attributes omitted]
Evaluation mode:  evaluate on all training data

=== Attribute Selection on all input data ===

Search Method:
    Best first.
    Start set: no attributes
    Search direction: forward
    Stale search after 5 node expansions
    Total number of subsets evaluated: 1076
    Merit of best subset found: 0

Attribute Subset Evaluator (supervised, Class (nominal): 217 total):
    CFS Subset Evaluator
    Including locally predictive attributes

Selected attributes: 1 : 1
                     department1
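In this run, CFS with forward best-first search evaluated 1076 subsets and reported a best-subset merit of 0, i.e. it found essentially no attribute subset correlated with the default class on supermarket, selecting only department1. A rough scikit-learn analogue of forward feature-subset selection is sketched below; the built-in breast-cancer data set stands in for the supermarket ARFF, which is not reproduced here.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Greedy forward selection, conceptually similar to BestFirst with
# direction=forward (though the merit function differs from CFS)
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=5,
    direction="forward",
)
selector.fit(X, y)
print("selected attribute indices:", selector.get_support(indices=True))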

2.
