Assignment Solution 074

Unit-1

1. Data mining is the process of discovering patterns in large data sets,
involving methods at the intersection of machine learning, statistics, and
database systems.

The steps involved in data mining, when viewed as a process of knowledge
discovery, are as follows:
• Data cleaning, a process that removes or transforms noise and inconsistent
data.
• Data integration, where multiple data sources may be combined.
• Data selection, where data relevant to the analysis task are retrieved from
the database.
• Data transformation, where data are transformed or consolidated into forms
appropriate for mining.
• Data mining, an essential process where intelligent and efficient methods
are applied in order to extract patterns.
• Pattern evaluation, a process that identifies the truly interesting patterns
representing knowledge based on some interestingness measures.
• Knowledge presentation, where visualization and knowledge
representation techniques are used to present the mined knowledge to the
user.
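As a concrete illustration of the first few steps, here is a minimal pandas sketch that walks a toy data set through cleaning, integration, selection, and transformation; the tables and column names are invented for illustration and are not part of the assignment data.

import pandas as pd

# Toy source tables (invented values, for illustration only)
sales = pd.DataFrame({"cust_id": [1, 2, 2, 3], "amount": [100.0, None, 250.0, 80.0]})
custs = pd.DataFrame({"cust_id": [1, 2, 3], "city": ["Pune", "Delhi", "Mumbai"]})

# Data cleaning: impute the missing amount with the attribute mean
sales["amount"] = sales["amount"].fillna(sales["amount"].mean())

# Data integration: combine the two sources on the shared key
data = sales.merge(custs, on="cust_id")

# Data selection: keep only the columns relevant to the analysis task
data = data[["city", "amount"]]

# Data transformation: consolidate into a form appropriate for mining
print(data.groupby("city")["amount"].sum())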

2.
 Descriptive Tasks: These tasks present the general properties of the data
stored in a database. Descriptive tasks are used to find patterns in the
data, e.g. clusters, correlations, trends, and anomalies.
 Predictive Tasks: Predictive data mining tasks predict the value of one
attribute, known as the target or dependent variable, on the basis of the
values of other attributes, known as independent variables.
 Prediction: A predictive model determines a future outcome rather than
describing present behavior. The predicted attribute can be numeric or
categorical. Prediction involves discovering the set of attributes relevant
to the attribute of interest and predicting its value distribution from data
similar to the selected object(s); for example, one may predict the kind of
disease based on the symptoms of a patient.
 Classification: Classification builds models from data with predefined
classes; the model is then used to classify new instances whose class is
not known. The instances used to build the model are known as training
data. A decision tree or a set of classification rules is produced by this
mechanism and can be applied to label future data; for example, one may
classify an employee's potential salary on the basis of the salary
classification of similar employees in the company. (A runnable sketch of
classification, clustering, and outlier analysis follows this list.)
 Clustering: Clustering is the process of partitioning a set of objects or
data points into groups, called clusters, such that objects in the same
cluster are more similar (in some sense or another) to each other than to
those in other clusters. Clustering is used in many fields, including
machine learning, pattern recognition, bioinformatics, image analysis, and
information retrieval.
 Mining Frequent Patterns, Associations and Correlations: A frequent
pattern is a pattern (a set of items, a subsequence, a substructure, etc.)
that appears frequently in data. A frequent itemset is a set of items that
often occur together in a transaction data set, for example table and
chair. A frequent subsequence, such as first buying a computer system,
then a UPS, and thereafter a printer, that appears often in a shopping
history database is called a frequent sequential pattern. A substructure
refers to particular structural forms such as subgraphs or subtrees; if a
substructure occurs frequently, it is called a frequent structural pattern.
Discovering such frequent patterns plays an important role in mining
associations, correlations, clustering, and other data mining tasks. (A
frequent-itemset sketch follows this list.)
 Outlier Analysis: An outlier is an object in a database that differs
significantly from the rest of the data. "An outlier is an observation
which deviates so much from the other observations as to arouse suspicions
that it was generated by a different mechanism."
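The classification, clustering, and outlier-analysis points above can be illustrated with a short scikit-learn sketch; the Iris data set and the 3-standard-deviation outlier rule are my own choices, not part of the assignment.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Classification: learn a decision tree from training data with known
# classes, then classify held-out instances whose class is withheld
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("classification accuracy:", tree.score(X_test, y_test))

# Clustering: partition the same objects into groups without using labels
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))

# Outlier analysis: flag observations more than 3 standard deviations
# from an attribute's mean (a simple z-score rule)
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
print("outlier rows:", np.where((z > 3).any(axis=1))[0])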
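For frequent patterns, the sketch below directly counts 2-itemsets over a handful of invented transactions; it is a minimal illustration of frequent-itemset mining, not the full Apriori algorithm.

from collections import Counter
from itertools import combinations

# Toy market-basket transactions (invented for illustration)
transactions = [
    {"computer", "ups", "printer"},
    {"computer", "ups"},
    {"table", "chair"},
    {"computer", "ups", "printer", "chair"},
    {"table", "chair", "printer"},
]
min_support = 2  # frequent = appears in at least 2 transactions

# Count every 2-item subset across all transactions
counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        counts[pair] += 1

frequent_pairs = {p: c for p, c in counts.items() if c >= min_support}
print(frequent_pairs)  # e.g. ('computer', 'ups') occurs in 3 transactions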
3.
 Characterization: It is a summarization of the general features of objects
in a target class, and produces what are called characteristic rules.
 Discrimination: It is a comparison of the general features of the target
class objects with the general features of objects from one or more
contrasting classes, and produces what are called discriminant rules.
 Association and Correlation Analysis: Association rules are if-then
statements that express the probability of relationships between data
items within large data sets. Association rule mining has a number of
applications and is widely used to discover sales correlations in
transactional data and patterns in medical data sets.
 Classification: It is a data mining function that assigns items in a
collection to target categories or classes. The goal of classification is
to accurately predict the target class for each case in the data. 
 Prediction: Prediction identifies the value of a data item purely from
the values of other, related data. It is not necessarily about future
events; rather, the value of the variable being predicted is unknown.
 Clustering: It is the process of partitioning the data (or objects) into
groups such that the data in one cluster are more similar to each other
than to those in other clusters.
 Evolution Analysis: It describes and models regularities or trends for
objects whose behavior changes over time.

Unit-2

1. Problem Statement

Your client is a financial distribution company. Over the last 10 years, they have
created an offline distribution channel across the country. They sell financial
products to consumers by hiring agents in their network. These agents are
freelancers and get a commission when they make a product sale.

Overview of your client's onboarding process


The managers at your client are primarily responsible for recruiting agents. Once a
manager has identified a potential applicant, he explains the business
opportunity to the applicant. Once the applicant consents, an application is
made to your client to become an agent. Over the next 3 months, this potential
agent has to undergo 7 days of training at your client's branch (covering sales
processes and various products) and clear a subsequent examination in order to
become an agent.

The problem - who are the best agents?


As is evident from the above process, your client makes a significant
investment in identifying, training, and recruiting these agents. However,
some agents do not bring in the expected business. Your client is looking for
help from data scientists like you to provide insights from their past
recruitment data. They want to predict the target variable for each potential
agent, which would help them identify the right agents to hire.
Key point: The evaluation metric to be used is ROC-AUC.

Data 

Variable Definition

ID Unique Application ID

Office_PIN PINCODE of Your client's Offices

Applicant_City_PIN PINCODE of Applicant Address

Applicant_Gender Applicant's Gender

Applicant_Marital_Status Applicant's Marital Status

Applicant_Occupation Applicant's Occupation


Applicant_Qualification Applicant's Educational Qualification

Manager_Joining_Designation Manager's Joining Designation

Manager_Current_Designation Manager's Designation at the time of application sourcing

Manager_Grade Manager's Grade

Manager_Status Current Employment Status (Probation/Confirmation)

Manager_Gender Manager's Gender

Manager_Num_Application Number of Applications sourced in the last 3 months by the Manager

Manager_Num_Coded Number of agents recruited by the manager in the last 3 months

Manager_Business Amount of business sourced by the manager in the last 3 months

Manager_Num_Products Number of products sold by the manager in the last 3 months

Manager_Business2 Amount of business sourced by the manager in the last 3 months excluding business from their Category A advisor

Manager_Num_Products2 Number of products sold by the manager in the last 3 months excluding business from their Category A advisor

Business_Sourced (Target) Business sourced by the applicant within 3 months of recruitment [1/0]
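Given the problem statement and the data dictionary above, the following is a minimal modeling sketch scored with the stated ROC-AUC metric. The file name train.csv, the random-forest model, and the preprocessing choices are my assumptions; the column names follow the data dictionary, though the exact header spelling in the real file may differ.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Assumed file name; columns are taken from the data dictionary above
df = pd.read_csv("train.csv")

y = df["Business_Sourced"]
X = df.drop(columns=["ID", "Business_Sourced"])

# One-hot encode the categorical applicant/manager attributes and
# mean-impute any remaining missing numeric values
X = pd.get_dummies(X)
X = X.fillna(X.mean())

X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Evaluate with the stated metric: ROC-AUC on predicted probabilities
val_scores = model.predict_proba(X_val)[:, 1]
print("ROC-AUC:", roc_auc_score(y_val, val_scores))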

2. Data Cleaning:
(a) Methods to handle Missing Values (a short pandas sketch follows the list):
 Deleting Rows
 Replacing with Mean/Median/Mode
 Assigning a Unique Category
 Predicting the Missing Value
 Using algorithms that support the missing values
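A minimal pandas sketch of the first three methods, using two attributes from the data dictionary with invented values:

import pandas as pd

df = pd.DataFrame({
    "Applicant_Occupation": ["Salaried", None, "Business", None],
    "Manager_Grade": [3.0, None, 2.0, 4.0],
})

# Deleting rows that contain any missing value
dropped = df.dropna()

# Replacing a numeric attribute with its mean (median/mode work the same way)
df["Manager_Grade"] = df["Manager_Grade"].fillna(df["Manager_Grade"].mean())

# Assigning a unique category to missing categorical values
df["Applicant_Occupation"] = df["Applicant_Occupation"].fillna("Unknown")
print(df)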
(b) D = {12, 14, 3, 23, 16, 7, 8, 4, 11, 10, 20, 5}
Sorted data: 3, 4, 5, 7, 8, 10, 11, 12, 14, 16, 20, 23
Partition into equal-frequency bins of depth 4:
- Bin 1: 3, 4, 5, 7
- Bin 2: 8, 10, 11, 12
- Bin 3: 14, 16, 20, 23
[i] Smoothing by bin means (each value is replaced by its bin's mean):
- Bin 1: 4.75, 4.75, 4.75, 4.75
- Bin 2: 10.25, 10.25, 10.25, 10.25
- Bin 3: 18.25, 18.25, 18.25, 18.25
[ii] Smoothing by bin boundaries (each value is replaced by the closer bin
boundary; the ties 5 and 10 are assigned to the lower boundary):
- Bin 1: 3, 3, 3, 7
- Bin 2: 8, 8, 12, 12
- Bin 3: 14, 14, 23, 23
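A small NumPy sketch that reproduces the smoothing above (the depth of 4 follows the partition shown):

import numpy as np

D = np.array([12, 14, 3, 23, 16, 7, 8, 4, 11, 10, 20, 5])
bins = np.sort(D).reshape(3, 4)  # three equal-frequency bins of depth 4

# Smoothing by bin means: every value becomes its bin's mean
by_means = np.repeat(bins.mean(axis=1, keepdims=True), 4, axis=1)
print(by_means)

# Smoothing by bin boundaries: every value moves to the nearer boundary
lo, hi = bins[:, :1], bins[:, -1:]
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)  # ties go to the lower boundary
print(by_bounds)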
(c) Handling Noisy Data:
 Binning Method
 Clustering
 Regression
 Combined computer and human inspection
3. Data Integration:
(a)
BASIS FOR COMPARISON   SCHEMA                         INSTANCE

Basic                  Description of the database.   Snapshot of the database at a specific moment.

Change occurrence      Rare.                          Frequent.

Initial state          Empty.                         Always has some data.

(b)
 Schema Integration: It integrates metadata from different sources and
addresses the entity identification problem, i.e. matching attributes that
refer to the same real-world entity across sources.
 Redundancy: An attribute may be redundant if it can be derived or obtained
from another attribute or set of attributes. Inconsistencies in attribute
naming may also cause redundancy. (A correlation-based redundancy check is
sketched below.)
 Detection and Resolution of Data Value Conflicts: Attribute values from
different sources may differ for the same real-world entity, for example
because of differences in representation, scaling, or encoding.
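Redundancy between numeric attributes can be detected with a correlation measure; a short sketch using invented values for two of the manager attributes from Unit-2:

import pandas as pd

# Two attributes that plausibly overlap (values invented)
df = pd.DataFrame({
    "Manager_Business":     [120, 300, 80, 450, 210],
    "Manager_Num_Products": [3,   8,   2,  11,  5],
})

# A coefficient near +1 or -1 suggests one attribute is largely
# derivable from the other, i.e. redundant after integration
r = df["Manager_Business"].corr(df["Manager_Num_Products"])
print(f"Pearson correlation: {r:.3f}")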

4. Data Reduction and Transformation:


(a) Hierarchy Creation
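The answer names the technique only; as an illustration, a concept hierarchy for a numeric attribute can be generated by binning it into higher-level ranges. The grade values and band labels below are assumptions, not part of the assignment data.

import pandas as pd

grades = pd.Series([1, 2, 2, 3, 4, 5, 6, 7])  # invented Manager_Grade values

# Climb the hierarchy: raw grade -> coarser seniority band
bands = pd.cut(grades, bins=[0, 2, 5, 7], labels=["Junior", "Middle", "Senior"])
print(pd.DataFrame({"grade": grades, "band": bands}))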
Unit-3
1.

=== Run information ===

Evaluator:        weka.attributeSelection.CfsSubsetEval -P 1 -E 1
Search:           weka.attributeSelection.BestFirst -D 1 -N 5
Relation:         supermarket
Instances:        4627
Attributes:       217
                  [list of attributes omitted]
Evaluation mode:  evaluate on all training data

=== Attribute Selection on all input data ===

Search Method:
    Best first.
    Start set: no attributes
    Search direction: forward
    Stale search after 5 node expansions
    Total number of subsets evaluated: 1076
    Merit of best subset found: 0

Attribute Subset Evaluator (supervised, Class (nominal): 217 total):
    CFS Subset Evaluator
    Including locally predictive attributes

Selected attributes: 1 : 1
                     department1
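In this run, CFS with forward best-first search evaluated 1076 subsets and reported a best-subset merit of 0, i.e. it found essentially no attribute subset correlated with the default class on supermarket, selecting only department1. A rough scikit-learn analogue of forward feature-subset selection is sketched below; the built-in breast-cancer data set stands in for the supermarket ARFF, which is not reproduced here.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Greedy forward selection, conceptually similar to BestFirst with
# direction=forward (though the merit function differs from CFS)
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=5,
    direction="forward",
)
selector.fit(X, y)
print("selected attribute indices:", selector.get_support(indices=True))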

2.
