
DATA PREPARATION AND PROCESSING
DATA PREPARATION

• Once data have been collected, the process of analysis begins.
• But first, the data have to be translated into an appropriate form.
• This process is known as data preparation.
STEPS IN DATA PREPARATION
• Validate data
• Questionnaire checking
• Edit acceptable questionnaires
• Code the questionnaires
• Keypunch the data
• Clean the data set
• Statistically adjust the data
• Store the data set for analysis
• Analyse the data
VALIDATION
• Validity exists when the data actually measure what they are supposed to measure. If they fail to, they are misleading and should not be accepted.
• One of the most serious concerns is error in survey data.
• When secondary data are involved, they may be outdated or irrelevant.
• This review is equally important for primary data.
QUESTIONNAIRE CHECKING
• A questionnaire returned from the field may be
unacceptable for several reasons.
– Parts of the questionnaire may be
incomplete. Inadequate answers. No
responses to specific questions
– The pattern of responses may indicate that
the respondent did not understand or follow
the instructions.
– The responses show little variance.
– One or more pages are missing. 5
– The questionnaire is answered by someone who does not qualify for participation.
– Fictitious interviews
– Inconsistencies
– Illegible responses
– Yea- or nay-saying patterns
– Middle-of-the-road patterns
EDITING
• The next phase of data preparation involves editing the raw data.
• There are three basic approaches:
- Go back to the respondents for clarification
- Infer from other responses
- Discard the response altogether
Treatment of Unsatisfactory Responses

[Flowchart: unsatisfactory responses may be handled by returning to the field, assigning missing values, or discarding the unsatisfactory respondents; related options shown are substituting a neutral value, casewise deletion, and pairwise deletion.]
Treatment of Unsatisfactory Responses:
- Returning to the Field – The questionnaires with unsatisfactory responses may be returned to the field, where the interviewers recontact the respondents.
- Assigning Missing Values – If returning the questionnaires to the field is not feasible, the editor may assign missing values to the unsatisfactory responses.
- Discarding Unsatisfactory Respondents – In this approach, the respondents with unsatisfactory responses are simply discarded.
CODING
• Data entry refers to the creation of a computer file that holds the raw data taken from all of the questionnaires deemed suitable for analysis.
• Coding means assigning a code, usually a number, to each possible response to each question. The code includes an indication of the column position (field) and the data record it will occupy.
• Fixed field codes, which mean that the number of records for each respondent is the same and the same data appear in the same column(s) for all respondents, are highly desirable.
– If possible, standard codes should be used for missing data. Coding of structured questions is relatively simple, since the response options are predetermined.
– In questions that permit a large number of responses, each possible response option should be assigned a separate column.
– Guidelines for coding unstructured questions:
  – Category codes should be mutually exclusive and collectively exhaustive.
  – Only a few (10% or less) of the responses should fall into the “other” category.
  – Category codes should be assigned for critical issues even if no one has mentioned them.
  – Data should be coded to retain as much detail as possible.
• Principles for establishing categories for coding (a short coding sketch follows this list):
- A convenient number of categories
- Similar responses within categories
- Differences in responses between categories
- Mutually exclusive categories
- Exhaustive categories
- Avoid open-ended class intervals
- Class intervals of the same width
- Midpoints of class intervals
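The guidelines above can be applied programmatically. The following is a minimal Python sketch, not taken from the slides; the category names, codes, and helper function are illustrative assumptions.

```python
# Illustrative coding of open-ended answers into mutually exclusive,
# collectively exhaustive category codes, with a standard missing-data code.
CATEGORY_CODES = {
    "food quality": 1,
    "price": 2,
    "location": 3,
    "service": 4,
    "other": 9,        # kept small: ideally 10% or less of responses land here
}
MISSING_CODE = -9      # standard code for missing data

def code_response(raw_answer):
    """Map a free-text answer to a single numeric category code."""
    if raw_answer is None or raw_answer.strip() == "":
        return MISSING_CODE
    text = raw_answer.lower()
    for category, code in CATEGORY_CODES.items():
        if category != "other" and category in text:
            return code
    return CATEGORY_CODES["other"]

print(code_response("Great food quality and taste"))   # -> 1
print(code_response(""))                               # -> -9 (missing)
```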
CODE BOOK
• A codebook contains coding instructions and the necessary information about the variables in the data set. A codebook generally contains the following information:
- column number
- record number
- variable number
- variable name
- question number
- instructions for coding
• Thus, a codebook identifies all of the variable names and code numbers associated with each possible response to each question that makes up the data set.
Restaurant Preference
ID PREFER. QUALITY QUANTITY VALUE SERVICE INCOME
1 2 2 3 1 3 6
2 6 5 6 5 7 2
3 4 4 3 4 5 3
4 1 2 1 1 2 5
5 7 6 6 5 4 1
6 5 4 4 5 4 3
7 2 2 3 2 3 5
8 3 3 4 2 3 4
9 7 6 7 6 5 2
10 2 3 2 2 2 5
11 2 3 2 1 3 6
12 6 6 6 6 7 2
13 4 4 3 3 4 3
14 1 1 3 1 2 4
15 7 7 5 5 4 2
16 5 5 4 5 5 3
17 2 3 1 2 3 4
18 4 4 3 3 3 3
19 7 5 5 7 5 5
20 3 2 2 3 3 3
A Codebook Excerpt

Column   Variable   Variable     Question   Coding
Number   Number     Name         Number     Instructions

1        1          ID                      1 to 20 as coded

2        2          Preference   1          Input the number circled.
                                            1 = Weak preference
                                            7 = Strong preference

3        3          Quality      2          Input the number circled.
                                            1 = Poor
                                            7 = Excellent

4        4          Quantity     3          Input the number circled.
                                            1 = Poor
                                            7 = Excellent
A Codebook Excerpt (continued)

Column   Variable   Variable   Question   Coding
Number   Number     Name       Number     Instructions

5        5          Value      4          Input the number circled.
                                          1 = Poor
                                          7 = Excellent

6        6          Service    5          Input the number circled.
                                          1 = Poor
                                          7 = Excellent

7        7          Income     6          Input the number circled.
                                          1 = Less than $20,000
                                          2 = $20,000 to 34,999
                                          3 = $35,000 to 49,999
                                          4 = $50,000 to 74,999
                                          5 = $75,000 to 99,999
                                          6 = $100,000 or more
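The codebook can also be kept in a machine-readable form so that data-entry and cleaning scripts share one source of truth. The following Python dictionary is an assumed sketch, not part of the original slides; its structure and field names are illustrative.

```python
# Illustrative machine-readable version of the codebook excerpt above.
CODEBOOK = {
    "ID":         {"column": 1, "question": None, "codes": "1 to 20 as coded"},
    "Preference": {"column": 2, "question": 1, "codes": {1: "Weak preference", 7: "Strong preference"}},
    "Quality":    {"column": 3, "question": 2, "codes": {1: "Poor", 7: "Excellent"}},
    "Quantity":   {"column": 4, "question": 3, "codes": {1: "Poor", 7: "Excellent"}},
    "Value":      {"column": 5, "question": 4, "codes": {1: "Poor", 7: "Excellent"}},
    "Service":    {"column": 6, "question": 5, "codes": {1: "Poor", 7: "Excellent"}},
    "Income":     {"column": 7, "question": 6, "codes": {
        1: "Less than $20,000", 2: "$20,000 to 34,999", 3: "$35,000 to 49,999",
        4: "$50,000 to 74,999", 5: "$75,000 to 99,999", 6: "$100,000 or more"}},
}

# Example use: look up what an Income code of 3 means.
print(CODEBOOK["Income"]["codes"][3])   # -> "$35,000 to 49,999"
```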
SPSS Variable View of the Data of the Table

[SPSS Variable View screenshot not reproduced.]
KEYPUNCHING THE DATA / DATA TRANSCRIPTION
• Transcribing data is the process of transferring the coded data from the questionnaires or coding sheets onto disks or magnetic tapes, or directly into computers by keypunching.
[Flow diagram: raw data are transcribed via CATI/CAPI, keypunching via a CRT terminal (followed by verification to correct keypunching errors), mark sense forms, optical scanning, or computerized sensory analysis, and the transcribed data are stored in computer memory, on disks, or on magnetic tapes.]
Data Cleaning
• Consistency Checks
- Consistency checks identify data that are out of range, logically inconsistent, or have extreme values.
- Computer packages like SPSS, SAS, Excel, and Minitab can be programmed to identify out-of-range values for each variable and print out the respondent code, variable code, variable name, record number, column number, and out-of-range value (a programmed check is sketched below).
- Extreme values should be closely examined.
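The slides name SPSS, SAS, Excel, and Minitab; the following sketch uses pandas as an assumed stand-in to show how such an out-of-range check can be programmed. The data values and the VALID_RANGES mapping are illustrative.

```python
# Illustrative out-of-range consistency check on restaurant preference data.
import pandas as pd

# Valid ranges per variable: the 1-7 rating scales and the 1-6 income codes.
VALID_RANGES = {"PREFERENCE": (1, 7), "QUALITY": (1, 7), "QUANTITY": (1, 7),
                "VALUE": (1, 7), "SERVICE": (1, 7), "INCOME": (1, 6)}

def consistency_check(df):
    """Report respondent ID, variable name, and each out-of-range value."""
    for variable, (low, high) in VALID_RANGES.items():
        bad = df[(df[variable] < low) | (df[variable] > high)]
        for _, row in bad.iterrows():
            print(f"Respondent {row['ID']}: {variable} = {row[variable]} is out of range")

# Tiny illustrative data set with one deliberate error (QUALITY = 9).
df = pd.DataFrame({"ID": [1, 2], "PREFERENCE": [2, 6], "QUALITY": [2, 9],
                   "QUANTITY": [3, 6], "VALUE": [1, 5], "SERVICE": [3, 7],
                   "INCOME": [6, 2]})
consistency_check(df)   # -> Respondent 2: QUALITY = 9 is out of range
```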
Data Cleaning
• Treatment of Missing Responses (illustrated in the sketch below)
- Substitute a Neutral Value – A neutral value, typically the mean response to the variable, is substituted for the missing responses.
- Substitute an Imputed Response – The respondent's pattern of responses to other questions is used to impute or calculate a suitable response to the missing questions.
- In casewise deletion, cases (respondents) with any missing responses are discarded from the analysis.
- In pairwise deletion, instead of discarding all cases with any missing values, the researcher uses only the cases or respondents with complete responses for each calculation.
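The following pandas sketch (an assumed illustration, not the slides' own code) shows the neutral-value substitution, casewise deletion, and pairwise deletion options side by side.

```python
# Illustrative treatment of missing responses with pandas.
import pandas as pd

df = pd.DataFrame({"QUALITY": [2, None, 4, 6], "SERVICE": [3, 7, None, 5]})

# Substitute a neutral value: replace each missing value with the variable mean.
neutral = df.fillna(df.mean())

# Casewise deletion: drop any respondent with at least one missing response.
casewise = df.dropna()

# Pairwise deletion: each calculation uses only the cases that are complete for
# the variables involved (pandas does this by default when computing correlations).
pairwise_corr = df.corr()

print(neutral, casewise, pairwise_corr, sep="\n\n")
```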
Statistically Adjusting the Data
• Weighting
• In weighting, each case or respondent in the database is assigned a weight to reflect its importance relative to other cases or respondents.
• Weighting is most widely used to make the sample data more representative of a target population on specific characteristics (see the table and the sketch that follow).
• Yet another use of weighting is to adjust the sample so that greater importance is attached to respondents with certain characteristics.
Statistically Adjusting the Data
Use of Weighting for Representativeness
Years of Education        Sample Percentage   Population Percentage   Weight

Elementary School
0 to 7 years 2.49 4.23 1.70
8 years 1.26 2.19 1.74

High School
1 to 3 years 6.39 8.65 1.35
4 years 25.39 29.24 1.15

College
1 to 3 years 22.33 29.42 1.32
4 years 15.02 12.01 0.80
5 to 6 years 14.94 7.36 0.49
7 years or more 12.18 6.90 0.57

Totals 100.00 100.00
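As the table shows, each weight is the population percentage divided by the sample percentage for that group. A minimal Python sketch of that calculation follows; it is not from the slides, and the group labels are shortened for readability.

```python
# Illustrative computation of representativeness weights:
# weight = population percentage / sample percentage.
sample_pct = {"0 to 7 years": 2.49, "8 years": 1.26, "HS 1 to 3 years": 6.39,
              "HS 4 years": 25.39, "College 1 to 3 years": 22.33,
              "College 4 years": 15.02, "College 5 to 6 years": 14.94,
              "College 7+ years": 12.18}
population_pct = {"0 to 7 years": 4.23, "8 years": 2.19, "HS 1 to 3 years": 8.65,
                  "HS 4 years": 29.24, "College 1 to 3 years": 29.42,
                  "College 4 years": 12.01, "College 5 to 6 years": 7.36,
                  "College 7+ years": 6.90}

weights = {group: round(population_pct[group] / sample_pct[group], 2)
           for group in sample_pct}
print(weights["0 to 7 years"])   # -> 1.7, matching the first row of the table
```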


Statistically Adjusting the Data
• Variable Respecification
• Variable respecification involves the transformation of data to create new variables or modify existing ones.
• For example, the researcher may create new variables that are composites of several other variables.
• Dummy variables are used for respecifying categorical variables. The general rule is that to respecify a categorical variable with K categories, K - 1 dummy variables are needed (see the table and the sketch that follow).
Statistically Adjusting the Data
Product Usage Category    Original Variable Code    X1    X2    X3

Nonusers                  1                         1     0     0
Light users               2                         0     1     0
Medium users              3                         0     0     1
Heavy users               4                         0     0     0

Note that X1 = 1 for nonusers and 0 for all others. Likewise, X2 = 1 for light users and 0 for all others, and X3 = 1 for medium users and 0 for all others. In analyzing the data, X1, X2, and X3 are used to represent all user/nonuser groups.
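The following pandas sketch (an assumed illustration, not from the slides) reproduces the K - 1 dummy coding in the table above, with heavy users as the reference category.

```python
# Illustrative respecification of a 4-category usage variable into 3 dummies.
import pandas as pd

usage = pd.Series(["Nonusers", "Light users", "Medium users", "Heavy users"],
                  name="usage")

# Keep only K - 1 = 3 dummy columns; "Heavy users" becomes the reference
# category (all dummies 0), matching the table above.
dummies = pd.get_dummies(usage)[["Nonusers", "Light users", "Medium users"]]
print(pd.concat([usage, dummies.astype(int)], axis=1))
```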
Statistically Adjusting the Data

• Scale Transformation and Standardization:

- Scale transformation involves a manipulation of scale values to ensure comparability with other scales or otherwise make the data suitable for analysis.

- A more common transformation procedure is standardization. Standardized scores, Zi, may be obtained as (a sketch follows below):

  Zi = (Xi - X̄) / sx

  where X̄ is the sample mean and sx the sample standard deviation of the variable X.
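A minimal Python sketch of the standardization formula follows; it is an assumed illustration, and the rating values are made up for the example.

```python
# Illustrative standardization: Zi = (Xi - mean) / standard deviation.
quality = [2, 5, 4, 2, 6, 4]
mean = sum(quality) / len(quality)
# Sample standard deviation (n - 1 in the denominator).
sx = (sum((x - mean) ** 2 for x in quality) / (len(quality) - 1)) ** 0.5

z_scores = [round((x - mean) / sx, 2) for x in quality]
print(z_scores)   # standardized scores with mean 0 and standard deviation 1
```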
A Classification of Univariate Techniques

Univariate Techniques
• Metric Data
  – One Sample: t test, Z test
  – Two or More Samples
    – Independent: two-group t test, Z test, one-way ANOVA
    – Related: paired t test
• Non-numeric Data
  – One Sample: frequency, chi-square, K-S, runs, binomial
  – Two or More Samples
    – Independent: chi-square, Mann-Whitney, median, K-S, K-W ANOVA
    – Related: sign, Wilcoxon, McNemar, chi-square
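For illustration only, the following sketch runs two of the one-sample techniques listed above using SciPy, which the slides do not mention; the metric data are the first ten preference ratings from the restaurant table shown earlier, and the frequency counts are assumptions.

```python
# Illustrative one-sample univariate tests with SciPy.
from scipy import stats

preference = [2, 6, 4, 1, 7, 5, 2, 3, 7, 2]                  # metric ratings
t_stat, p_value = stats.ttest_1samp(preference, popmean=4)   # one-sample t test
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

observed = [12, 18, 20, 10]                                  # non-metric frequencies
chi2, p = stats.chisquare(observed)                          # one-sample chi-square
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")
```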
A Classification of Multivariate Techniques

Multivariate Techniques
• Dependence Techniques
  – One Dependent Variable: cross-tabulation, analysis of variance and covariance, multiple regression, conjoint analysis
  – More Than One Dependent Variable: multivariate analysis of variance and covariance, canonical correlation, multiple discriminant analysis
• Interdependence Techniques
  – Variable Interdependence: factor analysis
  – Interobject Similarity: cluster analysis, multidimensional scaling
