
In-Class Exercise #1 Notes

Uploaded by

CSstudent
Chapter 0: First Things First


DCOVA FRAMEWORK
In order to minimise errors, you use a framework that organises the set of
tasks that you follow to apply statistics correctly:
- Define the data that you want to study to solve a problem.
- Collect the data from appropriate sources.
- Organise the data collected, by developing tables.
- Visualise the data collected, by developing charts.
- Analyse the data collected, reach conclusions and present results.
OPERATIONAL DEFINITIONS
Big Data are data that are being collected in huge volumes and at very fast
rates, typically in near real time.
Unstructured Data have little repeating internal structure and an irregular
pattern, which requires pre-processing prior to analysis.
Variable defines a characteristic or property of an item or individual that can
vary among the occurrences of those items or individuals.
Descriptive Statistics are the methods that primarily help summarise and
present data.
Inferential Statistics are the methods that use data collected from a small
group to reach conclusions about a larger group.
Statistic refers to a value that summarises the data of a particular variable.
Logical Causality means that you can plausibly claim something directly causes
something else.
Data are numerical or textual facts and figures that are collected through some
type of measurement process.
Information is the result of analysing data, i.e. extracting meaning from data to
support evaluation and decision making.
Chapter 1: Defining and Collecting Data
CLASSIFYING VARIABLES BY TYPE
Numerical – variables whose data represent a counted or measured quantity.
Categorical – variables whose data represent categories, e.g. ‘yes’ or ‘no’.
Discrete – variables have data that arise from a counting process.
Continuous – variables have data that arise from a measuring process.
MEASUREMENT SCALES
Measurement Scale defines the ordering of values and determines if
differences among pairs of values for a variable are equivalent and whether
one value can be expressed in terms of another.
Interval Scale expresses an ordered scale in which differences between
measurements are meaningful but which does not include a true zero point.
Ratio Scale expresses an ordered scale that includes a true zero point.
Nominal Scale classifies data where category values express no order or
ranking.
Ordinal Scale classifies data into distinct categories in which ranking is implied.
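
A small sketch of why the true zero point matters. Temperature in degrees Celsius is a commonly used example of an interval scale, and length of a ratio scale; both examples and the helper name c_to_f are illustrative assumptions, not taken from these notes.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

# Interval scale: 0 degrees is an arbitrary point, not "no temperature",
# so the ratio of two readings depends on the unit chosen.
low_c, high_c = 10.0, 20.0
ratio_c = high_c / low_c                    # 2.0 in Celsius
ratio_f = c_to_f(high_c) / c_to_f(low_c)    # 68 / 50 in Fahrenheit

# Ratio scale: 0 cm really means "no length", so the ratio is the same
# in any unit (up to floating-point rounding).
short_cm, long_cm = 10.0, 20.0
ratio_cm = long_cm / short_cm
ratio_inch = (long_cm / 2.54) / (short_cm / 2.54)

print(ratio_c, ratio_f)
print(ratio_cm, ratio_inch)
```

Saying "20 degrees is twice as hot as 10 degrees" is therefore not meaningful, while "20 cm is twice as long as 10 cm" is.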
POPULATIONS AND SAMPLES
A population contains all the items or individuals of interest that one seeks to
study.
A sample contains only a portion of a population of interest.
A population parameter summarises the value of a population for a specific
variable.
A sample statistic summarises the value of a specific variable for sample data.
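
The distinction between a parameter and a statistic can be sketched as follows. The population here is made-up data (ages of a hypothetical club's members), and the mean is only one possible summary value.

```python
import random
from statistics import mean

# Hypothetical population: ages of all 1,000 members of a club (made-up data).
rng = random.Random(0)
population = [rng.randint(18, 80) for _ in range(1000)]

# Population parameter: computed from every item in the population.
population_mean = mean(population)

# Sample statistic: computed from only a portion of the population.
sample = rng.sample(population, 50)
sample_mean = mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```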
SOURCES OF DATA
Primary Data Source is when the data collector is the one using data for
analysis.
Secondary Data Source is when the person performing data analysis is not the
data collector.
A treatment is the change whose effect on a variable of interest researchers
are looking for when they collect data.
TYPES OF SAMPLING METHODS
The frame is a complete or partial listing of the items that make up the
population from which the sample will be selected.
Non-probability sample is where items or individuals are selected without
knowing their probabilities of selection.
Probability sample is where items or individuals are selected based on known
probabilities.
Convenience sample is where items are selected because they are easy,
inexpensive or convenient to sample.
Judgement sample is where the opinions of pre-selected experts in the subject
matter are collected.
SIMPLE RANDOM SAMPLE
- Every individual or item in the frame has an equal chance of being
selected.
- Selection may be with or without replacement.
- Samples are obtained from a table of random numbers or from computer
random number generators.
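
A minimal sketch of both selection modes using a computer random number generator. The frame of 500 ID numbers and the sample size of 20 are assumptions made for illustration.

```python
import random

# Hypothetical frame: ID numbers of 500 items (made up for illustration).
frame = list(range(1, 501))
rng = random.Random(42)  # fixed seed so the draw can be repeated

# Without replacement: an item can appear at most once in the sample.
without_replacement = rng.sample(frame, k=20)

# With replacement: a selected item is "put back" and may be drawn again.
with_replacement = rng.choices(frame, k=20)

print(sorted(without_replacement))
print(sorted(with_replacement))
```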

SYSTEMATIC SAMPLE
- Partition the N items in the frame into n groups of k items, where
k = N / n, rounding k to the nearest integer.
- To select a systematic sample, choose the first item at random from
the first k items in the frame.
- Then select the remaining n - 1 items by taking every kth item
thereafter from the entire frame.
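
The steps above can be sketched as a short function. The name systematic_sample and the frame of 800 items are hypothetical; note also that Python's round() rounds exact halves to the nearest even integer, which may differ from the rounding convention intended in the notes.

```python
import random

def systematic_sample(frame, n, rng=random):
    """Return a systematic sample of about n items from frame (a list)."""
    N = len(frame)
    k = max(round(N / n), 1)      # group size k = N / n, rounded
    start = rng.randrange(k)      # random position within the first k items
    # Take every kth item after the random start; keep at most n items.
    return frame[start::k][:n]

# Hypothetical frame of N = 800 items and a sample of n = 40, so k = 20.
frame = list(range(1, 801))
sample = systematic_sample(frame, n=40, rng=random.Random(1))
print(sample[:5])
```

When N is not a multiple of n, rounding k upward can leave the sample one or more items short of n; this sketch does not correct for that.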
STRATIFIED SAMPLE
- Divide the population into two or more subgroups (known as strata)
according to some common characteristic.
- Select a simple random sample from each subgroup, with sample sizes
proportional to strata sizes.
- Combine the samples from the subgroups into one.
- Application: population of voters.
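
A sketch of proportional stratified sampling for the voter application. The function name stratified_sample, the three strata and their sizes are all assumptions made for illustration.

```python
import random

def stratified_sample(strata, n, rng=random):
    """strata maps a stratum name to its list of items.
    Returns one combined sample with sizes proportional to stratum sizes."""
    total = sum(len(items) for items in strata.values())
    combined = []
    for name, items in strata.items():
        size = round(n * len(items) / total)       # proportional allocation
        combined.extend(rng.sample(items, size))   # simple random sample per stratum
    return combined

# Hypothetical population of 1,000 voters split by area type.
strata = {
    "urban": [("urban", i) for i in range(600)],
    "suburban": [("suburban", i) for i in range(300)],
    "rural": [("rural", i) for i in range(100)],
}
sample = stratified_sample(strata, n=50, rng=random.Random(3))
print(len(sample))
```

With these sizes the allocation is 30, 15 and 5 voters. For other sizes the rounded allocations may not add up to exactly n.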
CLUSTER SAMPLE
- Divide the population into several "clusters", each representative of
the population.
- Select a simple random sample of clusters.
- Use all items in the selected clusters, or choose items from each
selected cluster using another sampling technique.
- Application: election exit polls.
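
A sketch of cluster sampling in the spirit of the exit poll application, keeping every item in each selected cluster. The function name cluster_sample and the ten precincts of 30 voters each are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, rng=random):
    """clusters maps a cluster name to its list of items.
    Selects a simple random sample of clusters and keeps all their items."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    items = [item for name in chosen for item in clusters[name]]
    return chosen, items

# Hypothetical population: 10 polling precincts with 30 voters each.
precincts = {
    f"precinct_{p:02d}": [(f"precinct_{p:02d}", v) for v in range(30)]
    for p in range(10)
}
chosen, voters = cluster_sample(precincts, n_clusters=3, rng=random.Random(5))
print(chosen, len(voters))
```

Unlike the stratified sample, only some groups contribute items here, which is why each cluster needs to be representative of the population.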
DATA CLEANING
- Data Cleaning corrects defects and inconsistencies in the data and
ensures the data are of suitable quality for analysis.
- Invalid Variable Values can be identified as incorrect by simple
scanning techniques, so long as operational definitions exist for the
variables the data represent.
- Coding Errors can result from poor recording or entry of data values, or
from computerised operations such as copy-and-paste or data import.
- Data Integration Errors arise when data from two different
computerised sources, such as two different data repositories, are
combined into one data set for analysis.
- Missing Values are values that were not collected for a variable.
- Outliers are values that seem excessively different from most of the
other values.
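
A sketch of the "simple scanning" idea for missing values, invalid values and coding errors. The records, the variable names and the operational definitions (age between 0 and 120, employed being "yes" or "no") are all assumptions made for illustration. Outliers are left out because the notes give no rule for flagging them.

```python
# Hypothetical survey records; None marks a value that was not collected.
records = [
    {"id": 1, "age": 34,   "employed": "yes"},
    {"id": 2, "age": None, "employed": "no"},
    {"id": 3, "age": 29,   "employed": "Yes "},   # coding error: case and stray space
    {"id": 4, "age": -5,   "employed": "no"},     # invalid: age cannot be negative
    {"id": 5, "age": 41,   "employed": "maybe"},  # invalid: not a defined category
]

# Operational definitions (assumed for this sketch).
VALID_EMPLOYED = {"yes", "no"}
AGE_RANGE = range(0, 121)

def check(record):
    """Return a list of problems found in one record."""
    problems = []
    age = record["age"]
    if age is None:
        problems.append("missing age")
    elif age not in AGE_RANGE:
        problems.append("invalid age")
    employed = record["employed"]
    if employed not in VALID_EMPLOYED:
        if employed.strip().lower() in VALID_EMPLOYED:
            problems.append("coding error in employed")
        else:
            problems.append("invalid employed")
    return problems

# Keep only the records that have at least one problem.
report = {r["id"]: check(r) for r in records if check(r)}
print(report)
```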

DATA PRE-PROCESSING TASKS


- Data Formatting includes rearranging the structure of the data,
changing the electronic encoding of the data, or both.
- Stacking and Unstacking Data – when collecting data for a numerical
variable, subdividing those data into two or more groups for analysis
may be necessary.
- Unstacked Arrangement – create a separate numerical variable for each
group.
- Stacked Arrangement – pair the single numerical variable with a second,
categorical variable that identifies the group of each value.
- Recoding Variables – after data have been collected, categories defined
for a categorical variable may need to be reconsidered, or a numerical
variable may need to be transformed into a categorical variable by
assigning each numeric value to one of several groups.
- A Recoded Variable – one that supplements or replaces the original
variable in your analysis.
- Mutually Exclusive – one and only one of the new categories can be
assigned to any particular value being recoded.
- Collectively Exhaustive – every value can be recoded by one
of your new categories.
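
A sketch of the two arrangements and of a recoding that is both mutually exclusive and collectively exhaustive. The data, the group names and the age bins are assumptions made for illustration.

```python
# Stacked arrangement: one numerical variable paired with a categorical
# variable that identifies the group (hypothetical data).
stacked = [
    ("north", 12.5), ("south", 9.0), ("north", 14.0), ("south", 11.5),
]

# Unstacked arrangement: a separate numerical variable (list) for each group.
unstacked = {}
for group, value in stacked:
    unstacked.setdefault(group, []).append(value)

def recode_age(age):
    """Recode a numerical age into one of three categories.
    The bins do not overlap (mutually exclusive) and together cover
    every non-negative age (collectively exhaustive)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 18:
        return "under 18"
    if age < 65:
        return "18 to 64"
    return "65 and over"

ages = [5, 17, 18, 40, 64, 65, 90]
age_groups = [recode_age(a) for a in ages]

print(unstacked)
print(age_groups)
```

Bins such as "18 to 65" and "65 and over" would break mutual exclusivity, because an age of 65 would fit both.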
TYPES OF SURVEY ERRORS

- Coverage Error occurs if
1. Certain groups of items are excluded from the frame so that they have
no chance of being selected in the sample.
2. Items are included from outside the frame.
Coverage Error may result in a selection bias.

- Nonresponse Error arises from failure to collect data on all items in the
sample and results in a nonresponse bias.

- Sampling Error reflects the variation, or “chance differences”, from
sample to sample, based on the probability of particular individuals or items
being selected in particular samples.
The margin of error reported with a survey result is a measure of this
sampling error.

- Measurement Error can arise when surveys rely on self-reported
information; the mode of data collection and the respondent to the
survey can both be sources of error.
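
The "chance differences" behind sampling error can be seen in a small simulation. The population below is made-up data, and the sample size of 100 and the five repetitions are arbitrary choices for this sketch.

```python
import random
from statistics import mean

rng = random.Random(7)
# Hypothetical population of 10,000 values (made-up data).
population = [rng.gauss(50, 10) for _ in range(10_000)]
population_mean = mean(population)

# Draw several samples of the same size; each gives a different sample mean.
sample_means = [mean(rng.sample(population, 100)) for _ in range(5)]
for m in sample_means:
    print(f"sample mean {m:.2f}, differs from population mean by "
          f"{m - population_mean:+.2f}")
```

No sample is wrong here; the differences arise only from which items happened to be selected.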

ETHICAL ISSUES
Coverage error can result in selection bias and becomes an ethical issue if
particular groups or individuals are purposely excluded from the frame so that
the survey results are more favourable to the survey’s sponsor.

Nonresponse error can lead to nonresponse bias and becomes an ethical issue
if the sponsor knowingly designs the survey so that particular groups or
individuals are less likely than others to respond.

Sampling error becomes an ethical issue if the findings are purposely
presented without reference to sample size and margin of error so that the
sponsor can promote a viewpoint that might otherwise be inappropriate.

Measurement error can become an ethical issue in one of three ways:
(1) A survey sponsor chooses leading questions that guide the respondent in a
particular direction.
(2) An interviewer, through mannerisms and tone, purposely makes a
respondent feel obligated to please the interviewer or otherwise guides the
respondent in a particular direction.
(3) A respondent wilfully provides false information.
