
"The Way of those on whom You have bestowed Your Grace, not (the way) of those who earned Your Anger, nor of those who went astray."
(The Qur'an, Surah Al-Fatihah)
Subject
MULTIVARIATE DATA
ANALYSIS

Presented To: Dr. Muhammad Shoaib

Presented By: Abdullah Afzal

Roll Number: L-21305, MBA (1.5)


TOPIC

What is data screening, and what techniques are available to screen data?
What is Data Screening
• Data screening is essential to make sure you have dealt with assumption violations, outliers, and errors in your data. Each type of analysis calls for its own kind of data screening.
• It is very easy to make mistakes when entering data, and some errors can mess up your analysis.
• So it is worth spending time checking for mistakes up front, rather than trying to repair the damage later. If possible, have another person check your data as well.
TECHNIQUES OF SCREENING DATA:

IN THIS ORDER:
• Accuracy
• Missing data
• Outliers

Assumptions:
• Additivity
• Normality
• Linearity
• Homogeneity / Homoscedasticity
Accuracy:
• Look for problems with the database.
• In general, you are looking for values that fall outside the possible range; check the minimum and maximum of each variable to see whether they match your expectations.
• Fix the error, or delete just that data point. Do not delete the entire case, just the erroneous value.
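For example, a quick accuracy check in R is to look at the summary of each variable; the data frame and variable names below (dataset, item1) are hypothetical, and the 1-7 range is just an assumed scale:

# Minimum and maximum of every variable; compare against expectations
summary(dataset)

# Flag impossible values, e.g. on an assumed 1-7 Likert item
which(dataset$item1 < 1 | dataset$item1 > 7)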
Missing Data:
• Values in a data set are missing completely at random (MCAR) if the events that lead to any particular data item being missing are independent both of observable variables and of unobservable parameters of interest, and occur entirely at random. When data are MCAR, the analysis performed on the data is unbiased; however, data are rarely MCAR. In the case of MCAR, the missingness of data is unrelated to any study variable; thus, the participants with completely observed data are in effect a random sample of all the participants assigned a particular intervention. With MCAR, the random assignment of treatments is assumed to be preserved, but that is usually an unrealistically strong assumption in practice.
Missing at random (MAR)
• Occurs when the missingness is not random, but can be fully accounted for by variables on which there is complete information. Since MAR is an assumption that is impossible to verify statistically, we must rely on its substantive reasonableness.
Missing not at random (MNAR)
• Data that are neither MAR nor MCAR (i.e., the value of the missing variable is related to the reason it is missing). For example, this would occur if men failed to fill in a depression survey because of their level of depression.
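A minimal sketch for locating missing data in R, again assuming a hypothetical data frame called dataset:

# Number of missing values in each variable
colSums(is.na(dataset))

# Percentage of missing values per variable
round(colMeans(is.na(dataset)) * 100, 1)

# Cases (rows) with any missing values
which(rowSums(is.na(dataset)) > 0)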
Univariate
• Univariate is a term commonly used in statistics to describe data that consist of observations on only a single characteristic or attribute. A simple example of univariate data would be the salaries of workers in an industry. Like other data, univariate data can be visualized using graphs, images, or other analysis tools after the data are measured, collected, reported, and analyzed.
Multivariate
• Multivariate outliers refer to records that do not fit the standard pattern of correlations exhibited by the other records in the dataset, with regard to your causal model. So, if all but one person in the dataset reports that dieting has a positive effect on weight loss, but this one person reports that he gains weight when he diets, then his record would be considered a multivariate outlier.
OUTLIERS
• Outliers are cases with extreme values on one variable or on multiple variables. – Univariate outliers: a case is an outlier on one variable. – Multivariate outliers: a case is an outlier across multiple variables; its overall pattern of data is unusual.
OUTLIERS
• You can check for univariate outliers, but when you have a big dataset, that is not usually necessary. – We will cover how to check for univariate outliers when needed, since it really only applies to analyses with a single DV (such as ANOVA).
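When a univariate check is needed, one common rule flags standardized scores beyond |z| = 3. A sketch in R, with dataset$item1 as a hypothetical variable:

# Standardize the variable, then flag cases with |z| > 3
z <- scale(dataset$item1)
which(abs(z) > 3)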
OUTLIERS
• Multivariate outliers: check with Mahalanobis distance. – Mahalanobis distance is the distance of a case from the centroid of the rest of the cases. The centroid is created by plotting the means of all the variables (like an average of averages) and then seeing how far each case's scores are from that middle point.
OUTLIERS
• How to check: – Create Mahalanobis scores. – Use the chi-square function to find the cut-off score (anything past this score is an outlier). – df = the number of variables you are testing (important: the variables you are testing, not all the variables!). – Use the p < .001 value.
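A sketch of these steps in R; the column names are hypothetical, and df is set to the number of variables being tested:

# Mahalanobis distance of each case from the centroid
vars  <- dataset[ , c("item1", "item2", "item3")]   # variables being tested
mahal <- mahalanobis(vars,
                     center = colMeans(vars, na.rm = TRUE),
                     cov    = cov(vars, use = "pairwise.complete.obs"))

# Chi-square cut-off at p < .001, df = number of variables tested
cutoff <- qchisq(1 - .001, df = ncol(vars))
summary(mahal > cutoff)   # TRUE = multivariate outlier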
OUTLIERS
• Regression-based outlier analyses: – Leverage: a score that is far out on the line but does not influence the regression slope (measured by leverage values). – Discrepancy: a score that is far away from everyone else in a way that affects the slope. – Influence: the product of leverage and discrepancy (measured by Cook's values).
OUTLIERS
• What do I do with them when I find them? – Ask yourself: Did they do the study correctly? Are they part of the population you wanted? – Then either eliminate them or leave them in.
Assumptions Checks
•Additivity
•Normality
•Linearity
•Homogeneity
•Homoscedasticity
Additivity
• Additivity is the assumption that each variable adds something to the analysis. Often, this assumption is thought of as the absence of multicollinearity, which is when variables are too highly correlated. – What is too high? General rule: r > .90.
Additivity
• So, we do not want to use two variables in an analysis when they are essentially the same: it lowers power, the analysis may not run, and it wastes time. If you get a "singular matrix" error, you have used two variables in an analysis that are too similar, or the variance of a variable is essentially zero.
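To check, inspect the correlation matrix for |r| > .90; a short R sketch with hypothetical column names:

# Correlation matrix of the variables going into the analysis
corrs <- cor(dataset[ , c("item1", "item2", "item3")],
             use = "pairwise.complete.obs")
round(corrs, 2)

# Flag pairs above the r > .90 rule of thumb (excluding the diagonal)
which(abs(corrs) > .90 & corrs < 1, arr.ind = TRUE)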
Assumption Checks
• A quick note: many statistical tests have their own diagnostic plots and assumption checks built in. This guide lets you apply data screening to any analysis, so that you learn one set of rules rather than a separate set for each analysis. (But there are still checks that apply only to ANOVA, which you would add when you run an ANOVA.)
Assumption Checks
• For ANOVA, t-tests, and correlation, you will use a "fake" regression analysis. It is considered fake because it is not the real analysis, just a way to get the information you need for data screening. For regression-based tests, you can run the real regression analysis to get the same information. The rules are altered slightly, so make sure you note in the regression section what is different.
Assumption Checks
• The random variable and why chi-square: – For many of these assumptions, the errors should be chi-square distributed (i.e., lots of small errors and only a few big ones). – However, the standardized errors should be normally distributed around zero. (Don't confuse these two things: we want the actual error values to be chi-square distributed and the standardized ones to be normal.)
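A minimal sketch of the fake regression in R, assuming a hypothetical data frame dataset containing only the variables being screened; df = 7 for the random variable is an arbitrary but typical choice:

# "Fake" regression: predict a chi-square random variable from your data,
# purely to obtain residuals and fitted values for screening
random <- rchisq(nrow(dataset), df = 7)
fake   <- lm(random ~ ., data = dataset)

standardized <- rstudent(fake)            # standardized residuals
fitted_vals  <- scale(fake$fitted.values) # standardized fitted values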
Normality
•The assumption of normality is that the
sampling distribution is normal. – Not the
sample. – Not the population. – Distribution
bunnies to the rescue!
Normality
• Multivariate normality: each variable, and all linear combinations of the variables, are normally distributed. – Given the Central Limit Theorem, with N > 30 tests are robust (meaning you can violate this assumption and still get reasonably accurate statistics).
Normality
• So, the way to check for normality in practice is to use the sample. – If N > 30, we tend not to worry too much. – If N < 30 and it looks bad, you may consider changing analyses.
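Assuming the standardized residuals from the fake regression sketched above, a quick visual check in R:

# Normality: the standardized residuals should look roughly normal,
# centred on zero
hist(standardized, breaks = 15)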
Linearity
• The assumption that there is a straight-line relationship between two variables (or the combination of all the variables).
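One common visual check is a normal probability plot of the same standardized residuals; points falling close to the reference line suggest linearity is reasonable:

# Linearity: normal probability (Q-Q) plot of standardized residuals
qqnorm(standardized)
abline(0, 1)   # reference line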
HomoG + S
•Homogeneity: equal variances – the
variables (or groups) have roughly equal
variances. Homoscedasticity: spread of the
variance of a variable is the same across all
values of the other variables.
HomoG + S
• How to check them: – Both of these can be checked by looking at the residual scatterplot. – Fitted values = the predicted score for each person in your regression. – Residuals = the difference between the predicted score and a person's actual score in the regression (y − ŷ).
HomoG + S
• How to check them: – We plot them against each other. In theory, the residuals should be randomly distributed (hence why we created a random variable to test with). – Therefore, they should look like a bunch of random dots.
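Continuing the sketch from the fake regression, the residual scatterplot in R:

# Residual scatterplot: standardized fitted values vs. standardized
# residuals; an even, random cloud of dots suggests homogeneity and
# homoscedasticity hold
plot(fitted_vals, standardized)
abline(h = 0)   # reference lines through zero
abline(v = 0)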
R-Tips
• Multiple datasets will be created along the way, so remember to use the right dataset at each step. – Sometimes you won't have a problem at a given step; in that case, just SKIP that step, making sure to use the right dataset afterwards.
