MVN Packages R
Abstract
We previously presented the MVN package (https://cran.r-project.org/web/packages/MVN/index.html)
for assessing multivariate normality, and published an accompanying paper
(https://journal.r-project.org/archive/2014/RJ-2014-031/RJ-2014-031.pdf). Here we
present an updated version of the package. A web tool for the package is available at
http://opensoft.turcosa.com.tr/MVN/.
Similarly, the Iris data can be loaded from the datasets that ship with R.
The Iris data set consists of 150 samples, 50 from each of three Iris species: setosa,
virginica and versicolor. For each sample, four variables were measured: the length
and width of the sepals and petals, in centimeters.
Example I: For simplicity, we will work with a subset of these data that contains only the 50 samples
of setosa flowers, and check the MVN assumption using Mardia’s, Royston’s and Henze-Zirkler’s tests.
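The subset used in Example I can be built with base R alone. The following is a minimal sketch: the object name setosa matches the name passed to mvn later in the text, but the exact subsetting expression is an assumption, since the original loading code is not shown here.

```r
# Load the built-in iris data (part of base R's 'datasets' package).
data(iris)

# Keep the 50 setosa rows and the four numeric measurements.
# Assumption: this is how the `setosa` object used below is obtained.
setosa <- iris[iris$Species == "setosa", 1:4]

dim(setosa)      # number of rows and columns in the subset
head(setosa, 3)  # first few observations
```

Dropping the Species column keeps the object purely numeric, which is what the data argument expects (a numeric matrix or data frame).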
1.1 mvn function
In this section we will introduce our mvn function. This function includes all the arguments to
assess multivariate normality through multivariate normality tests, multivariate plots, multivariate
outlier detection, univariate normality tests and univariate plots.
Argument                    Definition
data                        a numeric matrix or data frame
subset                      name of a grouping variable, if subset analysis is required
mvnTest                     select one of the MVN tests: "mardia" for Mardia’s test, "hz" for
                            Henze-Zirkler’s test, "royston" for Royston’s test, "dh" for
                            Doornik-Hansen’s test and "energy" for the E-statistic. See details
                            for further information.
covariance                  applies to "mardia" and "royston". If TRUE, the covariance matrix
                            is normalized by n; if FALSE, it is normalized by n-1
scale                       if TRUE, scales the columns of data
desc                        a logical argument. If TRUE, calculates descriptive statistics
transform                   select a transformation method for the univariate marginals:
                            logarithm ("log"), square root ("sqrt") or square ("square")
R                           number of bootstrap replicates for the energy test; default is 1000
univariateTest              select one of the univariate normality tests: Shapiro-Wilk ("SW"),
                            Cramer-von Mises ("CVM"), Lilliefors ("Lillie"), Shapiro-Francia
                            ("SF") or Anderson-Darling ("AD")
univariatePlot              select one of the univariate normality plots: Q-Q plot ("qq"),
                            histogram ("histogram"), box plot ("box") or scatter ("scatter")
multivariatePlot            "qq" for chi-square Q-Q plot, "persp" for perspective plot,
                            "contour" for contour plot
multivariateOutlierMethod   select the multivariate outlier detection method: "quan" for the
                            quantile method based on Mahalanobis distance, "adj" for the
                            adjusted quantile method based on Mahalanobis distance
showOutliers                if TRUE, prints multivariate outliers
showNewData                 if TRUE, prints new data without outliers
result <- mvn(data = setosa, mvnTest = "mardia")
result$multivariateNormality
This function performs the multivariate skewness and kurtosis tests at the same time and combines
their results into a single decision on multivariate normality. If both tests indicate multivariate
normality, the data are taken to follow a multivariate normal distribution at the 0.05 significance level.
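To make the combined skewness and kurtosis decision concrete, here is a minimal base R sketch of the textbook form of Mardia’s statistics. The helper mardia_sketch is hypothetical and is not the package’s implementation; in particular it omits any small-sample correction and assumes the covariance matrix is normalized by n.

```r
# Sketch of Mardia's skewness and kurtosis tests in base R.
# `mardia_sketch` is a hypothetical helper, not part of the MVN package.
mardia_sketch <- function(x, alpha = 0.05) {
  x  <- as.matrix(x)
  n  <- nrow(x)
  p  <- ncol(x)
  xc <- scale(x, center = TRUE, scale = FALSE)  # centre each column
  S  <- crossprod(xc) / n                       # covariance normalized by n
  D  <- xc %*% solve(S) %*% t(xc)               # n x n matrix of d_ij values

  b1 <- sum(D^3) / n^2                          # multivariate skewness
  b2 <- sum(diag(D)^2) / n                      # multivariate kurtosis

  # Skewness statistic: approximately chi-square under normality.
  skew_stat <- n * b1 / 6
  skew_df   <- p * (p + 1) * (p + 2) / 6
  skew_p    <- pchisq(skew_stat, df = skew_df, lower.tail = FALSE)

  # Kurtosis statistic: approximately standard normal under normality.
  kurt_stat <- (b2 - p * (p + 2)) / sqrt(8 * p * (p + 2) / n)
  kurt_p    <- 2 * pnorm(abs(kurt_stat), lower.tail = FALSE)

  list(skewness = c(statistic = skew_stat, p.value = skew_p),
       kurtosis = c(statistic = kurt_stat, p.value = kurt_p),
       mvn      = (skew_p > alpha) && (kurt_p > alpha))
}

setosa <- iris[iris$Species == "setosa", 1:4]
res <- mardia_sketch(setosa)
res
```

The mvn element is TRUE only when neither test rejects at the chosen level, which mirrors the rule stated above.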
[Figure: Chi-Square Q-Q Plot. y-axis: Chi-Square Quantile.]
As seen from Figure 2, Petal.Width has a right-skewed distribution whereas the other variables
have approximately normal distributions. Thus, we can conclude that the problems with multivariate
normality arise from the skewed distribution of Petal.Width. In addition to the univariate plots,
one can also perform univariate normality tests using the univariateTest argument of the mvn
function. It provides several widely used univariate normality tests: "SW" for the Shapiro-Wilk
test, "CVM" for the Cramer-von Mises test, "Lillie" for the Lilliefors test, "SF" for the Shapiro-Francia
test and "AD" for the Anderson-Darling test. For example, the following code chunk performs the
Shapiro-Wilk normality test on each variable and also displays descriptive statistics including
the mean, standard deviation, median, minimum, maximum, 25th and 75th percentiles, skewness and
kurtosis:
result <- mvn(data = setosa, mvnTest = "royston", univariateTest = "SW", desc = TRUE)
result$univariateNormality
[Figure 2: Univariate Q-Q plots and histograms for the setosa data. Recovered panel titles: Normal Q-Q Plot (Sepal.Length), Normal Q-Q Plot (Sepal.Width). Axis labels: Sample Quantiles, Density.]
From the above results, we can see that all variables in the setosa data set, except Petal.Width,
have univariate normal distributions at the 0.05 significance level. We can now drop Petal.Width
from the setosa data and recheck multivariate normality. The MVN results are given in Table 2.
According to the three MVN test results in Table 2, setosa without Petal.Width has a
multivariate normal distribution at the 0.05 significance level.
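The variable-by-variable check and the subsequent drop of the non-normal variable can be sketched in base R with shapiro.test from the stats package. The object names sw_table, keep and setosa_reduced are illustrative only and do not come from the package.

```r
setosa <- iris[iris$Species == "setosa", 1:4]

# Shapiro-Wilk test for each column; shapiro.test() is in base R's 'stats'.
sw <- t(sapply(setosa, function(v) {
  res <- shapiro.test(v)
  c(W = unname(res$statistic), p.value = res$p.value)
}))

# Flag variables for which normality is not rejected at the 0.05 level.
sw_table <- data.frame(sw, normal = sw[, "p.value"] > 0.05)
print(sw_table)

# Keep only the variables that pass, then recheck multivariate normality
# on the reduced data (for example, by passing it to mvn).
keep <- rownames(sw_table)[sw_table$normal]
setosa_reduced <- setosa[, keep]
```

Selecting columns from the test results, rather than by name, keeps the sketch reusable for other data sets.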
Example II: Whilst the Q-Q plot is a general approach for assessing MVN in all types of numerical
multivariate datasets, perspective and contour plots can only be used for bivariate data. To
demonstrate the applicability of these two approaches, we will use a subset of the Iris data, named
setosa2, that includes the sepal length and sepal width variables of the setosa species.
# perspective plot
result <- mvn(setosa2, mvnTest = "hz", multivariatePlot = "persp")
# contour plot
result <- mvn(setosa2, mvnTest = "hz", multivariatePlot = "contour")
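As a rough illustration of what a perspective or contour plot is compared against, the sketch below builds setosa2 and evaluates the fitted bivariate normal density on a grid in base R. This is an assumption-laden stand-in: it draws the theoretical normal surface implied by the sample mean and covariance, which is not necessarily the density estimate that mvn itself plots, and the names xs, ys and dens are illustrative.

```r
# Bivariate subset: sepal length and width of the setosa species.
setosa2 <- iris[iris$Species == "setosa", c("Sepal.Length", "Sepal.Width")]

mu    <- colMeans(setosa2)
Sigma <- cov(setosa2)

# Evaluation grid spanning the observed range of each variable.
xs   <- seq(min(setosa2[, 1]), max(setosa2[, 1]), length.out = 40)
ys   <- seq(min(setosa2[, 2]), max(setosa2[, 2]), length.out = 40)
grid <- as.matrix(expand.grid(xs, ys))   # xs varies fastest

# Bivariate normal density: exp(-d^2 / 2) / (2 * pi * sqrt(det(Sigma))),
# where d^2 is the squared Mahalanobis distance from the mean.
d2   <- mahalanobis(grid, center = mu, cov = Sigma)
dens <- matrix(exp(-d2 / 2) / (2 * pi * sqrt(det(Sigma))),
               nrow = length(xs))        # dens[i, j] = f(xs[i], ys[j])
```

The resulting matrix can be passed to persp(xs, ys, dens) or contour(xs, ys, dens). For bivariate normal data, the surface is bell-shaped and the contours are concentric ellipses.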
Since neither the univariate plots in Figure 2 nor the multivariate plots in Figure 3 show any
significant deviation from MVN, we can now perform the MVN tests to evaluate the statistical
significance of the bivariate normality of the setosa2 data set.
All three tests in Table 2 indicate that the data set satisfies the bivariate normality assumption at
the 0.05 significance level. Moreover, the perspective and contour plots agree with the
test results and indicate approximate bivariate normality.
[Figure 3 image: perspective plot (axes: Sepal.Length, Sepal.Width, Density) and contour plot (axes: Sepal.Length, Sepal.Width).]
Figure 3: Perspective and contour plot for bivariate setosa2 data set.
Figures 3a and 3b were drawn with graphical options pre-defined by the authors. Users may
override these options by setting the argument default = FALSE. When default is
FALSE, optional arguments of the plot, persp and contour functions can be passed on to the
corresponding graphs.
Mahalanobis Distance:
2. Compute the 97.5 percent adjusted quantile (AQ) of the chi-square distribution,
multivariate outliers as given below. It also returns a new data set in which declared outliers are
removed. Moreover, this argument creates Q-Q plots for visual inspection of the possible outliers.
For this example, we will use another subset of the Iris data, the versicolor flowers, with
the first three variables.
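The quantile idea behind this kind of outlier detection can be sketched in base R: compute squared Mahalanobis distances and compare them with the 97.5 percent chi-square quantile. The sketch below is a simplification that uses the classical mean and covariance, which is an assumption; a robust estimator would generally give different distances and may flag different observations. The names d2, cutoff, outliers and new_data are illustrative.

```r
# Versicolor flowers, first three variables.
versicolor <- iris[iris$Species == "versicolor", 1:3]

# Squared Mahalanobis distances from the classical mean and covariance.
# Assumption: classical estimates are used here for simplicity only.
d2 <- mahalanobis(versicolor, colMeans(versicolor), cov(versicolor))

# 97.5 percent quantile of the chi-square distribution,
# with degrees of freedom equal to the number of variables.
cutoff <- qchisq(0.975, df = ncol(versicolor))

# Observations beyond the cutoff are declared outliers.
outliers <- rownames(versicolor)[d2 > cutoff]
new_data <- versicolor[d2 <= cutoff, ]   # data with flagged rows removed
```

With three variables the cutoff evaluates to about 9.348, which agrees with the quantile label shown in the outlier plot.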
[Figure: two chi-square Q-Q plots for multivariate outlier detection. Panel legends: Outliers (n=2), Non-outliers (n=48); Outliers (n=0), Non-outliers (n=50). Cutoff label: Quantile: 9.348. Labeled observations: 84, 99. y-axis: Chi-Square Quantile.]
head(iris)
result <- mvn(data = iris, subset = "Species", mvnTest = "hz")
result$multivariateNormality
## $setosa
## Test HZ p value MVN
## 1 Henze-Zirkler 0.9488453 0.04995356 NO
##
## $versicolor
## Test HZ p value MVN
## 1 Henze-Zirkler 0.8388009 0.2261991 YES
##
## $virginica
## Test HZ p value MVN
## 1 Henze-Zirkler 0.7570095 0.4970237 YES
According to the Henze-Zirkler test results, the setosa data do not follow a multivariate
normal distribution, whereas the versicolor and virginica data do. Note that the setosa p-value
(0.04995) lies only marginally below the 0.05 threshold.
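Conceptually, a subset analysis amounts to splitting the data by the grouping variable and applying the same analysis to each piece. The base R sketch below illustrates that pattern with a per-variable Shapiro-Wilk test standing in for the multivariate test. The stand-in test and the names groups and per_group are assumptions for illustration, not the package’s internals.

```r
# Split the four measurements by species: one data frame per group.
groups <- split(iris[, 1:4], iris$Species)

# Hypothetical per-group analysis: Shapiro-Wilk p-value for each variable.
# Any other test function could be substituted here.
per_group <- lapply(groups, function(g) {
  sapply(g, function(v) shapiro.test(v)$p.value)
})

# One row per species, one column per variable.
p_table <- round(do.call(rbind, per_group), 4)
p_table
```

The result is a named list keyed by group level, the same shape as the $setosa, $versicolor and $virginica entries in the output above.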