Assignment 2 Group 1 Report
BIOL 520/820
Assignment 2
Report
Group Number: 1
1 Introduction
The low birth weight dataset, provided by Hosmer and Lemeshow (2000), contains maternal
risk factors associated with low birth weight of neonates. A birth weight under 2500 g is
classified as low; otherwise it is classified as normal. The dataset includes 10 variables:
low, an indicator of low versus normal birth weight; smoke, maternal smoking status (smoker
or nonsmoker); race (white, black, or other); age, the age of the mother, from 14 to 45
years; lwt, the mother's weight at the last menstrual period, from 80 to 250 lbs; ptl, the
number of previous premature labours, from 0 to 3; ht, history of hypertension (yes or no);
ui, uterine irritability (yes or no); ftv, the number of physician visits in the first
trimester, from 0 to 6; and bwt, the baby's birth weight, from 709 to 4990 g. Data were
collected at Baystate Medical Center, Springfield, Massachusetts, in 1986.
Our hypothesis is that smoking and uterine irritability are associated with neonatal
weight at birth.
To test this hypothesis we applied several data analysis techniques, which are explained
below.
Linear regression models the relationship between a dependent (outcome) variable and one or
more independent (predictor) variables in the form of a linear equation. It is a
well-established statistical technique that is readily available in statistical software [1].
Companies use it to turn raw data into business intelligence and useful analytics. Scientists
in many fields, including biology and the behavioral, environmental and social sciences, use
linear regression for preliminary data analysis and for predicting future trends. Many data
science methods, including machine learning, also build on linear regression.
Linear regression makes several assumptions about the data [2]:
1. Linearity of the data. The relationship between the predictor (x) and the outcome
(y) is assumed to be linear.
2. Normality of residuals. Normality tests are used to assess whether a dataset is well
modeled by a normal distribution, that is, how plausible it is that the underlying
random variable is normally distributed. They include D'Agostino's K-squared test,
the Jarque–Bera test, the Kolmogorov–Smirnov test adjusted for estimation of the
mean and variance from the data (the Lilliefors test), the Shapiro–Wilk test and
Pearson's chi-squared test [3].
Among them, the Shapiro–Wilk test is useful for checking whether a given dataset comes
from a normal distribution, which is a common assumption in many statistical procedures,
including regression, analysis of variance and t-tests [4].
The QQ (quantile-quantile) plot is a graphical method for comparing two distributions by
plotting their quantiles against each other. Here it compares the sample (empirical)
quantiles with those of a theoretical distribution. If the two distributions are similar,
the points on the QQ plot lie approximately on the line y = x. The main step in
constructing a QQ plot is the calculation or estimation of the quantiles [4].
3. Homogeneity of residual variance. The residuals are assumed to have constant
variance (homoscedasticity), that is, their spread around the regression line is the
same across the range of fitted values. Homoscedasticity is one of the conditions for
the efficiency of the regression estimates [5]. More generally, a sequence or vector
of random variables is homoscedastic if all of its elements have the same variance.
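The assumption checks listed above can be sketched in base R. This is only an illustration on simulated data: the variables x and y, the sample size and the coefficients are assumptions made for the example, not values from the report, and the split-half variance comparison is a rough stand-in for a formal homoscedasticity test.

```r
# Simulated data: all names and numbers here are illustrative assumptions.
set.seed(1)
x <- runif(100, 80, 250)                   # hypothetical predictor
y <- 2500 + 4 * x + rnorm(100, sd = 400)   # hypothetical outcome with normal errors

fit <- lm(y ~ x)
res <- residuals(fit)

# Normality of residuals: Shapiro-Wilk test
sw <- shapiro.test(res)

# QQ comparison: theoretical normal quantiles (x) against sample quantiles (y);
# plot.it = FALSE returns the coordinates instead of drawing the plot
qq <- qqnorm(res, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)   # close to 1 when the points lie near a straight line

# Rough homoscedasticity check: compare residual variance between the
# lower and upper halves of the fitted values
upper_half <- fitted(fit) > median(fitted(fit))
vt <- var.test(res[upper_half], res[!upper_half])

sw$p.value
qq_cor
vt$p.value
```

A large Shapiro-Wilk p-value and a QQ correlation near 1 are consistent with normal residuals; a large p-value from the variance comparison gives no evidence against constant variance.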
Analysis of variance (ANOVA) is used to study the influence of one or more qualitative
variables (factors) on one quantitative dependent variable.
The main purpose of ANOVA is to test the significance of differences between group means
by comparing variances [6]. Partitioning the total variance into several sources makes it
possible to compare the variance due to differences between groups with the variance due to
within-group variability. If the null hypothesis is true (the means of the groups of
observations drawn from the population are equal), the estimate of the between-group
variance should be close to the estimate of the within-group variance. When the means of
only two samples are compared, ANOVA gives the same result as the usual t-test for
independent samples (if two independent groups of objects or observations are compared) or
the t-test for dependent samples (if two variables are compared on the same set of objects
or observations).
The essence of ANOVA is to divide the total variance of the studied trait into components
attributable to specific factors, and to test hypotheses about the significance of the
influence of these factors on the trait. By comparing the variance components with each
other using Fisher's F-test, it is possible to determine what proportion of the total
variability of the outcome is due to the factors under study [6].
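The partition of the total variance described above can be made concrete with a short sketch in base R. The three groups, their means and the sample sizes are simulated assumptions for illustration only; the point is that the between-group and within-group sums of squares add up to the total, and that their ratio (after dividing by degrees of freedom) is the F statistic reported by anova().

```r
# Simulated data: group labels and means are illustrative assumptions.
set.seed(2)
g <- factor(rep(c("A", "B", "C"), each = 20))
true_means <- c(A = 3000, B = 2800, C = 2900)
y <- rnorm(60, mean = true_means[as.character(g)], sd = 500)

grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)     # one mean per factor level
group_sizes <- tapply(y, g, length)

ss_total   <- sum((y - grand_mean)^2)
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((y - ave(y, g))^2)  # ave() gives each observation its group mean

k <- nlevels(g)
n <- length(y)
f_manual <- (ss_between / (k - 1)) / (ss_within / (n - k))

# The same F statistic from the built-in ANOVA table
f_anova <- anova(lm(y ~ g))[["F value"]][1]

c(ss_total = ss_total, ss_between = ss_between, ss_within = ss_within)
c(f_manual = f_manual, f_anova = f_anova)
```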
The input for ANOVA is data from three or more samples, x_1, ..., x_n, which may be equal
or unequal in size and either related or independent [6]. Depending on the number of
factors studied, ANOVA can be one-way (the influence of one factor on the outcome), two-way
(the influence of two factors) or multifactorial (which evaluates not only the influence of
each factor separately but also their interaction).
- One-way ANOVA analyses a quantitative dependent variable in terms of a single factor
(independent) variable and can also be used to estimate the effect size. It tests the
hypothesis that the means of the groups defined by the levels of the factor are
equal [6]. This method is an extension of the two-sample t-test.
- Two-way ANOVA is used to determine whether there is a statistically significant
difference between the means of three or more independent groups that are
cross-classified by two variables (sometimes called "factors"). It is used to find out
how each of the two factors affects a response variable and whether there is an
interaction effect between the two factors on the response variable [6].
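The statement that one-way ANOVA extends the two-sample t-test can be checked numerically: with exactly two groups and equal variances assumed, the ANOVA F statistic equals the square of the t statistic. The sketch below uses simulated data; the group labels, sizes and means are assumptions for illustration.

```r
# Simulated two-group data: all values are illustrative assumptions.
set.seed(3)
grp <- factor(rep(c("No", "Yes"), times = c(30, 25)))
y <- rnorm(55, mean = ifelse(grp == "Yes", 2700, 3000), sd = 600)

# Classical two-sample t-test with pooled variance
tt <- t.test(y ~ grp, var.equal = TRUE)
t_stat <- unname(tt$statistic)

# One-way ANOVA on the same data
aov_table <- summary(aov(y ~ grp))[[1]]
f_stat <- aov_table[["F value"]][1]
p_anova <- aov_table[["Pr(>F)"]][1]

c(t_squared = t_stat^2, F = f_stat)
c(p_ttest = tt$p.value, p_anova = p_anova)
```

Both the statistic (after squaring) and the p-value agree, which is why the two procedures give the same conclusion for two groups.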
Pairwise comparison is the process of comparing objects in pairs to determine which one is
preferred, which has more of some quantitative property, or whether the two objects are
identical. The method is used in the scientific study of preferences, relationships, voting
systems, social choice, public choice, requirements engineering and multi-agent systems.
Pairwise multiple comparisons test the difference between each pair of means and produce a
matrix in which asterisks mark group means that differ significantly at an alpha level of
0.05 [7].
When pairwise comparisons of group means are made with the Bonferroni test, a t-test is
used for each pair, but to control the overall error rate the significance level for each
test is set to the overall level divided by the total number of tests. The confidence
intervals and the significance level are thus adjusted to take into account the multiple
comparisons being made [8].
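As a small worked example of the Bonferroni adjustment, the sketch below uses four made-up p-values (they are assumptions, not results from this report). Dividing the significance level by the number of tests is equivalent to multiplying each p-value by the number of tests, capped at 1, which is what p.adjust() does.

```r
# Hypothetical unadjusted p-values from four pairwise tests
p_raw <- c(0.004, 0.020, 0.030, 0.450)
m <- length(p_raw)
alpha <- 0.05

# Bonferroni-adjusted p-values
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Two equivalent decision rules
reject_per_test <- p_raw < alpha / m   # compare raw p-values with alpha / m
reject_adjusted <- p_bonf < alpha      # compare adjusted p-values with alpha

data.frame(p_raw, p_bonf, reject_adjusted)
```

In this made-up example only the first comparison remains significant after adjustment.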
Logistic regression is a data analysis technique that models the relationship between a
categorical outcome and one or more predictor variables. This relationship is then used to
predict the outcome from the predictors. The prediction usually has a finite number of
possible results, such as "yes" or "no" [9].
Logistic regression is useful when you want to predict the presence or absence of a
characteristic or outcome based on the values of a set of predictor variables. It is
similar to the linear regression model, but is suited to models where the dependent
variable has only two values. Logistic regression coefficients can be used to estimate the
odds ratio for each independent variable in the model. Logistic regression is applicable to
a wider range of situations than discriminant analysis [10].
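To illustrate how a logistic regression coefficient becomes an odds ratio, here is a minimal sketch on simulated data. The variable names smoke and low mirror the dataset, but the values, the sample size and the assumed true coefficients are invented for the example. With a single binary predictor, the fitted odds ratio coincides with the cross-product ratio of the 2x2 table.

```r
# Simulated data: sample size and true coefficients are illustrative assumptions.
set.seed(4)
n <- 500
smoke <- rbinom(n, 1, 0.4)                  # hypothetical binary predictor
low <- rbinom(n, 1, plogis(-1 + 0.9 * smoke))  # outcome generated on the log-odds scale

fit <- glm(low ~ smoke, family = binomial)

# Exponentiating a coefficient gives the odds ratio for that predictor
odds_ratio <- unname(exp(coef(fit))["smoke"])

# The same quantity from the 2x2 table: (a * d) / (b * c)
tab <- table(smoke, low)
or_table <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

# Predicted probabilities of the outcome for nonsmokers and smokers
pred <- predict(fit, newdata = data.frame(smoke = c(0, 1)), type = "response")

c(odds_ratio = odds_ratio, or_table = or_table)
pred
```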
Next, we analyzed whether there is a two-way interaction between uterine irritability and
smoking status in explaining the birth weight of neonates.
Two-way ANOVA is used to evaluate simultaneously the effect of two grouping variables on a
continuous outcome variable; here, the effect of smoking status (smoker or nonsmoker) and
the presence or absence of uterine irritability on birthweight.
To perform the two-way ANOVA we recoded the risk factors as categorical variables.
We built a linear model of birthweight with the two-way interaction between uterine
irritability and smoking status. In the R code, the asterisk indicates that both main
effects and their interaction are included in the model.
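The meaning of the asterisk in an R model formula can be inspected without any data. The sketch below expands the formula used in the appendix script and shows that it contains the two main effects plus the interaction term.

```r
# Expand the model formula to see which terms it contains
f <- bwt ~ smoke * ui
term_labels <- attr(terms(f), "term.labels")
print(term_labels)

# The asterisk is shorthand for main effects plus interaction
explicit <- attr(terms(bwt ~ smoke + ui + smoke:ui), "term.labels")
identical(term_labels, explicit)
```

The interaction alone is written smoke:ui; the two-way ANOVA tests that term separately from the two main effects.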
2.2 Visualization
The box plot of birthweight by uterine irritability level of the mothers, faceted by
smoking status (Fig. 1), shows differences in birthweight between the groups. We were
specifically interested in whether the combination of smoking and UI is associated with
low birthweight.
Fig. 1 A box plot of the birthweight by uterine irritability levels of mothers, faceted by
smoking status
2.4 Normality
In the QQ plot (Fig. 2), all residual points fall approximately along the reference
line, so we can assume a normal distribution. This conclusion is supported by the
Shapiro-Wilk test: the p-value is not significant (p = 0.41), so we can assume normality.
Fig. 2 QQ plot of the model residuals
An analysis of simple main effects for smoking and UI was performed independently, with
statistical significance assessed after a Bonferroni adjustment. The pairwise comparison
test was significant for both maternal risk factors, smoking and uterine irritability.
Birthweight   No UI   Yes UI
Normal        116     14
Low           45      14
Table 5 Mothers' history of hypertension (ht) during pregnancy and babies' birthweight
Birthweight   No HT   Yes HT
Normal        124     5
Low           52      7
Call:
glm(formula = low ~ smoke + ht + age + race + ui, family = binomial,
    data = lbw)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6886  -0.8586  -0.5869   1.1020   2.0684

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.71234    0.99298  -1.724  0.08463 .
smoke        1.06901    0.38151   2.802  0.00508 **
ht           1.39636    0.62557   2.232  0.02561 *
age         -0.03422    0.03436  -0.996  0.31933
race         0.51975    0.20718   2.509  0.01212 *
ui           0.95361    0.44191   2.158  0.03093 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
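The coefficients in the output above are on the log-odds scale. As a sketch of how they could be read as odds ratios, the code below exponentiates the reported estimates and builds approximate 95% Wald intervals from the reported standard errors. The Wald interval is an assumption made here for illustration; it is not part of the original analysis.

```r
# Estimates and standard errors copied from the glm summary above
est <- c(smoke = 1.06901, ht = 1.39636, age = -0.03422, race = 0.51975, ui = 0.95361)
se  <- c(smoke = 0.38151, ht = 0.62557, age = 0.03436, race = 0.20718, ui = 0.44191)

# Approximate 95% Wald interval: estimate +/- z * SE, then exponentiate
z <- qnorm(0.975)
or_table <- cbind(
  OR    = exp(est),
  lower = exp(est - z * se),
  upper = exp(est + z * se)
)
round(or_table, 2)
```

Under this approximation, smoking corresponds to an odds ratio of about 2.9 for low birth weight, while the interval for age includes 1, in line with its non-significant p-value.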
Conclusion
The dataset contains observations on 189 women and 10 variables. It was collected at
Baystate Medical Center, Springfield, Massachusetts, in 1986.
During the study, we evaluated maternal risk factors. First, by fitting a linear model to
the data, we found that smoking and uterine irritability were significantly associated
with birth weight. We then examined these two variables with a box plot of birthweight by
uterine irritability level of the mothers, faceted by smoking status. To check the
normality assumption, we used a QQ plot and the Shapiro-Wilk test; the results were
consistent with normally distributed residuals. We then ran a two-way ANOVA, which showed
no statistically significant interaction between smoking status and uterine irritability
in their association with birth weight. Finally, we examined the main effects with
pairwise comparison tests between groups. The pairwise comparison tests were significant
for both smoking and uterine irritability.
Our hypothesis was that smoking and uterine irritability are associated with neonates'
weight at birth. The results support this hypothesis: smoking and uterine irritability
were each significantly associated with birth weight. However, their interaction was not
significant.
References
1. Kasza, J., & Wolfe, R. (2014). Interpretation of commonly used statistical regression
models. Respirology (Carlton, Vic.), 19(1), 14–21.
https://doi.org/10.1111/resp.12221
2. Hickey, G. L., Kontopantelis, E., Takkenberg, J. J. M., & Beyersdorf, F. (2019).
Statistical primer: checking model assumptions with regression diagnostics.
Interactive cardiovascular and thoracic surgery, 28(1), 1–8.
https://doi.org/10.1093/icvts/ivy207
3. Casson, R. J., & Farmer, L. D. (2014). Understanding and checking the assumptions
of linear regression: a primer for medical researchers. Clinical & experimental
ophthalmology, 42(6), 590–596. https://doi.org/10.1111/ceo.12358
4. Vetter, T. R. (2017). Fundamentals of Research Data and Variables: The Devil Is in
the Details. Anesthesia and analgesia, 125(4), 1375–1380.
https://doi.org/10.1213/ANE.0000000000002370
5. O'Neill, M. E., & Mathews, K. L. (2002). Levene tests of homogeneity of variance
for general block and treatment designs. Biometrics, 58(1), 216–224.
https://doi.org/10.1111/j.0006-341x.2002.00216.x
6. Mishra, P., Singh, U., Pandey, C. M., Mishra, P., & Pandey, G. (2019). Application
of student's t-test, analysis of variance, and covariance. Annals of cardiac
anaesthesia, 22(4), 407–411. https://doi.org/10.4103/aca.ACA_94_19
7. Tarrow, S. (2010). The Strategy of Paired Comparison: Toward a Theory of
Practice. Comparative Political Studies, 43(2), 230–259.
https://doi.org/10.1177/0010414009350044
8. Narum, S. R. (2006). Beyond Bonferroni: Less conservative analyses for conservation
genetics. Conservation Genetics, 7, 783–787.
https://doi.org/10.1007/s10592-005-9056-y
9. Sperandei, S. (2014). Understanding logistic regression analysis. Biochemia medica,
24(1), 12–18. https://doi.org/10.11613/BM.2014.003
10. Wang, Q. Q., Yu, S. C., Qi, X., Hu, Y. H., Zheng, W. J., Shi, J. X., & Yao, H. Y.
(2019). Zhonghua yu fang yi xue za zhi [Chinese journal of preventive medicine],
53(9), 955–960. https://doi.org/10.3760/cma.j.issn.0253-9624.2019.09.018
Appendix. R Script
library(haven)
library(carData)
library(car)
library(datarium)
library(tidyverse)
library(ggpubr)
library(rstatix)
library(readr)
library(ggplot2)
library(foreign)
library(corrplot)
library(olsrr)
source("http://www.sthda.com/upload/rquery_cormat.r")
set.seed(123)
lbw <- read_csv("Desktop/Data/lbw.csv")
View(lbw)
#1
## Recoding lbw: convert coded variables to labelled factors
lbw <- within(lbw, {
  ## Relabel race (kept as a separate column; race itself is recoded below)
  race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))
  ## Categorize smoke ht ui
  smoke <- factor(smoke, levels = 0:1, labels = c("Not smoke","Smoke"))
  ht <- factor(ht, levels = 0:1, labels = c("No","Yes"))
  ui <- factor(ui, levels = 0:1, labels = c("No","Yes"))
  low <- factor(low, levels = 0:1, labels = c("Normal","Low bw"))
  race <- factor(race, levels = 1:3, labels = c("White","Black", "Other"))
})
#Visualization. boxplot
bxp <- ggboxplot(
  lbw, x = "ui", y = "bwt", palette = "jco", facet.by = "smoke")
bxp
#2.3.Assumptions
#2.3.2 outliers
lbw %>%
  group_by(smoke, ui) %>%
  identify_outliers(bwt)
#2.3.3 Normality
lbwmodel <- lm(bwt ~ smoke*ui, data = lbw)
# Create a QQ plot of residuals
ggqqplot(residuals(lbwmodel))
lbw %>%
  group_by(smoke, ui) %>%
  shapiro_test(bwt)
#ANOVA 2 way
res.aov <- lbw %>% anova_test(bwt~smoke*ui)
res.aov
# pairwise comparisons
lbw %>%
  pairwise_t_test(
    bwt ~ smoke,
    p.adjust.method = "bonferroni")
lbw %>%
  pairwise_t_test(
    bwt ~ ui,
    p.adjust.method = "bonferroni")
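View(lbw)
# Contingency tables; data = lbw is needed so the variables are found in the data frame
xtabs(~low+race, data=lbw)
xtabs(~low+age, data=lbw)
xtabs(~low+smoke, data=lbw)
xtabs(~low+ui, data=lbw)
xtabs(~low+ht, data=lbw)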
#Logistic regression
logistic<-glm(low~smoke+ht+age+race+ui, data=lbw, family=binomial)
summary(logistic)
#ANOVA 1 way
smlowAnova <- lbw %>% anova_test(low~smoke)
smlowAnova