9 Review of Discrete Data Analysis: Comparing Two Proportions: Independent Samples

9
REVIEW OF DISCRETE DATA ANALYSIS
Review of Discrete Data Analysis
The material in this section was covered last semester. Since Stata diers from Minitab in how the methods are implemented, we will review those methods and see how to use Stata for them. The huge dierence from what we have been doing is that the response or outcome variable is now categorical instead of continuous. Our goal is to extend all the t-test, regression, ANOVA, and ANCOVA methods we have studied to the case of categorical outcomes.
Comparing Two Proportions: Independent Samples

The New Mexico state legislature is interested in how the proportion of registered voters that support Indian gaming diers between New Mexico and Colorado. Assuming neither population proportion is known, the states statistician might recommend that the state conduct a survey of registered voters sampled independently from the two states, followed by a comparison of the sample proportions in favor of Indian gaming. Statistical methods for comparing two proportions using independent samples can be formulated as follows. Let p1 and p2 be the proportion of populations 1 and 2, respectively, with the attribute of interest. Let p 1 and p 2 be the corresponding sample proportions, based on independent random or representative samples of size n1 and n2 from the two populations.
Large Sample CI and Tests for p1 p2

A large sample CI for p1 p2 is ( p1 p 2 ) zcrit SECI ( p1 p 2 ), where zcrit is the standard normal critical value for the desired condence level, and SECI ( p1 p 2 ) = p 1 (1 p 1 ) p 2 (1 p 2 ) + n1 n2
is the CI standard error. A large sample p-value for a test of the null hypothesis H0 : p1 p2 = 0 against the two-sided alternative HA : p1 p2 = 0 is evaluated using tail areas of the standard normal distribution (identical to 1 sample evaluation) in conjunction with the test statistic zs = where SEtest ( p1 p 2 ) = p (1 p ) p (1 p ) + = n1 n2 p (1 p ) 1 1 + n1 n2 p 1 p 2 , SEtest ( p1 p 2 )
is the test standard error for p 1 p 2 . The pooled proportion p = n1 p 1 + n2 p 2 n1 + n2
is the proportion of successes in the two samples combined. The test standard error has the same functional form as the CI standard error, with p replacing the individual sample proportions. The pooled proportion is the best guess at the common population proportion when H0 : p1 = p2 is true. The test standard error estimates the standard deviation of p 1 p 2 assuming H0 is true. Example Two hundred and seventy nine French skiers were studied during two one-week periods in 1961. One group of 140 skiers receiving a placebo each day, and the other 139 receiving 1 gram of ascorbic acid (Vitamin C) per day. The study was double blind - neither the subjects 94
nor the researchers knew who received what treatment. Let p1 be the probability that a member of the ascorbic acid group contracts a cold during the study period, and p2 be the corresponding probability for the placebo group. Linus Pauling and I are interested in testing whether p1 = p2 . The data are summarized below as a two-by-two table of counts (a contingency table) Outcome # with cold # with no cold Totals Ascorbic Acid 17 122 139 Placebo 31 109 140
The sample sizes are n1 = 139 and n2 = 140. The sample proportion of skiers developing colds in the placebo and treatment groups are p 2 = 31/140 = .221 and p 1 = 17/139 = .122, respectively. The pooled proportion is the number of skiers that developed colds divided by the number of skiers in the study: p = 48/279 = .172. The test standard error is: SEtest ( p1 p 2 ) = The test statistic is .172 (1 .172) 1 1 + 139 140 = .0452.
.122 .221 = 2.19. .0452 The p-value for a two-sided test is twice the area under the standard normal curve to the right of 2.19 (or twice the area to the left of -2.19), which is 2 (.014) = .028 At the 5% level, we reject the hypothesis that the probability of contracting a cold is the same whether you are given a placebo or Vitamin C. A CI for p1 p2 provides a measure of the size of the treatment eect. For a 95% CI zs = zcrit SECI ( p1 p 2 ) = 1.96 .221 (1 .221) .122 (1 .122) + = 1.96 (.04472) = .088. 140 139
The 95% CI for p1 p2 is (.122 .221) .088, or (.187, .011). We are 95% condent that p2 exceeds p1 by at least .011 but not by more than .187. On the surface, we would conclude that a daily dose of Vitamin C decreases a French skiers chance of developing a cold by between .011 and .187 (with 95% condence). This conclusion was somewhat controversial. Several reviews of the study felt that the experimenters evaluations of cold symptoms were unreliable. Many other studies refute the benet of Vitamin C as a treatment for the common cold. To implement this test and obtain a CI using Statas prtesti command (immediate from of prtest command uses data on the command line rather than in memory), we must provide the raw number of skiers receiving ascorbic acid (139) along with the proportion of these skiers that got a cold ( p1 = 0.122), as well as the raw number of skiers receiving placebo (140) along with the proportion of these skiers that got a cold ( p2 = 0.221). I actually like using the GUI (Statistics -> Summaries, tables & tests -> Classical tests of hypotheses -> Two sample proportion calculator) instead of the command line for this, in which case it all looks just like Minitab. Options and entries are a little more obvious from the GUI.
95
x: Number of obs = 139 y: Number of obs = 140 -----------------------------------------------------------------------------Variable | Mean Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------x | .122 .02776 .0675914 .1764086 y | .221 .0350672 .1522696 .2897304 -------------+---------------------------------------------------------------diff | -.099 .044725 -.1866594 -.0113406 | under Ho: .045153 -2.19 0.028 -----------------------------------------------------------------------------Ho: proportion(x) - proportion(y) = diff = 0 Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 z = -2.193 z = -2.193 z = -2.193 P < z = 0.0142 P > |z| = 0.0283 P > z = 0.9858 It actually is a little more direct to use counts instead of proportions you calculate, by typing prtesti 139 17 140 31, count. Example A case-control study was designed to examine risk factors for cervical dysplasia (Becker et al. 194). All the women in the study were patients at UNM clinics. The 175 cases were women, aged 18-40, who had cervical dysplasia. The 308 controls were women aged 18-40 who did not have cervical dysplasia. Each women was classied as positive or negative, depending on the presence of HPV (human papilloma virus). The data are summarized below. HPV Outcome Positive Negative Sample size Cases 164 11 175 Controls 130 178 308
. prtesti 139 0.122 140 0.221 Two-sample test of proportion
Let p1 be the probability that a case is HPV positive and let p2 be the probability that a control is HPV positive. The sample sizes are n1 = 175 and n2 = 308. The sample proportions of positive cases and controls are p 1 = 164/175 = .937 and p 2 = 130/308 = .422. For a 95% CI zcrit SECI ( p1 p 2 ) = 1.96 .937 (1 .937) .422 (1 .422) + = 1.96 (.03336) = .0659. 175 308
A 95% CI for p1 p2 is (.937 .422) .066, or .515 .066, or (.449, .581). I am 95% condent that p1 exceeds p2 by at least .45 but not by more than .58. Not surprisingly, a two-sided test at the 5% level would reject H0 : p1 = p2 . In this problem one might wish to do a one-sided test, instead of a two-sided test. Can you nd the p-value for the one-sided test in the Stata output below? . prtesti 175 0.937 308 0.422 Two-sample test of proportion x: Number of obs = 175 y: Number of obs = 308 -----------------------------------------------------------------------------Variable | Mean Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------x | .937 .0183663 .9010028 .9729972 y | .422 .0281413 .366844 .477156 -------------+---------------------------------------------------------------diff | .515 .0336044 .4491366 .5808634 | under Ho: .0462016 11.15 0.000 -----------------------------------------------------------------------------Ho: proportion(x) - proportion(y) = diff = 0 Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 z = 11.147 z = 11.147 z = 11.147 P < z = 1.0000 P > |z| = 0.0000 P > z = 0.0000 96
Appropriateness of the Large Sample Test and CI

The standard two sample CI and test used above are appropriate when each sample is large. A rule of thumb suggests a minimum of at least ve successes (i.e. observations with the characteristic of interest) and failures (i.e. observations without the characteristic of interest) in each sample before using these methods. This condition is satised in our two examples.
Eect Measures in Two-by-Two Tables

Consider a study of a particular disease, where each individual is either exposed or not-exposed to a risk factor. Let p1 be the proportion diseased among the individuals in the exposed population, and p2 be the proportion diseased among the non-exposed population. This population information can be summarized as a two-by-two table of population proportions: Outcome Diseased Non-Diseased Exposed population p1 1 p1 Non-Exposed population p2 1 p2
A standard measure of the dierence between the exposed and non-exposed populations is the absolute dierence: p1 p2 . We have discussed statistical methods for assessing this dierence. In many epidemiological and biostatistical settings, other measures of the dierence between populations are considered. For example, the relative risk p1 RR = p2 is commonly reported when the individual risks p1 and p2 are small. The odds ratio OR = p1 /(1 p1 ) p2 /(1 p2 )
is another standard measure. Here p1 /(1 p1 ) is the odds of being diseased in the exposed group, whereas p2 /(1 p2 ) is the odds of being diseased in the non-exposed group. Note that each of these measures can be easily estimated from data, using the sample proportions as estimates of the unknown population proportions. For example, in the vitamin C study: Outcome # with cold # with no cold Totals Ascorbic Acid 17 122 139 Placebo 31 109 140
the proportion with colds in the placebo group is p 2 = 31/140 = .221. The proportion with colds in the vitamin C group is p 1 = 17/139 = .122. The estimated absolute dierence in risk is p 1 p 2 = .122 .221 = .099. The estimated risk ratio and odds ratio are .122 RR = = .55 .221 and .122/(1 .122) OR = = .49, .221/(1 .221) respectively. In the literature it probably is most common to see OR (actually OR or adjusted OR) reported, usually from a logistic regression analysis that will be covered in the next section). We will be interested in testing H0 : OR = 1 (or H0 : RR = 1). We will estimate OR with OR and will need the sampling distribution of OR in order to construct tests and condence intervals.
97
Testing for Homogeneity of Proportions

Example The following two-way table of counts summarizes the location of death and age at death from a study of 2989 cancer deaths (Public Health Reports, 1983): (Obs Counts) Age 15-54 55-64 65-74 75+ Col Total Location of death Acute Care Chronic 418 524 581 558 2081
Home 94 116 156 138 504
care 23 34 109 238 404
Row Total 535 674 846 934 2989
The researchers want to compare the age distributions across locations. A one-way ANOVA would be ideal if the actual ages were given. Because the ages are grouped, the data should be treated as categorical. Given the dierences in numbers that died at the three types of facilities, a comparison of proportions or percentages in the age groups is appropriate. A comparison of counts is not. The table below summarizes the proportion in the four age groups at each location. For example, in the acute care facility 418/2081 = .201 and 558/2081 = .268. The pooled proportions are the Row Totals divided by the total sample size of 2989. The pooled summary gives the proportions in the four age categories, ignoring location of death. The age distributions for home and for the acute care facilities are similar, but are very dierent from the age distribution at chronic care facilities. To formally compare the observed proportions, one might view the data as representative sample of ages at death from the three locations. Assuming independent samples from the three locations (populations), a chi-squared statistic is used to test whether the population proportions of ages at death are identical (homogeneous) across locations. The chi-squared test for homogeneity of population proportions can be dened in terms of proportions, but is traditionally dened in terms of counts. (Proportions) Age 15-54 55-64 65-74 75+ Total Location of death Acute Care Chronic care .201 .057 .252 .084 .279 .270 .268 .589 1.000 1.000
Home .187 .230 .310 .273 1.000
Pooled .179 .226 .283 .312 1.000
In general, assume that the data are independent samples from c populations (strata, groups, sub-populations), and that each individual is placed into one of r levels of a categorical variable. The raw data will be summarized as a r c contingency table of counts, where the columns correspond to the samples, and the rows are the levels of the categorical variable. In the age distribution problem, r = 4 and c = 3. (SW uses k to identify the number of columns.) To implement the test: 1. Compute the (estimated) expected count for each cell in the table as follows: E= Row Total Column Total . Total Sample Size 98
2. Compute the Pearson test statistic 2 S = where O is the observed count.

2 2 3. For a size test, reject the hypothesis of homogeneity if 2 S crit , where crit is the upper critical value from the chi-squared distribution with df = (r 1)(c 1).
(O E )2 , E all cells
The p-value for the chi-squared test of homogeneity is equal to the area under the chi-squared curve to the right of 2 S ; see Figure 1.
2 with 4 degrees of freedom
2 with 4 degrees of freedom 2 S significant
= .05 (fixed)
p value (random)
2 Reject H0 for 2 Crit S here
10
15
2 Crit
10
2 S
15
Figure 1: The p-value is the shaded area on the right For a two-by-two table of counts, the chi-squared test of homogeneity of proportions is identical to the two-sample proportion test we discussed earlier.
Stata Analysis
One way to obtain the test statistic and p-value in Stata is to use the tabi command. The tables put out from that command are too poorly labelled to be very useful, though, so its preferable to put the data into the worksheet so that it looks like this: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. Age 1 1 1 2 2 2 3 3 3 4 4 4 Location 1 2 3 1 2 3 1 2 3 1 2 3 Count 94 418 23 116 524 34 156 581 109 138 558 238
The Hills and De Stavola book explains the following sequence, 99
. . . . .
label define label define label values label values list,clean Age 1. 15-54 2. 15-54 3. 15-54 4. 55-64 5. 55-64 6. 55-64 7. 65-74 8. 65-74 9. 65-74 10. 75+ 11. 75+ 12. 75+
agemap 1 "15-54" 2 "55-64" 3 "65-74" 4 "75+" locmap 1 "Home" 2 "Acute Care" 3 "Chronic Care" Age agemap Location locmap Location Home Acute Care Chronic Care Home Acute Care Chronic Care Home Acute Care Chronic Care Home Acute Care Chronic Care Count 94 418 23 116 524 34 156 581 109 138 558 238
If I typed list,clean nolabel I would get the original listing. Why am I bothering with this? I actually could put those labels in as variable values, and not bother with labels. When I form tables, though, Stata wants to alphabetize according to variable values which will force Home as the last column. By keeping values numeric I can get Stata to order correctly and print the correct labels. I nd it easiest to go through the menu path Summaries, tables, & tests -> Tables -> Two-way tables with measures of association to generate the following commands. Note in particular the [fweight = Count] (frequency weight given by Count variable) syntax to tell Stata that each line represents many observations. Minitab and SAS have similar options. . tabulate Age Location [fweight = Count], chi2 column expected lrchi2 row +--------------------+ | Key | |--------------------| | frequency | | expected frequency | | row percentage | | column percentage | +--------------------+ | Location Age | Home Acute Car Chronic C | Total -----------+---------------------------------+---------15-54 | 94 418 23 | 535 | 90.2 372.5 72.3 | 535.0 | 17.57 78.13 4.30 | 100.00 | 18.65 20.09 5.69 | 17.90 -----------+---------------------------------+---------55-64 | 116 524 34 | 674 | 113.6 469.3 91.1 | 674.0 | 17.21 77.74 5.04 | 100.00 | 23.02 25.18 8.42 | 22.55 -----------+---------------------------------+---------65-74 | 156 581 109 | 846 | 142.7 589.0 114.3 | 846.0 | 18.44 68.68 12.88 | 100.00 | 30.95 27.92 26.98 | 28.30 -----------+---------------------------------+---------75+ | 138 558 238 | 934 | 157.5 650.3 126.2 | 934.0 | 14.78 59.74 25.48 | 100.00 | 27.38 26.81 58.91 | 31.25 -----------+---------------------------------+---------Total | 504 2,081 404 | 2,989 | 504.0 2,081.0 404.0 | 2,989.0 | 16.86 69.62 13.52 | 100.00 | 100.00 100.00 100.00 | 100.00 Pearson chi2(6) = 197.6241 Pr = 0.000 likelihood-ratio chi2(6) = 200.9722 Pr = 0.000 100
The Pearson statistic is 197.6241 on 6 = (4-1)(3-1) df . The p-value is 0 to three places. The data strongly suggest that there are dierences in the age distributions among locations.
Testing for Homogeneity in Cross-Sectional and Stratied Studies

Two-way tables of counts are often collected either by stratied sampling or by cross-sectional sampling. In a stratied design, distinct groups, strata, or sub-populations are identied. Independent samples are selected from each group, and the sampled individuals are classied into categories. The HPV study is an illustration of a stratied design (and a case-control study). Stratied designs provide estimates for the strata (population) proportion in each of the categories. A test for homogeneity of proportions is used to compare the strata. In a cross-sectional design, individuals are randomly selected from a population and classied by the levels of two categorical variables. With cross-sectional samples you can test homogeneity of proportions by comparing either the row proportions or by comparing the column proportions. Example The following data (The Journal of Advertising, 1983, p. 34-42) are from a cross-sectional study that involved soliciting opinions on anti-smoking advertisements. Each subject was asked whether they smoked and their reaction (on a ve-point ordinal scale) to the ad. The data are summarized as a two-way table of counts, given below: Smoker Non-smoker Col Total The row proportions are (Row Prop) Smoker Non-smoker Str. Dislike .082 .110 Dislike .144 .149 Neutral .361 .278 Like .216 .217 Str. Like .196 .245 Row Tot 1.000 1.000 Str. Dislike 8 31 39 Dislike 14 42 56 Neutral 35 78 113 Like 21 61 82 Str. Like 19 69 88 Row Tot 97 281 378
For example, the entry for the (Smoker, Str. Dislike ) cell is: 8/97 = .082. Similarly, the column proportions are (Col Prop) Smoker Non-smoker Total Str. Dislike .205 .795 1.000 Dislike .250 .750 1.000 Neutral .310 .690 1.000 Like .256 .744 1.000 Str. Like .216 .784 1.000
Although it may be more natural to compare the smoker and non-smoker row proportions, the column proportions can be compared across ad responses. There is no advantage to comparing rows instead of columns in a formal test of homogeneity of proportions with cross-sectional data. The Pearson chi-squared test treats the rows and columns interchangeably, so you get the same result regardless of how you view the comparison. However, one of the two comparisons may be more natural to interpret. Note that checking for homogeneity of proportions is meaningful in stratied studies only when the comparison is across strata! Further, if the strata correspond to columns of the table, then the column proportions or percentages are meaningful whereas the row proportions are not. Question: How do these ideas apply to the age distribution problem? 101
Testing for Independence in a Two-Way Contingency Table

The row and column classications for a population where each individual is cross-classied by two categorical variables are said to be independent if each population cell proportion in the two-way table is the product of the proportion in a given row and the proportion in a given column. One can show that independence is equivalent to homogeneity of proportions. In particular, the two-way table of population cell proportions satises independence if and only if the population column proportions are homogeneous. If the population column proportions are homogeneous then so are the population row proportions. This suggests that a test for independence or no association between two variables based on a cross-sectional study can be implemented using the chi-squared test for homogeneity of proportions. This suggestion is correct. If independence is not plausible, I tend to interpret the dependence as a deviation from homogeneity, using the classication for which the interpretation is most natural. Example: Stata output for testing independence between smoking status and ad reaction is given below. The Pearson chi-squared test is not signicant (p-value = 0.559). The observed association between smoking status and the ad reaction is not signicant. This suggests, for example, that the smokers reactions to the ad were not statistically signicantly dierent from the non-smokers reactions, which is consistent with the smokers and non-smokers attitudes being fairly similar. The data were coded as opinion from 1 to 5 and smoke as 1 or 2, and then label dene applied as before. . tabulate Smoke Opinion [fweight=count],chi2 lrchi2 exp col row +--------------------+ | Key | |--------------------| | frequency | | expected frequency | | row percentage | | column percentage | +--------------------+ | Opinion Smoke | Str. Disl Dislike Neutral Like Str. Like | Total -----------+-------------------------------------------------------+---------Smoker | 8 14 35 21 19 | 97 | 10.0 14.4 29.0 21.0 22.6 | 97.0 | 8.25 14.43 36.08 21.65 19.59 | 100.00 | 20.51 25.00 30.97 25.61 21.59 | 25.66 -----------+-------------------------------------------------------+---------Non-smoker | 31 42 78 61 69 | 281 | 29.0 41.6 84.0 61.0 65.4 | 281.0 | 11.03 14.95 27.76 21.71 24.56 | 100.00 | 79.49 75.00 69.03 74.39 78.41 | 74.34 -----------+-------------------------------------------------------+---------Total | 39 56 113 82 88 | 378 | 39.0 56.0 113.0 82.0 88.0 | 378.0 | 10.32 14.81 29.89 21.69 23.28 | 100.00 | 100.00 100.00 100.00 100.00 100.00 | 100.00 Pearson chi2(4) = 2.9907 Pr = 0.559 likelihood-ratio chi2(4) = 2.9797 Pr = 0.561
One-sample procedures
Last semester we spent some time on the situation where we obtained a SRS of n observations from a binomial population (binary outcome variable) with probability p of Success. We learned how to calculate CIs for p and tests of H0 : p = p0 for some xed p0 . The large sample form of this is also done with the prtesti command or through the GUI, and the (preferable) exact binomial test is done through the bitesti command (or through the menus). The extension to 3 or more categories was the chi-squared goodness of t test, done in Stata using the csgof command. That command is not automatically installed but you can locate and install it from the findit csgof 102
10
LOGISTIC REGRESSION - TWO INTRODUCTORY EXAMPLES
command. Since we do these one sample procedures relatively infrequently, I am going to leave it to you to learn them in Stata if you need them.
10
Logistic Regression - Two Introductory Examples
The chi-squared tests in the previous section are used very frequently, along with Fishers exact test (asked for with the ,fisher option in tabulate note that it is often feasible to calculate only for small sample sizes). Those classical methods have been around a very long time and are often the best choice for analysis. In order to consider problems with more complicated predictors we need newer technology, so we now turn to logistic regression. The data below are from a study conducted by Milicer and Szczotka on pre-teen and teenage girls in Warsaw. The subjects were classied into 25 age categories. The number of girls in each group (sample size) and the number that reached menarche (# RM) at the time of the study were recorded. The age for a group corresponds to the midpoint for the age interval. Sample size 376 200 93 120 90 88 105 111 100 93 100 108 99 # RM 0 0 0 2 2 5 10 17 16 29 39 51 47 Age 9.21 10.21 10.58 10.83 11.08 11.33 11.58 11.83 12.08 12.33 12.58 12.83 13.08 Sample size 106 105 117 98 97 120 102 122 111 94 114 1049 # RM 67 81 88 79 90 113 95 117 107 92 112 1049 Age 13.33 13.58 13.83 14.08 14.33 14.58 14.83 15.08 15.33 15.58 15.83 17.58
The researchers were interested in whether the proportion of girls that reached menarche ( # RM/ sample size ) varied with age. One could perform a test of homogeneity by arranging the data as a 2 by 25 contingency table with columns indexed by age and two rows: ROW1 = # RM and ROW2 = # that have not RM = sample size # RM. A more powerful approach treats these as regression data, using the proportion of girls reaching menarche as the response and age as a predictor. The data were imported into Stata using the infile command and labelled menarche, total, and age. A plot of the observed proportion of girls that have reached menarche (obtained in Stata with the two commands generate phat = menarche / total and twoway (scatter phat age)) shows that the proportion increases as age increases, but that the relationship is nonlinear. The observed proportions, which are bounded between zero and one, have a lazy S -shape (a sigmoidal function) when plotted against age. The change in the observed proportions for a given change in age is much smaller when the proportion is near 0 or 1 than when the proportion is near 1/2. This phenomenon is common with regression data where the response is a proportion. The trend is nonlinear so linear regression is inappropriate. A sensible alternative might be to transform the response or the predictor to achieve near linearity. A better approach is to use a non-linear model for the proportions. A common choice is the logistic regression model.
103
10
phat 0 .2 .4
.6
.8
10
12 age
14
16
18
Figure 2: Estimated proportions p i versus AGEi , for i = 1, . . . , 25.
The Simple Logistic Regression Model

The simple logistic regression model expresses the population proportion p of individuals with a given attribute (called a success) as a function of a single predictor variable X . The model assumes that p is related to X through logit(p) = log or, equivalently, as p= p 1p = + X (1)
exp( + X ) . 1 + exp( + X )
The logistic regression model is a binary response model, where the response for each case falls into one of 2 exclusive and exhaustive categories, often called success (cases with the attribute of interest) and failure (cases without the attribute of interest). In many biostatistical applications, the success category is presence of a disease, or death from a disease. I will often write p as p(X ) to emphasize that p is the proportion of all individuals with score X that have the attribute of interest. In the menarche data, p = p(X ) is the population proportion of girls at age X that have reached menarche. The odds of success are p/(1 p). For example, the odds of success are 1 (or 1 to 1) when p = 1/2. The odds of success are 2 (or 2 to 1) when p = 2/3. The logistic model assumes that the log-odds of success is linearly related to X . Graphs of the logistic model relating p to X are given in Figure 3. The sign of the slope refers to the sign of . There are a variety of other binary response models that are used in practice. The probit regression model or the complementary log-log regression model might be appropriate when the logistic model does not t the data.
Data for Simple Logistic Regression

For the formulas below, I assume that the data are given in summarized or aggregate form:
104
10
Logit Scale
1.0
Probability Scale
+ slope Probability 0.4 0.6 0 slope - slope -5 0.2 + slope 0.0 Log-Odds 0
0 slope
0.8
- slope
-5
0 X
-5
0 X
Figure 3: logit(p) and p as a function of X X X1 X2 . . Xm n n1 n2 . . nm D d1 d2 . . dm
where di is the number of individuals with the attribute of interest (number of diseased) among ni randomly selected or representative individuals with predictor variable value Xi . The subscripts identify the group of cases in the data set. In many situations, the sample size is 1 in each group, and for this situation di is 0 or 1. For raw data on individual cases, the sample size column n is usually omitted and D takes on 1 of two coded levels, depending on whether the case at Xi is a success or not. The values 0 and 1 are typically used to identify failures and successes respectively.
Estimating Regression Coecients

The principle of maximum likelihood is commonly used to estimate the two unknown parameters in the logistic model: p log = + X. 1p The maximum likelihood estimates (MLE) of the regression coecients are estimated iteratively by maximizing the so-called Binomial likelihood function for the responses, or equivalently, by minimizing the deviance function (also called the likelihood ratio LR chi-squared statistic)
m
LR = 2
i=1
di log
di ni pi pi 1 pi
+ (ni di )log
ni di ni ni pi
over all possible values of and , where the pi s satisfy log = + Xi .
105
10
The ML method also gives standard errors and signicance tests for the regression estimates. The deviance is an analog of the residual sums of squares in linear regression. The choices for and that minimize the deviance are the parameter values that make the observed and tted proportions as close together as possible in a likelihood sense. are the MLEs of and . The deviance evaluated at the MLEs: Suppose that and
m
LR = 2
i=1
di log
di ni p i
+ (ni di )log
ni di ni ni p i
where the tted probabilities p i satisfy log p i 1p i i, = + X
is used to test the adequacy of the model. The deviance is small when the data ts the model, that is, when the observed and tted proportions are close together. Large values of LR occur when one or more of the observed and tted proportions are far apart, which suggests that the model is inappropriate. If the logistic model holds, then LR has a chi-squared distribution with m r degrees of freedom, where m is the number of groups and r (here 2) is the number of estimated regression parameters. A p-value for the deviance is given by the area under the chi-squared curve to the right of LR. A small p-value indicates that the data does not t the model. Stata does not provide the deviance statistic, but rather the Pearson chi-squared test statistic, which is dened similarly to the deviance statistic and is interpreted in the same manner: X2 =
m i=1
(di ni p i )2 . ni p i (1 p i )
This statistic can be interpreted as the sum of standardized, squared dierences between the observed number of successes di and expected number of successes ni p i for each covariate Xi . When what we expect to see under the model agrees with what we see, the Pearson statistic is close to zero, indicating good model t to the data. When the Pearson statistic is large, we have an indication of lack of t. Often the Pearson residuals ri = (di ni p i )/ ni p i (1 p i ) are used to determine exactly where lack of t occurs. These residuals are obtained in Stata using the predict command after the logistic command. Examining these residuals is very similar to looking for E )2 in a 2 analysis of a contingency table as discussed in the last lecture. We large values of (O E will not talk further of logistic regression diagnostics.
Age at Menarche Data: Stata Implementation

A logistic model for these data implies that the probability p of reaching menarche is related to age through p log = + AGE. 1p If the model holds, then a slope of = 0 implies that p does not depend on AGE, i.e. the proportion of girls that have reached menarche is identical across age groups. However, the power of the logistic regression model is that if the model holds, and if the proportions change with age, then you have a way to quantify the eect of age on the proportion reaching menarche. This is more appealing and useful than just testing homogeneity across age groups. A logistic regression model with a single predictor can be t using one of the many commands available in Stata depending on the data type and desired results: logistic (raw data, outputs 106
10
odds ratios), logit (raw data, outputs model parameter estimates), and blogit (grouped data). The logistic command has many more options than either logit or blogit, but requires you to reformat the data into individual records, one for each girl. For an example of how to do this, check out the online Stata help at http://www.stata.com/support/faqs/stat/grouped.html. The Stata command blogit menarche total age yields the following output: Logit estimates Number of obs = 3918 LR chi2(1) = 3667.18 Prob > chi2 = 0.0000 Log likelihood = -819.65237 Pseudo R2 = 0.6911 -----------------------------------------------------------------------------_outcome | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------age | 1.631968 .0589509 27.68 0.000 1.516427 1.74751 _cons | -21.22639 .7706558 -27.54 0.000 -22.73685 -19.71594 ----------------------------------------------------------------------------- = 1.63. Thus, the tted or The output tables the MLEs of the parameters: = 21.23 and predicted probabilities satisfy: log or p (AGE ) = p = 21.23 + 1.63AGE 1p exp(21.23 + 1.63AGE) . 1 + exp(21.23 + 1.63AGE)
The p-value for testing H0 : = 0 (i.e. the slope for the regression model is zero) based upon the chi-squared test p-value (P>|z|) is 0.000, which leads to rejecting H0 at any of the usual test levels. Thus, the proportion of girls that have reached menarche is not constant across age groups. The likelihood ratio test statistic of no logistic regression relationship (LR chi2(1) = 3667.18) and p-value (Prob > chi2 = 0.0000) gives the logistic regression analogue of the overall F-statistic that no predictors are important to multiple regression. In general, the chi-squared statistic provided here is used to test the hypothesis that the regression coecients are zero for each predictor in the model. There is a single predictor here, AGE, so this test and the test for the AGE eect are both testing H0 : = 0. To obtain the Pearson goodness of t statistic and p-value we must reformat the data and use the logistic command as described in the webpage above: generate w0 = total - menarche rename menarche w1 generate id = _n reshape long w, i(id) j(y) logistic y age [fw=w] lfit We obtain the following output: Number of obs = 3918 LR chi2(1) = 3667.18 Prob > chi2 = 0.0000 Log likelihood = -819.65237 Pseudo R2 = 0.6911 -----------------------------------------------------------------------------y | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------age | 5.113931 .3014706 27.68 0.000 4.555917 5.740291 -----------------------------------------------------------------------------Logistic model for y, goodness-of-fit test number of observations = 3918 number of covariate patterns = 25 Pearson chi2(23) = 21.87 Prob > chi2 = 0.5281 107 Logistic regression
10
Using properties of exponential functions, the odds of reaching menarche is exp(1.632) = 5.11 times larger for every year older a girl is. To see this, let p(Age + 1) and p(Age) be probabilities of reaching menarche for ages one year apart. The odds ratio OR satises log(OR) = log p(Age + 1)/(1 p(Age + 1)) p(Age)/(1 p(Age)) = log (p(Age + 1)/(1 p(Age + 1))) log (p(Age)/(1 p(Age)))
= ( + (Age + 1)) ( + Age) = so OR = e . If we considered ages 5 years apart, the same derivation would give us OR = e5 = (e )5 . You often see a continuous variable with a signicant though apparently small OR, but when you examine the OR for a reasonable range of values (by raising to the power of the range in this way), then the OR is substantial. = 1.632 and the estimated odds ratio You should pick out the estimated regression coecient exp( ) = exp(1.632) = 5.11 from the output obtained using the blogit and logistic commands respectively. We would say that, for example, that the odds of 15 year old girls having reached menarche are between 4.5 and 5.7 times larger than for 14 year old girls. The Pearson chi-square statistic is 21.87 on 23 df, with a p-value of 0.5281. The large p-value suggests no gross deciencies with the logistic model.
Logistic Regression with Two Eects: Leukemia Data

Feigl and Zelen reported the survival time in weeks and the white cell blood count (WBC) at time of diagnosis for 33 patients who eventually died of acute leukemia. Each person was classied as AG+ or AG- (coded as IAG = 1 and 0, respectively), indicating the presence or absence of a certain morphological characteristic in the white cells. The researchers are interested in modelling the probability p of surviving at least one year as a function of WBC and IAG. They believe that WBC should be transformed to a log scale, given the skewness in the WBC values. Where Live=0, 1 indicates whether the patient died or lived respectively, the data are IAG WBC Live IAG WBC Live IAG WBC Live --------------------------------------------1 75 1 1 230 1 1 430 1 1 260 1 1 600 0 1 1050 1 1 1000 1 1 1700 0 1 540 0 1 700 1 1 940 1 1 3200 0 1 3500 0 1 5200 0 1 10000 1 1 10000 0 1 10000 0 0 440 1 0 300 1 0 400 0 0 150 0 0 900 0 0 530 0 0 1000 0 0 1900 0 0 2700 0 0 2800 0 0 3100 0 0 2600 0 0 2100 0 0 7900 0 0 10000 0 0 10000 0 As an initial step in the analysis, consider the following model: log p 1p = + 1 LWBC + 2 IAG,
108
10
where LWBC = log WBC. This is a logistic regression model with 2 eects, t using the logistic command. The parameters , 1 and 2 are estimated by maximum likelihood. The model is best understood by separating the AG+ and AG- cases. For AG- individuals, IAG=0 so the model reduces to log p 1p = + 1 LWBC + 2 0 = + 1 LWBC.
For AG+ individuals, IAG=1 and the model implies log p 1p = + 1 LWBC + 2 1 = ( + 2 ) + 1 LWBC.
The model without IAG (i.e. 2 = 0) is a simple logistic model where the log-odds of surviving one year is linearly related to LWBC, and is independent of AG. The reduced model with 2 = 0 implies that there is no eect of the AG level on the survival probability once LWBC has been taken into account. Including the binary predictor IAG in the model implies that there is a linear relationship between the log-odds of surviving one year and LWBC, with a constant slope for the two AG levels. This model includes an eect for the AG morphological factor, but more general models are possible. Thinking of IAG as a factor, the proposed model is a logistic regression analog of ANCOVA. The parameters are easily interpreted: and + 2 are intercepts for the population logistic regression lines for AG- and AG+, respectively. The lines have a common slope, 1 . The 2 coecient for the IAG indicator is the dierence between intercepts for the AG+ and AG- regression lines. A picture of the assumed relationship is given below for 1 < 0. The population regression lines are parallel on the logit (i.e. log odds ) scale only, but the order between IAG groups is preserved on the probability scale.
Logit Scale
1.0
Probability Scale
Probability 0.4 0.6
0.8
Log-Odds 0
IAG=1
IAG=1
-5
IAG=0
0.2 -10 -5 0 LWBC 5 0.0
IAG=0
-5
0 LWBC
Figure 4: Predicted relationships on the logit and probability scales The data are in the raw data form for individual cases. There are three columns: the binary or indicator variable iag (with value 1 for AG+, 0 for AG-), wbc (continuous), live (with value 1 if the patient lived at least 1 year and 0 if not). Note that a frequency column is not needed with 109
10
raw data (and hence using the logistic command) and that the success category corresponds to surviving at least 1 year. Before looking at output for the equal slopes model, note that the data set has 30 distinct IAG and WBC combinations, or 30 groups or samples that could be constructed from the 33 individual cases. Only two samples have more than 1 observation. The majority of the observed proportions surviving at least one year (number surviving 1 year/ group sample size) are 0 (i.e. 0/1) or 1 (i.e. 1/1). This sparseness of the data makes it dicult to graphically assess the suitability of the logistic model (Why?). Although signicance tests on the regression coecients do not require large group sizes, the chi-squared approximations to the deviance and Pearson goodness-of-t statistics are suspect in sparse data settings. With small group sizes as we have here, most researchers would not interpret the p-values for the deviance or Pearson tests literally. Instead, they would use the p-values to informally check the t of the model. Diagnostics would be used to highlight problems with the model. We obtain the following modied output: . infile iag wbc live using c:/biostat/notes/leuk.txt . generate lwbc = log(wbc) . logistic live iag lwbc . logit . lfit Logistic regression Number of obs = 33 LR chi2(2) = 15.18 Prob > chi2 = 0.0005 Log likelihood = -13.416354 Pseudo R2 = 0.3613 -----------------------------------------------------------------------------live | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------iag | 12.42316 13.5497 2.31 0.021 1.465017 105.3468 lwbc | .3299682 .1520981 -2.41 0.016 .1336942 .8143885 -----------------------------------------------------------------------------Logit estimates Number of obs = 33 LR chi2(2) = 15.18 Prob > chi2 = 0.0005 Log likelihood = -13.416354 Pseudo R2 = 0.3613 -----------------------------------------------------------------------------live | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------iag | 2.519562 1.090681 2.31 0.021 .3818672 4.657257 lwbc | -1.108759 .4609479 -2.41 0.016 -2.0122 -.2053178 _cons | 5.543349 3.022416 1.83 0.067 -.380477 11.46718 -----------------------------------------------------------------------------Logistic model for live, goodness-of-fit test number of observations = 33 number of covariate patterns = 30 Pearson chi2(27) = 19.81 Prob > chi2 = 0.8387 The large p-value (0.8387) for the lack-of-t chi-square (i.e. the Pearson statistic) indicates that there are no gross deciencies with the model. Given that the model ts reasonably well, a test of H0 : 2 = 0 might be a primary interest here. This checks whether the regression lines are identical for the two AG levels, which is a test for whether AG aects the survival probability, after taking LWBC into account. The test that H0 : 2 = 0 is equivalent to testing that the odds ratio exp(2 ) is equal to 1: H0 : e2 = 1. The p-value for this test is 0.021. The test is rejected at any of the usual signicance levels, suggesting that the AG level aects the survival probability (assuming a very specic model). In fact we estimate that the odds of surviving past a year in the AG+ population is 12.4 times the odds of surviving past a year in the AG- population, with a 95% CI of (1.4, 105.4); see below for this computation carried out explicitly.
110
10
The estimated survival probabilities satisfy log p = 5.54 1.11LWBC + 2.52IAG. 1p
For AG- individuals with IAG=0, this reduces to log or equivalently, p = For AG+ individuals with IAG=1, log or p = p = 5.54 1.11LWBC + 2.52 (1) = 8.06 1.11LWBC, 1p exp(8.06 1.11LWBC) . 1 + exp(8.06 1.11LWBC) p = 5.54 1.11LWBC, 1p exp(5.54 1.11LWBC) . 1 + exp(5.54 1.11LWBC)
Using the logit scale, the dierence between AG+ and AG- individuals in the estimated logodds of surviving at least one year, at a xed but arbitrary LWBC, is the estimated IAG regression coecient: (8.06 1.11LWBC) (5.54 1.11LWBC) = 2.52. Using properties of exponential functions, the odds that an AG+ patient lives at least one year is exp(2.52) = 12.42 times larger than the odds that an AG- patient lives at least one year, regardless of LWBC. Although the equal slopes model appears to t well, a more general model might t better. A natural generalization here would be to add an interaction, or product term, IAG LWBC to the model. The logistic model with an IAG eect and the IAG LWBC interaction is equivalent to tting separate logistic regression lines to the two AG groups. This interaction model provides an easy way to test whether the slopes are equal across AG levels. I will note that the interaction term is not needed here.
111

9 Review of Discrete Data Analysis: Comparing Two Proportions: Independent Samples

Uploaded by

Copyright:

Available Formats

9 Review of Discrete Data Analysis: Comparing Two Proportions: Independent Samples

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

9 Review of Discrete Data Analysis: Comparing Two Proportions: Independent Samples

Uploaded by

Copyright:

Available Formats

9

REVIEW OF DISCRETE DATA ANALYSIS

Review of Discrete Data Analysis

Comparing Two Proportions: Independent Samples

Large Sample CI and Tests for p1 p2

is the test standard error for p 1 p 2 . The pooled proportion p = n1 p 1 + n2 p 2 n1 + n2

REVIEW OF DISCRETE DATA ANALYSIS

REVIEW OF DISCRETE DATA ANALYSIS

. prtesti 139 0.122 140 0.221 Two-sample test of proportion

REVIEW OF DISCRETE DATA ANALYSIS

Appropriateness of the Large Sample Test and CI

Eect Measures in Two-by-Two Tables

REVIEW OF DISCRETE DATA ANALYSIS

Testing for Homogeneity of Proportions

Home 94 116 156 138 504

care 23 34 109 238 404

Row Total 535 674 846 934 2989

Home .187 .230 .310 .273 1.000

Pooled .179 .226 .283 .312 1.000

REVIEW OF DISCRETE DATA ANALYSIS

2. Compute the Pearson test statistic 2 S = where O is the observed count.

2 with 4 degrees of freedom

2 with 4 degrees of freedom 2 S significant

2 Reject H0 for 2 Crit S here

The Hills and De Stavola book explains the following sequence, 99

REVIEW OF DISCRETE DATA ANALYSIS

REVIEW OF DISCRETE DATA ANALYSIS

Testing for Homogeneity in Cross-Sectional and Stratied Studies

REVIEW OF DISCRETE DATA ANALYSIS

Testing for Independence in a Two-Way Contingency Table

LOGISTIC REGRESSION - TWO INTRODUCTORY EXAMPLES

Logistic Regression - Two Introductory Examples

LOGISTIC REGRESSION - TWO INTRODUCTORY EXAMPLES

Figure 2: Estimated proportions p i versus AGEi , for i = 1, . . . , 25.

The Simple Logistic Regression Model

Data for Simple Logistic Regression

LOGISTIC REGRESSION - TWO INTRODUCTORY EXAMPLES

Figure 3: logit(p) and p as a function of X X X1 X2 . . Xm n n1 n2 . . nm D d1 d2 . . dm

Estimating Regression Coecients

over all possible values of and , where the pi s satisfy log = + Xi .

LOGISTIC REGRESSION - TWO INTRODUCTORY EXAMPLES

where the tted probabilities p i satisfy log p i 1p i i, = + X

Age at Menarche Data: Stata Implementation

LOGISTIC REGRESSION - TWO INTRODUCTORY EXAMPLES

LOGISTIC REGRESSION - TWO INTRODUCTORY EXAMPLES

Logistic Regression with Two Eects: Leukemia Data

LOGISTIC REGRESSION - TWO INTRODUCTORY EXAMPLES

Probability 0.4 0.6

0.2 -10 -5 0 LWBC 5 0.0

LOGISTIC REGRESSION - TWO INTRODUCTORY EXAMPLES

LOGISTIC REGRESSION - TWO INTRODUCTORY EXAMPLES

The estimated survival probabilities satisfy log p = 5.54 1.11LWBC + 2.52IAG. 1p

You might also like