MODULE 5
SOLUTIONS FOR SPECIMEN PAPER B
THE QUESTIONS ARE CONTAINED IN A SEPARATE FILE
The time for the examination is 3 hours. The paper contains eight questions, of which
candidates are to attempt five. Each question carries 20 marks. An indicative mark
scheme is shown within the questions, by giving an outline of the marks available for
each part-question. The pass mark for the paper as a whole is 50%.
The solutions should not be seen as "model answers". Rather, they have been written
out in considerable detail and are intended as learning aids. For this reason, they do
not carry mark schemes. Please note that in many cases there are valid alternative
methods and that, in cases where discussion is called for, there may be other valid
points that could be made.
While every care has been taken with the preparation of the questions and solutions,
the Society will not be responsible for any errors or omissions.
The Society will not enter into any correspondence in respect of the questions or
solutions.
Note. In accordance with the convention used in all the Society's examination papers, the notation log denotes
logarithm to base e. Logarithms to any other base are explicitly identified, e.g. log10.
© RSS 2008
(i)
The original variables, the answers to the questions, are likely to be highly
correlated. Principal component analysis (PCA) gives linear combinations of
the variables that are uncorrelated. The first PC accounts for the largest
amount of variation in the data, the second for the next largest, and so on. If
the questions form themselves into relatively distinct clusters then PCs are
useful to define subsets, and possibly to suggest ways of combining scores.
PCA is strictly valid only for numeric data, whereas the data here are at best
ordinal and nearer to being categorical. However, PCA is often used for data
such as these.
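As an informal illustration (not part of the original solution), the following sketch shows how such a PCA might be computed from the correlation matrix; the `responses` array is invented stand-in data for the six retained questions.

```python
import numpy as np

# Invented stand-in data: 100 respondents answering 6 questions on a
# 1-5 scale (the real survey data are not reproduced here).
rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(100, 6)).astype(float)

# PCA via the eigendecomposition of the correlation matrix, appropriate
# when the items are measured on comparable ordinal scales.
corr = np.corrcoef(responses, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]        # largest eigenvalue first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Proportion of total variation explained by each PC; the total is 6,
# the trace of the correlation matrix.
print(eigenvalues / eigenvalues.sum())
```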
(ii)

(iii)
The first three eigenvalues add to 5.14, i.e. 5.14/6 or 85.7% of the total
variation, and should be enough.

(iv)
The first PC (54% of total variation) is an overall score of concern about cost;
note that the "direction" of questions 2, 3, 4 is opposite to that of 1, 5, 6. The
second PC (23% of total variation) measures the tendency of respondents to
answer all questions in the same way, i.e. with similar scores. The third PC
(9% of total variation, and so relatively much less important) is dominated by
question 4, perhaps contrasting its answers with those for question 2, perhaps
also taking question 5 into account. The first two PCs therefore give most of
the useful, easily understood information.
(v)
The two unsatisfactory features of the data are the large amount of missing
information, leading to 9 of the 15 questions being discarded, and the
suggestion from the second PC that the respondents do not complete the form
validly. Hence these results are not reliable. A fresh start is needed, with
reworded questions and boxes to tick as in a survey.
(ii)

(iii)

(iv)
Method 1
After constructing and applying the discriminant function, 14/17 (group 1) and
12/15 (group 2) are found to have been correctly classified. This is good, but is
likely to be an overestimate of the future success rate, since the same data have
been used both to construct the function and to "check" it.
Cross-validation may be carried out by, for example, a jack-knife method:
calculate the function omitting one observation, and use the function to predict
class membership of that item; repeat this for each item in turn and observe
the number of correct predictions. [In a large data-set, the discriminant
function would be calculated on some of the data and then used to check the
success rate of the remainder. Here we do not have enough data for that.]
This gave 12/17 (group 1) and 9/15 (group 2) correct.
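A minimal sketch of this jack-knife check, assuming the data sit in an array `X` of measurements and an array `group` of class labels; scikit-learn's LinearDiscriminantAnalysis stands in for whatever discriminant routine the original analysis used.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def loo_correct(X, group):
    """Leave-one-out cross-validation: refit the discriminant function
    with each observation omitted, then try to classify that observation."""
    n = len(group)
    correct = 0
    for i in range(n):
        keep = np.arange(n) != i
        lda = LinearDiscriminantAnalysis().fit(X[keep], group[keep])
        correct += int(lda.predict(X[i:i + 1])[0] == group[i])
    return correct  # number of correct leave-one-out predictions
```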
Method 2
Note that x4 was identified in (iii) as a useful variate. This method correctly
classifies 12/17 (group 1) and 12/15 (group 2), and the numbers on cross-validation
are the same. This seems the better method.
With these sample sizes, using 5 variables (Method 1) may be over-fitting.
Method 2, which has turned out to be univariate, is more successful.
(i)
The log-likelihood (log L) is not given for the null model, but the coefficient
for x1 in model A (which contains x1 only) is large relative to its standard
error; further, the hazard ratio is high. These suggest that x1 is important.
The difference in −2 log L between models B and C is small and certainly not
statistically significant. Thus it seems likely that there is not an interaction
effect (the interaction term is the only difference between the two models).
The difference in −2 log L between models A and B is 2(75.640 − 71.518) =
8.244. This is significant as an observation from χ²₁. So we conclude that the
best model is model B.
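The likelihood-ratio comparison can be made explicit in a couple of lines; this assumes, as the doubling in the solution suggests, that the quoted 75.640 and 71.518 are values of −log L for models A and B respectively.

```python
from scipy import stats

neg_logL_A = 75.640   # model A: x1 only (-log L as quoted)
neg_logL_B = 71.518   # model B: one extra parameter (the fault term, per part (iii))
lrt = 2 * (neg_logL_A - neg_logL_B)   # likelihood-ratio statistic = 8.244
p = stats.chi2.sf(lrt, df=1)          # one degree of freedom between A and B
print(lrt, p)                         # p is about 0.004, so prefer model B
```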
(ii)

(iii)
The fault reduces the expected lifetime. Given two items, both with the same
level of the chemical (x1), the one with the fault is about 3 times
[exp(1.172) ≈ 3.2] as likely to fail at any given time.
(iv)
If the proportional hazards assumption is not valid then the model is deficient.
In particular the interpretation of the hazard ratios is invalid, and the answers
to parts (iii) and (iv) may be inaccurate. The consequences depend to some
extent on the type and seriousness of any departures from the assumptions.
(ii)
The Kaplan-Meier estimate is the product of successive factors

$$\frac{n(j) - d(j)}{n(j)} .$$
There are 24 patients. The calculation is shown in detail in the table below.
Some of the detail might be omitted in practice, and is often not shown in
computer output. Rows for the censored observations have been omitted, but
care must be taken to ensure that n(j) is always correct. Users of this solution
should carefully verify the values in the table by reference to the data in the
question.
Time t(j)   n(j)   d(j)   (n(j) − d(j))/n(j)   Cumulative survival estimate Ŝ(t)
    6        23      4          19/23                    0.8261
    8        19      2          17/19                    0.7391
   12        17      2          15/17                    0.6522
   20        10      1           9/10                    0.5870
   24         8      1            7/8                    0.5136
   30         4      1            3/4                    0.3852
   42         1      1             0                     0

[Here n(j) is the number remaining just before time t(j), and d(j) is the number
of events at time t(j), as defined in the text above.]
Greenwood's formula for the standard error of the Kaplan-Meier estimate at
12 months' follow-up (corresponding to the third row in the calculation above)
is

$$\mathrm{SE} = \hat S(12)\left\{ \sum_{j=1}^{3} \frac{d_j}{n_j (n_j - d_j)} \right\}^{1/2}
= 0.6522 \left\{ \frac{4}{23 \times 19} + \frac{2}{19 \times 17} + \frac{2}{17 \times 15} \right\}^{1/2}
= 0.0993 .$$
(iii)
The interval is 0.6522 ± (1.96 × 0.0993) = 0.6522 ± 0.1946, i.e. (0.46, 0.85).
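The whole table, the Greenwood standard error and the interval in (iii) can be verified with a short script; the (t(j), n(j), d(j)) triples below are taken directly from the table above.

```python
import math

# (time t_j, number at risk n_j, number of events d_j) from the table
rows = [(6, 23, 4), (8, 19, 2), (12, 17, 2), (20, 10, 1),
        (24, 8, 1), (30, 4, 1), (42, 1, 1)]

surv, green = 1.0, 0.0
for t, n, d in rows:
    surv *= (n - d) / n                  # Kaplan-Meier product-limit step
    if n > d:
        green += d / (n * (n - d))       # Greenwood accumulator
        se = surv * math.sqrt(green)
    else:
        se = 0.0                         # estimate has dropped to zero
    print(t, round(surv, 4), round(se, 4))

# At t = 12 this prints 0.6522 and 0.0993, reproducing the 95% interval
# 0.6522 +/- 1.96 * 0.0993 = (0.46, 0.85) of part (iii).
```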
(iv)
(v)
(b)
[Population pyramid: age, from 10 to 100 in steps of 10, on the vertical axis;
males plotted to the left of the axis and females to the right, on a horizontal
scale marked from 50 to 350.]
(i)
(ii)
$$\text{Odds} = \frac{P(\text{event happens})}{1 - P(\text{event happens})} .$$

The odds ratio is the ratio of the odds of disease in the exposed group of patients
(i.e. here the smokers) to that in the unexposed group, i.e. here

$$\text{odds ratio} =
\frac{P(\text{disease} \mid \text{smoker}) \,/\, \{1 - P(\text{disease} \mid \text{smoker})\}}
{P(\text{disease} \mid \text{non-smoker}) \,/\, \{1 - P(\text{disease} \mid \text{non-smoker})\}} .$$

(iii)
              Cancer   No cancer   Total
Smokers           89          13     102
Non-smokers      394         434     828
Total            483         447     930

Using the relative frequencies from this table, the odds ratio may be calculated
as

$$\frac{89 \times 434}{13 \times 394} = 7.54 ,$$

which is substantially greater than 1 and indicates greater prevalence of cancer
among smokers.
(iv)
The Mantel-Haenszel estimate of the odds ratio is

$$\frac{\sum_i a_i d_i / n_i}{\sum_i b_i c_i / n_i} .$$

This gives

$$\frac{\dfrac{58 \times 271}{580} + \dfrac{31 \times 163}{350}}
{\dfrac{6 \times 245}{580} + \dfrac{7 \times 149}{350}}
= \frac{41.537}{5.514} = 7.53 .$$
This is virtually the same as for the pooled data (7.54). This often turns out to
happen when both subsets of the data are large and of the same order of size;
also, in this case, we might suppose that sex is not in fact an important factor.
To obtain a 95% confidence interval for the odds ratio, we work via
logarithms and first use the pooled data to obtain the standard error of the log
odds ratio using the formula

$$\operatorname{Var}(\log \text{odds ratio}) = \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}$$

(= 0.093001 here), so that the standard error is √0.093001 = 0.305. The log of
the Mantel-Haenszel estimate of the odds ratio is log(41.537/5.514) = 2.0193,
so the 95% confidence interval for the logarithm is given by
2.0193 ± (1.96 × 0.305) = (1.42, 2.62), i.e. (4.1, 13.7) for the odds ratio itself.
The confidence interval does not contain 1.00, so we may reject the null
hypothesis that smoking status and occurrence of lung cancer are unrelated.
There is definite evidence of an association. As mentioned above, the odds
ratio strongly indicates greater prevalence of cancer among smokers than
among non-smokers. It does not appear that sex is an important factor in this.
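For checking, the pooled odds ratio, the Mantel-Haenszel estimate and the confidence interval can all be computed from the two stratum tables used in part (iv), with (a, b, c, d) = (smokers with cancer, smokers without, non-smokers with, non-smokers without) in each sex stratum:

```python
import math

# (a, b, c, d) for each sex stratum, with stratum totals 580 and 350
strata = [(58, 6, 245, 271), (31, 7, 149, 163)]

num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)  # 41.537
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)  # 5.514
mh = num / den                                               # 7.53

# Pooled 2x2 table and the variance of the log odds ratio
a, b, c, d = (sum(col) for col in zip(*strata))   # 89, 13, 394, 434
se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)     # 0.305
log_or = math.log(mh)                             # 2.0193
ci = (math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se))
print(mh, ci)                                     # 7.53, about (4.1, 13.7)
```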
Part (i)

(a)
Write R̂ = ȳ/x̄, R = Ȳ/X̄, and f = n/N for the sampling fraction. Since
R̂x̄ = ȳ, i.e. ȳ − R̂x̄ = 0, we have

$$\operatorname{Cov}(\hat R, \bar x) = E(\hat R \bar x) - E(\hat R)\,E(\bar x)
= E(\bar y) - \bar X\,E(\hat R) = \bar Y - \bar X\,E(\hat R) .$$

This gives

$$E(\hat R) = \frac{\bar Y}{\bar X} - \frac{\operatorname{Cov}(\hat R, \bar x)}{\bar X},
\qquad\text{i.e.}\qquad
E(\hat R) - R = -\frac{\operatorname{Cov}(\hat R, \bar x)}{\bar X} .$$

(b)

$$\operatorname{Var}(\hat R) \approx \frac{1-f}{n\bar x^{2}}\cdot\frac{1}{n-1}
\sum_i \left\{ (y_i - \bar y) - \hat R (x_i - \bar x) \right\}^{2}$$

$$= \frac{1-f}{n\bar x^{2}}\cdot\frac{1}{n-1}
\left\{ \sum_i (y_i - \bar y)^{2} - 2\hat R \sum_i (y_i - \bar y)(x_i - \bar x)
+ \hat R^{2} \sum_i (x_i - \bar x)^{2} \right\}$$

$$= \frac{1-f}{n\bar x^{2}} \left( s_Y^{2} - 2\hat R\,\hat\rho\, s_X s_Y + \hat R^{2} s_X^{2} \right),$$

in which s_Y², s_X² are the estimated variances of Y and X, and ρ̂ is the estimated
correlation coefficient for X and Y.
Part (ii)
The ratio method works well when Y is proportional to X, with the relation passing
through the origin. It will not be better than a simple random sample when ρ is less
than 0 or when the relation does not pass through the origin (in which case a
regression estimator is required instead).
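As a standard supplementary result (not demanded by the question), comparing the variance in Part (i)(b) with that of the sample mean shows that, for positive R, the ratio estimator is the more precise exactly when

$$\rho > \frac{1}{2}\,\frac{C_X}{C_Y}, \qquad C_X = \frac{S_X}{\bar X}, \quad C_Y = \frac{S_Y}{\bar Y},$$

which is why a negative correlation, as noted above, always favours the simple random sample estimate.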
Part (iii)
(a)
The sugar content of an individual fruit should be roughly proportional to its
weight, in fruit from the same source and batch.
(b)
Since we are not told N, the total number of oranges, a ratio estimator is used
rather than regression. Counting the whole batch would take a very long time for
what might be a very small improvement in precision.
We have Σxᵢ = 1975 and Σyᵢ = 110.9, so that x̄ = 197.5 and

$$\hat R = \frac{\bar y}{\bar x} = \frac{110.9}{1975} = 0.05615 .$$

With X_T = 820, the estimated total is

$$\hat Y_T = \hat R X_T = 46.045 \text{ (kg)} .$$

We have Var(Ŷ_T) = X_T² Var(R̂). Also, on neglecting f, which will be very small
(as n is only 10), the value of the estimator of Var(R̂) is

$$\frac{1}{n(n-1)\,\bar x^{2}} \left\{ \sum y_i^{2} - 2\hat R \sum x_i y_i + \hat R^{2} \sum x_i^{2} \right\}$$

$$= \frac{1}{10 \times 9 \times (197.5)^{2}}
\left\{ 1268.69 - 2(0.05615)(22194.8) + (0.05615)^{2}(392389) \right\}$$

$$= \frac{1268.69 - 2492.476 + 1237.133}{3510562.5} = \frac{13.347}{3510562.5}
= 3.802 \times 10^{-6},$$

which on multiplying by (820)² gives the value of the estimator of Var(Ŷ_T) as
2.5564, i.e. the standard error is 1.599 (kg).
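The arithmetic in (b) can be checked directly from the quoted sums; the figures printed below differ slightly in the last digit from the hand calculation because R̂ is not rounded to 0.05615.

```python
import math

n = 10
sum_x, sum_y = 1975.0, 110.9
sum_x2, sum_y2, sum_xy = 392389.0, 1268.69, 22194.8
X_T = 820.0

xbar = sum_x / n                                 # 197.5
R = sum_y / sum_x                                # 0.056152
Y_T = R * X_T                                    # about 46.04 kg

# Estimator of Var(R), neglecting the sampling fraction f
ss = sum_y2 - 2 * R * sum_xy + R ** 2 * sum_x2   # about 13.3
var_R = ss / (n * (n - 1) * xbar ** 2)           # about 3.8e-06
se_total = X_T * math.sqrt(var_R)                # about 1.6 kg
print(Y_T, se_total)
```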
(c)
The half-width of the interval, ts/√n, is to be less than 2. Taking t ≈ 2, this
requires s/√n < 1; here s = 1.599√10 ≈ 5.06, so n > (5.06)² ≈ 25.6, and
25 oranges will achieve this approximately.
(ii)
The estimated variance of the estimated population total is

$$\sum_{h=1}^{L} N_h (N_h - n_h)\,\frac{s_h^{2}}{n_h}
= 400(400 - 98)\,\frac{74.7^{2}}{98} + \ldots = 84123268.3 .$$

For Neyman (optimum) allocation, the stratum weights are

$$w_h = \frac{N_h S_h}{\sum_{h=1}^{L} N_h S_h},$$

so that the allocation is n_h = n w_h = n N_h S_h / Σ N_h S_h.

Here Σ N_h S_h = 133903. Also we have Σ N_h S_h² = 47882186 (this appears in the
denominator of the formula). Finally, we need V. The criterion of d = 8000 with
(one-sided) tail probability 0.025 gives V = (8000/1.96)². Hence

$$n = \frac{(133903)^{2}}{(8000/1.96)^{2} + 47882186} = 277.804 .$$

With this allocation, the estimated variance of the estimated population total
becomes

$$\sum_{h=1}^{L} N_h (N_h - n_h)\,\frac{s_h^{2}}{n_h}
= 400(400 - 71)\,\frac{74.7^{2}}{71} + \ldots = 18610118.04$$

(note there is a ZERO contribution to the sum from stratum 3, where we have a 100%
sample), so the estimated standard error is 4313.9.
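A sketch of the sample-size calculation, using only the stratum totals quoted above (the individual N_h and S_h values, other than stratum 1's N = 400 and s = 74.7, are not reproduced in this solution):

```python
# Neyman-allocation sample size for estimating a population total with
# margin of error d = 8000 at 95% confidence (tail probability 0.025).
sum_NS = 133903.0       # sum over strata of N_h * S_h
sum_NS2 = 47882186.0    # sum over strata of N_h * S_h**2
d, z = 8000.0, 1.96

V = (d / z) ** 2                      # target variance of the estimator
n = sum_NS ** 2 / (V + sum_NS2)       # 277.8, so take n = 278
print(n)

# Each stratum h then receives n_h proportional to N_h * S_h:
#   n_h = n * N_h * S_h / sum_NS  (rounded, and capped at N_h).
```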