AP Stats
Analysis
I. Categorical vs. Quantitative Variables
II. Meaning of Distribution
III. Graphs
  A. Bar charts & pie charts (of limited use)
  B. Dotplots and histograms
  C. Stem and Leaf (Sideways stemplot = histogram!)
  D. Ogive (cumulative frequency)
IV. Interpretation
  A. Center
    1. Mean: balance point of curve
    2. Median: equal areas point of curve
    3. Mode: highest mountain(s) of curve
  B. Spread
    1. Standard Deviation: roughly the typical distance from the mean
    2. Quartiles
    3. Outliers? KNOW THE IQR Test - Page 80 (outlier if below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR)
  C. Shape
    1. Symmetric?
    2. Bimodal?
    3. Uniform?
    4. Skewness? (Remember: the side with the long tail is the direction of the skew!)
  D. Keep in mind Resistant Measures vs. Nonresistant Measures of Center and Spread

Chapter 2: The Normal Distributions; Calculator Stuff Covered: normcdf, invnorm, statplots
Theme: Introduction to its use and meaning
I. Normal Distribution
  A. Type of Density Curve
    1. Total Area = 1; always above horizontal axis
  B. Symmetric, bell-shaped
  C. Follows empirical rule: 68% within 1 std. dev; 95% within 2 std. dev; 99.7% within 3 std. dev
  D. Defined by mean and standard deviation: N(μ, σ)
  E. Standard Normal Curve: N(0, 1). mean = 0, std. dev = 1.
  F. There are infinitely many normal curves!
  G. z-score: # of std devs from the mean; z = (x − μ)/σ
    1. Converts any normal curve into z-units along the standard normal curve
  H. Also used for probability problems
  I. To determine if a distribution is normal:
    1. Normal quantile plots
    2. Histogram or stemplot

Chapter 3: Regression; Calculator Stuff Covered: LinReg(ax + b); statplots
Theme: Pictorial and mathematical representation of bivariate data
I. Scatterplots
  A. Explanatory (input) vs. Response (output) QUANTITATIVE variables
  B. Interpretation
    1. Direction - positive or negative or no association
    2. Form - clusters, gaps, outliers, influential points
    3. Strength - correlation
II. Correlation: measures strength and direction of a linear association; −1 ≤ r ≤ 1
  A. r = 1 perfect positive association
  B. r = −1 perfect negative association
  C. r = 0 no linear association (but that doesn't necessarily mean the data is randomly scattered; a curved pattern can also give r = 0)
  D. Unitless measure; nonresistant
  E. r²: coefficient of determination; the fraction of the variability in the response variable explained by the regression line on the explanatory variable
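
As a rough illustration of the correlation items above, the following Python sketch computes r and r² straight from the definition. The function name `correlation` and both data sets are made up for this example; on the calculator, LinReg(ax + b) reports the same two quantities when diagnostics are turned on.

```python
import math

def correlation(xs, ys):
    """Return the correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours studied vs. test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]
r = correlation(hours, scores)
r_squared = r ** 2  # fraction of variation in scores explained by the line

# A perfect U-shape is a strong pattern, yet its linear correlation is 0.
r_curved = correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}, r for U-shape = {r_curved:.3f}")
```

The U-shaped data set is there to show why r = 0 does not mean "randomly scattered": always look at the scatterplot as well.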
III. Least-Squares Regression
  A. Needs explanatory-response relationship
  B. Used for prediction!
  C. Makes the sum of the squares of the vertical distances of the data points to the line as small as possible
  D. Residual: observed y − predicted y
    1. Residual plot should be randomly scattered; otherwise a different model would probably be better
    2. Mean of the residuals = 0

Chapter 4: More on Bivariate Data; Calculator Stuff Covered: LinReg(ax + b) (using logs) or ExpReg or PwrReg; statplots
Theme: Modeling nonlinear data; interpreting correlation and regression; categorical data
I. Nonlinear data - exponential regression
  A. Curved scatterplot; residual plot suggests a different model is required
  B. Take the log of the response variable.
  C. Perform linear regression on the log of the response variable vs. the regular explanatory variable
II. Power regression vs. exponential regression
  A. If plot of log y vs. x is linear, use exponential regression (pg 273-275)
  B. If plot of log y vs. log x is linear, use power regression (pg 280-284)
III. Interpreting Correlation and Regression
  A. Beware of:
    1. Extrapolation
    2. Lurking variables
      a) Common response
        (1) BOTH variables are changing with respect to some unobserved third variable. Direct relationship between x and y unlikely.
      b) Confounding
        (1) The effects of several explanatory variables on the response are mixed together. It might be x causing y or it might be other variables. We can't separate the effects.
      c) Pg 212 for pictorial representations
    3. ASSOCIATION ≠ CAUSATION!
IV. Categorical Data
  A. Often uses counts or percents
  B. Often uses a two-way table
  C. Marginal distribution: distribution of row or column variable ALONE (pg 293)
  D. Conditional Distribution: distribution of row variable with respect to a certain column or distribution of a column variable with respect to a certain row
  E. Simpson's Paradox: Aggregation of data can give misleading results (pg 300)
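
To make the log-transform steps from sections I and II of this chapter concrete, here is a minimal Python sketch of an exponential fit: take log10 of the response, run ordinary least squares, then undo the log. The helper names `least_squares` and `fit_exponential` and the data are hypothetical; this is one way to mirror what ExpReg does, not a description of the calculator's internals.

```python
import math

def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def fit_exponential(xs, ys):
    """Fit y = a * b**x by regressing log10(y) on x."""
    slope, intercept = least_squares(xs, [math.log10(y) for y in ys])
    return 10 ** intercept, 10 ** slope  # (a, b)

# Hypothetical data that doubles at each step: y = 3 * 2**x.
xs = [0, 1, 2, 3, 4]
ys = [3, 6, 12, 24, 48]
a, b = fit_exponential(xs, ys)
print(f"y is approximately {a:.2f} * {b:.2f}**x")
```

For a power model, the same idea applies with the log taken of both x and y before the linear fit.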

Chapter 5: Design of Experiments and Studies; Calculator Stuff Covered: RandInt(). (You don't really need it, since you can always use a table of random digits, TRD)
Theme: Proper experimental and study design
I. Designing Samples
  A. Randomness a must
    1. Random sample: every individual has an equal chance of being chosen
    2. SRS: every set of n individuals has an equal chance of being chosen
  B. Bad stuff: voluntary response, nonresponse; anecdotal evidence; bias; lack of realism; convenience sampling; question wording; response bias
  C. Nonsampling error: all the stuff listed above. Fatal errors to a survey.
  D. Sampling error = natural variation (this is OK); we can reduce this with larger sample sizes
  E. Table of Random Digits: used for simulation and assignment of subjects to groups
  F. Other ways of sampling:
    1. Stratification
    2. Multistage
II. Designing Experiments
  A. Must have a treatment
  B. Principles of Experimental Design
    1. Control: control the effects of lurking variables by comparing several treatments
      a) Blocking
      b) Placebo group (if possible)
      c) Double blind (if possible)
    2. Randomization: state how randomization will occur in the selection process
    3. Replication: larger sample sizes = less natural variation error
  C. Pictures are a great way to explain experiments: pg 272
  D. Statistically Significant: effect too large to attribute to chance alone
  E. Matched Pairs Design: two treatments for one subject
    1. The order in which the treatments are applied needs to be randomized!
III. Running a simulation:
  A. Clearly assign the digits (e.g. 00-63 = getting oyster with a pearl; 64-99 = no pearl)
    1. If you're using two digits, then you're using two digits. (e.g. write 00-63 = success, not 0-63)
  B. Stopping rule: what causes the simulation to end? (e.g. Stop once two oysters with pearls are found.)
    1. Don't forget to state whether repeats are OK or not.
      a) When simulating a probability, repeats are OK
      b) If you're selecting actual individuals to use in a study/experiment, you can't select the same person twice, so repeats are not OK. Think!!
  C. Simulate as many repetitions as they tell you to in the problem
    1. Label directly on the table of random digits to make it clear to both you and the grader when a success is found and when a trial ends
  D. State your conclusions. (e.g. In our 7 trials, we had 3 successes. So we estimate our probability of success to be 3/7.)

Chapter 6: Probability
Theme: Probability forms the basis of inference - the ability to have confidence in our answers.
I. Randomness can have a long term pattern. We use this phenomenon.
II. Tree diagrams are the best way to set up most probability problems
III. Always know if you are dealing with independent events.
IV. Always know if you are dealing with disjoint events.
V. P(A or B) = P(A) + P(B) − P(A and B); for disjoint events: P(A and B) = 0
VI. Independent Events: P(A and B) = P(A)P(B)
VII. Conditional Probability: P(B | A) = P(A and B) / P(A)
  A. Prob B given A = Prob Both / Prob Given
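
The rules above can be checked on a small example. This Python sketch uses one roll of a fair die; the events A and B and the helper `prob` are assumptions chosen for illustration, and exact fractions are used so the equalities hold without rounding.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die
A = {2, 4, 6}                       # event A: roll is even
B = {4, 5, 6}                       # event B: roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

p_a_and_b = prob(A & B)
p_a_or_b = prob(A) + prob(B) - p_a_and_b      # general addition rule
p_b_given_a = p_a_and_b / prob(A)             # Prob Both / Prob Given
independent = p_a_and_b == prob(A) * prob(B)  # multiplication rule check
disjoint = p_a_and_b == 0

print(p_a_or_b, p_b_given_a, independent, disjoint)
```

Here A and B overlap (both contain 4 and 6), so they are neither disjoint nor independent, which is why the full addition rule with the subtraction is needed.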

Chapter 7: Random Variables; Calculator Stuff Covered: invNorm; normCDF; 1-var stats using two lists
Theme: Exploring Discrete and Continuous Random Variables
I. Random Variable: variable whose value is a numerical outcome of a random phenomenon
II. We sometimes use probability tables for discrete random variables. (Pg 369)
III. We can then use the table to create a probability histogram
IV. Normal distribution = Continuous probability distribution
V. Mean of a DRV = Expected Value; pg 483
VI. Std Dev & Variance of a DRV: pg 485
VII. Rules for Means and Variances
  A. μ(aX + b) = a μ(X) + b
  B. μ(X ± Y) = μ(X) ± μ(Y)
  C. Var(aX + b) = a² Var(X)
  D. Var(X ± Y) = Var(X) + Var(Y) (Assuming X and Y are independent)
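
As a sanity check on the rules for means and variances, here is a short Python sketch that computes them directly from probability tables. The tables X and Y, the constants a and b, and the helper names are all hypothetical; the sum X + Y is built under the assumption that X and Y are independent.

```python
from fractions import Fraction as F
from itertools import product

def mean(dist):
    """Expected value of a discrete random variable {value: probability}."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Variance of a discrete random variable {value: probability}."""
    mu = mean(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

# Hypothetical probability tables (value -> probability).
X = {0: F(1, 4), 1: F(1, 2), 2: F(1, 4)}
Y = {0: F(1, 2), 10: F(1, 2)}

# Linear transformation T = aX + b.
a, b = 3, 2
T = {a * x + b: p for x, p in X.items()}

# Distribution of S = X + Y, assuming X and Y are independent.
S = {}
for (x, px), (y, py) in product(X.items(), Y.items()):
    S[x + y] = S.get(x + y, 0) + px * py

print(mean(T), variance(T), mean(S), variance(S))
```

Note that adding the constant b shifts the mean but leaves the variance alone, while multiplying by a scales the variance by a².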

Chapter 8: Binomial and Geometric Distribution; Calculator Stuff Covered: BinomCDF, BinomPDF; statplot; geometpdf
Theme: Moving beyond the normal distribution
I. Binomial Distribution: B(n, p); n = # trials; p = probability of success
  A. Two possibilities: success and failure
  B. Fixed number of observations
  C. Observations are independent
  D. Probability of success doesn't change from trial to trial
II. Specific Binomial Probability = BinomPDF = formula on formula sheet!
  A. Example: P(X = 3) is a PDF
III. Range of Binomial Probabilities = BinomCDF
  A. Example: P(X > 3) is a CDF (BinomCDF gives P(X ≤ k), so P(X > 3) = 1 − BinomCDF(n, p, 3))
IV. Remember, a binomial distribution is a DISCRETE distribution.
  A. Thus: P(X > 3) ≠ P(X ≥ 3)
    1. P(X > 3) = P(X = 4) + P(X = 5) + ... while P(X ≥ 3) = P(X = 3) + P(X = 4) + ...
  B. For continuous distributions, > and ≥ (or < and ≤) don't change anything.
V. Mean and Standard Deviation of Binomial Distribution: If X ~ B(n, p)
  A. Mean: μ = np
  B. Std. Dev: σ = √(npq)
VI. Geometric Distribution: # of trials until first success
  A. Two possibilities: success and failure
  B. Probability doesn't change from trial to trial
  C. Observations are independent
  D. P(X = n) = p q^(n−1) <--- Formula for first success on nth trial
  E. Mean: μ = 1/p
  F. Probability that it takes more than n trials to see first success: P(X > n) = q^n

Chapter 9: Sampling Distributions; Calculator Stuff Covered: NormalCDF, invNorm
Theme: Working towards inference and the normal approximation to the binomial; CLT
I. Parameter = description of population
II. Statistic = description of sample
III. Sampling Variability = natural variation = fact that even correctly computed statistics rarely equal the parameter value
IV. Sampling Distribution: distribution of values taken by the statistic in all possible samples of the same size from the same population
  A. Example: Taking 1 million samples of size 50 from the population and computing and graphing the statistic for each sample would give a good sense of the sampling distribution.
V. Exploring bias and variability
  A. Bias = systematically away from the true center
  B. Variability = spread about the sample center
  C. See pg 576, 578
VI. Normal approximation to the binomial
  A. Reason this works: sampling distribution of p̂ is close to normal for n large. See pg 582
  B. As n → ∞, p̂ → p
  C. As n → ∞, std deviation gets smaller and smaller
  D. If np ≥ 10 and nq ≥ 10 and population is at least ten times larger than the sample, then:
    1. If X ~ B(n, p) then X is approximately N(np, √(npq))
    2. p̂ is approximately N(p, √(pq/n))
    3. Remember: x̄ ~ N(μ, σ/√n)
      a) The sampling distribution of x̄ is approximately normal REGARDLESS of the shape of the population distribution if the sample size is large enough! (This is the Central Limit Theorem.)
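
To see how good the normal approximation to the binomial is, this Python sketch compares an exact binomial probability with the N(np, √(npq)) value. The numbers n = 100, p = 0.5, k = 55 are hypothetical, and `binom_cdf` is a hand-written stand-in for the calculator's BinomCDF, not a library function.

```python
import math
from statistics import NormalDist

def binom_cdf(n, p, k):
    """Exact P(X <= k) for X ~ B(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k + 1))

n, p, k = 100, 0.5, 55   # hypothetical: 100 trials, at most 55 successes
q = 1 - p
conditions_met = n * p >= 10 and n * q >= 10   # rule of thumb from the outline

exact = binom_cdf(n, p, k)
approx = NormalDist(mu=n * p, sigma=math.sqrt(n * p * q)).cdf(k)
print(f"conditions met: {conditions_met}")
print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```

The sketch applies no continuity correction, so the two answers differ a little even though the conditions are met; the gap shrinks as n grows.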

Chapters 10-11: Introduction to Inference; Calculator Stuff Covered: Zinterval; Z-test
Theme: Introduction to the calculations and meaning of significance tests and confidence.
I. Confidence Interval: estimate ± critical value × std. error of the estimate
  A. Net we're using to try and catch the parameter
  B. Confidence Level tells us how often our method is likely to capture the true parameter
  C. Margin of Error: critical value × std error of the estimate
    1. Determines the size of our net
    2. Takes into account sampling error - natural variation
    3. Does not rescue us from nonsampling error - voluntary response, etc.
  D. Our goal: high confidence, low margin of error
    1. Meaning: we have a small net that is still likely to catch the parameter!
  E. How to reduce margin of error:
    1. Choose a smaller confidence level
    2. Take a bigger sample
    3. Improve the measurement process (decreases std deviation)
  F. If you know the margin of error you want, you can figure out the sample size you need!
    1. Remember to always round up to the next highest whole number!!!
  G. Rules/Cautions about Confidence Intervals
    1. Need an SRS!
      a) More complicated designs like multistage and stratified require different methods
    2. The confidence interval is not a resistant measure because it is based on nonresistant measures such as x̄!
  H. Small sample size and nonnormal population will skew the confidence interval.
  I. Remember, the confidence level refers to the likelihood that our method worked, NOT to how likely our answer is to be correct. Our answer is either correct or incorrect.
II. Significance Testing: how likely are we to have gotten our results by chance alone?
  A. Null Hypothesis: Ho, no change hypothesis. (No effect present)
    1. Always written in terms of PARAMETERS
  B. Alternative Hypothesis: Ha, the effect we suspect is true
    1. Always written in terms of PARAMETERS
  C. P-value: how likely a result like ours would be by pure chance alone if the null hypothesis were true
    1. It is the probability of getting our statistic or a more extreme one assuming the null hypothesis is true
    2. Low p-value ALWAYS indicates strong evidence against the null hypothesis
  D. Significance Level: The level at which we would reject the null hypothesis; α
  E. If P-value < α, we have statistical significance
    1. Our effect is unlikely to have occurred by chance alone
    2. We reject the null hypothesis
    3. We accept the alternative hypothesis
  F. If P-value ≥ α, we do NOT have statistical significance
    1. We don't have sufficient evidence to say our effect is anything more than pure chance
    2. We fail to reject the null. (Note: we can NEVER accept the null)
    3. We don't have sufficient evidence at the α level of significance...
  G. More on P-value: Suppose p-value = 0.03. This means we would only expect to get our result or a more extreme one 3% of the time if the null hypothesis was true.
  H. Test Statistic: measures the compatibility between the data and null hypothesis
    1. For the z-test of a mean: z = (x̄ − μ0) / (σ/√n)
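
The pieces of a significance test can be sketched in a few lines of Python. This is a minimal two-sided one-sample z test under the assumption that σ is known; the function name `z_test` and all the numbers are hypothetical, and the calculator's Z-test does the equivalent work.

```python
import math
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n):
    """Two-sided one-sample z test; returns (z, p_value)."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Probability of a statistic at least this extreme if H0 is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical numbers: H0: mu = 100, known sigma = 15, n = 36, x-bar = 105.
z, p_value = z_test(sample_mean=105, mu0=100, sigma=15, n=36)
alpha = 0.05
reject_null = p_value < alpha
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_null}")
```

With a one-sided alternative, the factor of 2 would be dropped and only the tail in the direction of Ha would be used.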
III. Proper Outline of a Significance Test
  A. State Hypotheses
    1. Have a clear conception of the null hypothesis
    2. Carefully decide whether the alternative hypothesis is one-sided or two-sided
  B. Check Assumptions
    1. Show your work!
  C. Test statistic formula, test statistic value, p-value, degrees of freedom (if applicable)
  D. Conclusion in context of the problem!
  E. Know how to draw the associated p-value picture if required
IV. Proper Outline of a Confidence Interval
  A. Check Assumptions
    1. Show your work
    2. Confidence intervals have the same assumptions as their corresponding significance tests
  B. Confidence interval formula, confidence interval results, degrees of freedom (if applicable)
  C. State conclusion in context of the problem!
  D. Know how to draw the associated picture if required
V. Cautions about significance tests
  A. Rejecting the null doesn't mean we have evidence that a strong effect is present. It just means we have strong evidence that some effect is present. It may be large or small.
    1. A confidence interval gives you a better sense of the size of an effect!
  B. SRS still needed!
  C. Statistical inference doesn't work on all data sets. See Hawthorne Effect (wikipedia).
VI. Type I and Type II errors
  A. Remember your table!
  B. Type I: Rejecting the null when the null is true; probability = α
    1. Confidence = 1 − α
  C. Type II: Failing to reject the null when the null is false; probability = 1 − Power
  D. Power: Prob that you correctly reject the null for some specified value of the alternative

Chapters 12-13: Significance Tests in Practice
Calculator Stuff Covered: TInterval, T-Test, 2-SampTInt, 2-SampT-Test; 2Samp-ZInt; 2-Samp-ZTest; 1propZtest, 2propZtest, 1propZInt, 2propZInt
Theme: Now we get into more practical test statistics, such as t.
I. Standard Error: When the Std Deviation is estimated from the data
  A. Example: Std. Error of x̄ is s/√n
  B. t-distribution
    1. Family of curves determined by degrees of freedom
    2. Symmetric
    3. Bell-shaped
    4. Spread is larger than for a normal distribution
    5. As df → ∞, t-distribution looks like a normal distribution
  C. Matched Pairs: one group of subjects given two treatments
    1. You take the differences in the data and run a 1 sample t-test on those differences
    2. Null hypothesis is that the mean difference is zero
  D. Robust: How much the assumptions can be violated and still get credible results
    1. Very Robust: Major violations of assumptions still yield good results
    2. For n < 15, data should be close to normal to use a t-distribution
    3. For 15 ≤ n < 40, t is safe to use as long as there are no outliers or strong skewness
    4. For n ≥ 40, t is safe to use.
  E. Two-Sample Stuff: Two groups each getting a different treatment!
    1. Groups should be independent of each other
    2. Groups can be different sample sizes
    3. Assumption is TWO SRS and that BOTH populations are normal
    4. Use Two-Sample z-test if we know population standard deviations
    5. Use Two-Sample t-test if we don't know population standard deviations
      a) Can use pooled if population variances can be considered equal
    6. Two sample t-procedures are even MORE robust than 1-sample t-procedures.
      a) For n1 + n2 < 15, both populations should be close to normal
      b) For 15 ≤ n1 + n2 < 40, safe as long as no outliers or strong skewness
      c) For n1 + n2 ≥ 40, safe to use regardless of population distribution
    7. Conclusions for a two sample: Describing the differences between populations
      a) Example: We are 95% confident the true difference between the population means is...
      b) Example: At the 5% level of significance, we have strong evidence that there is a difference between the means of the two populations...

Theme: Everything up until now was estimating means. Now we move on to proportions!
I. Proportion = Count of successes / sample size
II. We use p̂ (the sample proportion) to estimate p (the population proportion)
III. For a large sample, p̂ is approximately normal! p̂ ~ N(p, √(pq/n))
  A. Remember, p̂ is NEVER binomial: the count of successes is binomial (DISCRETE), while the proportion p̂ is modeled with a CONTINUOUS normal curve
  B. Standard Error of p̂ = estimated Std. Dev of p̂ = √(p̂q̂/n)
IV. Significance Testing:
  A. Assumptions: np ≥ 10, nq ≥ 10 (where p comes from the null hypothesis), SRS, population at least 10 times larger than the sample
  B. Test Statistic: z = (p̂ − p) / √(pq/n)
V. Confidence Interval:
  A. Assumptions: np̂ ≥ 10 and n(1 − p̂) ≥ 10
  B. Interval: p̂ ± z* √(p̂q̂/n)
VI. Sample size for a desired margin of error m:
  A. n = (z*)² / (4m²)
  B. This answer assumes .3 < p* < .7. If this isn't true, it will simply yield a larger sample size than you really need. Thus, this is called the conservative value for n.
    1. Where we assume p = 0.5 since that value maximizes the margin of error
VII. Two proportions
  A. Assumptions: n1 p1 ≥ 10, n2 p2 ≥ 10, n1 q1 ≥ 10, n2 q2 ≥ 10, two independent SRS, both populations at least 10 times larger than their sample sizes
  B. Conclusions similar to above:
    1. Example: We are 95% confident the true difference in proportions is...
    2. Example: At the 5% level of significance, we have evidence of a difference between the two population proportions...
      a) DON'T SAY TRUE MEAN PROPORTION

Chapter 14: Inference for Tables: Chi-Square; Calc Stuff Covered: Chi-SqGOF; Chi-Sq-2Way
Theme: We need a way to perform inference on categorical data!
I. Chi-Square: Two types!
  A. Chi-Square Goodness of fit: allows us to compare more than two proportions to each other by looking at the counts!
  B. Chi-Square Test of Independence: Tests whether the distribution of one categorical variable is associated with another categorical variable
  C. Distribution:
    1. ALWAYS right skewed
    2. Rejection region is always in the right tail
    3. Shape of distribution depends on degrees of freedom
      a) The higher the degrees of freedom, the more normal Chi-Square looks!
      b) Chi-Square with 1 degree of freedom = standard normal distribution squared!
  D. Significance Testing for Test of Independence:
    1. Assumptions: MUST SHOW & EVALUATE EXPECTED CELL COUNTS!
      a) For 2 X 2 table, ALL EXPECTED CELL COUNTS ≥ 5
      b) For larger than a 2 X 2:
        (1) AVERAGE of EXPECTED cell counts ≥ 5
        (2) NO expected cell count < 1.
        (3) Note: Expected Cell Count = (row total × column total) / n
      c) SRS
    2. Hypotheses:
      a) Ho: No association between [row and column variable names]
      b) Ha: There exists an association between [row and column variable names]
    3. Degrees of freedom = (# rows minus 1)(# columns minus 1)
  E. Significance Testing for Goodness of Fit
    1. Assumptions: Every expected cell count is ≥ 5. YOU MUST SHOW EXP CELL COUNTS!
    2. Hypotheses:
      a) Ho: The population proportions are as stated...[state them]
      b) Ha: At least one of the population proportions differs from the stated ones
      --OR--
      a) Ho: The population distribution is as stated [state distribution]
      b) Ha: The population distribution is not as stated
    3. Test Statistic: χ² = Σ (observed − expected)² / expected
    4. Degrees of Freedom: # categories minus 1

Chapter 14: Inference on Regression; Calculator Stuff Covered: LinRegT-Test, statplot, LSCI program (TI83/TI84); LinRegT-Int
Theme: We're testing whether or not regression is worthwhile to perform!
I. Significance Testing
  A. Assumptions:
    1. Make a scatterplot of y vs. x: is the trend roughly linear?
      a) Calculate and graph the regression line on your scatterplot
        (1) Look at r². How much of the natural variation in the response variable is accounted for by the regression line?
      b) Look for outliers and influential points
    2. Std. Deviation of response fairly constant
      a) Residual plot shouldn't be excessively curved or fanning out
    3. Response varies normally around the regression line
      a) Make a normal quantile plot of the residuals. Is it linear?
      b) Or you can make a histogram or boxplot of the residuals. Fairly symmetric with no outliers?
    4. Observations independent
  B. We're performing inference on the slope of the regression line.
    1. Null hypothesis: Slope (beta) = 0 means regression useless
    2. Alternative: Slope (beta) ≠ 0 or Slope > 0 or Slope < 0
  C. Test Statistic: Uses t-distribution with (n − 2) degrees of freedom
    1. t = b / SEb
II. Confidence Interval on Slope
  A. b ± t* SEb (These always involve using computer output to construct)
III. Mean response confidence interval: answers the question: We are x% confident that for everyone in the population that did x, we expect the average of their responses to be in the interval...
IV. Prediction Interval: answers the question: We are x% confident that if an individual did x, we expect their response to be in the interval...
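
For the slope test statistic and the slope confidence interval in this chapter, the following Python sketch shows where b, SEb, t, and the degrees of freedom come from. The function name `slope_inference` and the data are hypothetical, and the critical value t* is passed in as an argument because it would normally be read from a t table or computer output.

```python
import math

def slope_inference(xs, ys, t_star):
    """Return slope b, SE_b, t = b / SE_b, df, and the interval b +/- t* SE_b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    a = mean_y - b * mean_x
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))   # std. deviation of residuals about the line
    se_b = s / math.sqrt(sxx)      # standard error of the slope
    t = b / se_b                   # tests H0: slope (beta) = 0
    return b, se_b, t, n - 2, (b - t_star * se_b, b + t_star * se_b)

# Hypothetical data; t* = 3.182 is the 95% critical value for df = 3 from a t table.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b, se_b, t, df, interval = slope_inference(xs, ys, t_star=3.182)
print(f"b = {b:.3f}, SE_b = {se_b:.4f}, t = {t:.1f}, df = {df}")
print(f"95% CI for slope: ({interval[0]:.3f}, {interval[1]:.3f})")
```

An interval for the slope that does not contain 0 agrees with rejecting the null hypothesis of a zero slope at the matching significance level. The P-value itself still needs a t table or LinRegT-Test.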