
The Analysis of Biological Data, Second Edition


The Analysis of Biological Data
Second Edition

Michael C. Whitlock and Dolph Schluter

Publisher: Ben Roberts
Proofreader: Kathi Townes
Art Studio: Lineworks, Inc.; Lori Heckelman
Cover Designer: Emiko Paul
Photo Researcher: Jennifer Simmons, Lumina Datamatics, Inc.
Permissions Assistant: Michael McCarthy
Compositor: Kristina Elliott at TECHarts

©2015 by W. H. Freeman and Company

Reproduction or translation of any part of this work beyond that permitted by Section 107 or 108 of the 1976 United States Copyright Act without permission of the copyright owner is unlawful. Requests for permission or further information should be addressed to the W. H. Freeman and Company Rights and Permissions Department.

Grateful acknowledgment for third-party permissions, which have been granted for material in this title not owned by Macmillan Learning, appears in the Photo Credits section, which represents an extension of this copyright page.

Cover Photo: ©www.pheromonegallery.com

ISBN: 978-1-319156-71-8

Library of Congress Cataloging-in-Publication Data
Whitlock, Michael, author.
The analysis of biological data / Michael C. Whitlock and Dolph Schluter. -- Second edition.
pages cm
Includes bibliographical references and index.
ISBN 978-1-319156-71-8
1. Biometry--Textbooks. I. Schluter, Dolph, author. II. Title.
QH323.5.W48 2015
570.1’5195--dc23
2014010300

10 9 8 7 6 5 4

W. H. Freeman and Company
One New York Plaza, Suite 4500
New York, NY 10004-1562
www.macmillanhighered.com

To Sally and Wilson, Andrea and Maggie

Contents in brief

Preface
Acknowledgments
About the authors

PART 1 INTRODUCTION TO STATISTICS
1. Statistics and samples
INTERLEAF 1 Biology and the history of statistics
2. Displaying data
3. Describing data
4. Estimating with uncertainty
INTERLEAF 2 Pseudoreplication
5. Probability
6. Hypothesis testing
INTERLEAF 3 Why statistical significance is not the same as biological importance

PART 2 PROPORTIONS AND FREQUENCIES
7. Analyzing proportions
INTERLEAF 4 Correlation does not require causation
8. Fitting probability models to frequency data
INTERLEAF 5 Making a plan
9. Contingency analysis: associations between categorical variables
Review Problems 1

PART 3 COMPARING NUMERICAL VALUES
10. The normal distribution
INTERLEAF 6 Controls in medical studies
11. Inference for a normal population
12. Comparing two means
INTERLEAF 7 Which test should I use?
13. Handling violations of assumptions
Review Problems 2
14. Designing experiments
INTERLEAF 8 Data dredging
15. Comparing means of more than two groups
INTERLEAF 9 Experimental and statistical mistakes

PART 4 REGRESSION AND CORRELATION
16. Correlation between numerical variables
INTERLEAF 10 Publication bias
17. Regression
INTERLEAF 11 Using species as data points
Review Problems 3

PART 5 MODERN STATISTICAL METHODS
18. Multiple explanatory variables
19. Computer-intensive methods
20. Likelihood
21. Meta-analysis: combining information from multiple studies

Statistical tables
Literature cited
Answers to practice problems
Photo credits
Index

Contents

Preface
Acknowledgments
About the authors

PART 1 INTRODUCTION TO STATISTICS

1. Statistics and samples
  1.1 What is statistics?
  1.2 Sampling populations
    Example 1.2: Raining cats
    Populations and samples
    Properties of good samples
    Random sampling
    How to take a random sample
    The sample of convenience
    Volunteer bias
    Data in the real world
  1.3 Types of data and variables
    Categorical and numerical variables
    Explanatory and response variables
  1.4 Frequency distributions and probability distributions
  1.5 Types of studies
  1.6 Summary
  Practice problems
  Assignment problems

INTERLEAF 1 Biology and the history of statistics

2. Displaying data
  2.1 Guidelines for effective graphs
    How to draw a bad graph
    How to draw a good graph
  2.2 Showing data for one variable
    Showing categorical data: frequency table and bar graph
    Example 2.2A: Crouching tiger
    Making a good bar graph
    A bar graph is usually better than a pie chart
    Showing numerical data: frequency table and histogram
    Example 2.2B: Abundance of desert bird species
    Describing the shape of a histogram
    How to draw a good histogram
    Other graphs for numerical data
  2.3 Showing association between two variables
    Showing association between categorical variables
    Example 2.3A: Reproductive effort and avian malaria
    Showing association between numerical variables: scatter plot
    Example 2.3B: Sins of the father
    Showing association between a numerical and a categorical variable
    Example 2.3C: Blood responses to high elevation
  2.4 Showing trends in time and space
    Line graph
    Example 2.4A: Bad science can be deadly
    Maps
    Example 2.4B: Biodiversity hotspots
  2.5 How to make good tables
    Follow similar principles for display tables
  2.6 Summary
  Practice problems
  Assignment problems

3. Describing data
  3.1 Arithmetic mean and standard deviation
    Example 3.1: Gliding snakes
    The sample mean
    Variance and standard deviation
    Rounding means, standard deviations, and other quantities
    Coefficient of variation
    Calculating mean and standard deviation from a frequency table
    Effect of changing measurement scale
  3.2 Median and interquartile range
    Example 3.2: I’d give my right arm for a female
    The median
    The interquartile range
    The box plot
  3.3 How measures of location and spread compare
    Example 3.3: Disarming fish
    Mean versus median
    Standard deviation versus interquartile range
  3.4 Cumulative frequency distribution
    Percentiles and quantiles
    Displaying cumulative relative frequencies
  3.5 Proportions
    Calculating a proportion
    The proportion is like a sample mean
  3.6 Summary
  3.7 Quick Formula Summary
  Practice problems
  Assignment problems

4. Estimating with uncertainty
  4.1 The sampling distribution of an estimate
    Example 4.1: The length of human genes
    Estimating mean gene length with a random sample
    The sampling distribution of Y¯
  4.2 Measuring the uncertainty of an estimate
    Standard error
    The standard error of Y¯
    The standard error of Y¯ from data
  4.3 Confidence intervals
    The 2SE rule of thumb
  4.4 Error bars
  4.5 Summary
  4.6 Quick Formula Summary
  Practice problems
  Assignment problems

INTERLEAF 2 Pseudoreplication
5. Probability
  5.1 The probability of an event
  5.2 Venn diagrams
  5.3 Mutually exclusive events
  5.4 Probability distributions
    Discrete probability distributions
    Continuous probability distributions
  5.5 Either this or that: adding probabilities
    The addition rule
    The probabilities of all possible mutually exclusive outcomes add to one
    The general addition rule
  5.6 Independence and the multiplication rule
    Multiplication rule
    Example 5.6A: Smoking and high blood pressure
    “And” versus “or”
    Independence of more than two events
    Example 5.6B: Mendel’s peas
  5.7 Probability trees
    Example 5.7: Sex and birth order
  5.8 Dependent events
    Example 5.8: Is this meat taken?
  5.9 Conditional probability and Bayes’ theorem
    Conditional probability
    The general multiplication rule
    Sampling without replacement
    Bayes’ theorem
    Example 5.9: Detection of Down syndrome
  5.10 Summary
  Practice problems
  Assignment problems

6. Hypothesis testing
  6.1 Making and using hypotheses
    Null hypothesis
    Alternative hypothesis
    To reject or not to reject
  6.2 Hypothesis testing: an example
    Example 6.2: The right hand of toad
    Stating the hypotheses
    The test statistic
    The null distribution
    Quantifying uncertainty: the P-value
    Draw the appropriate conclusion
    Reporting the results
  6.3 Errors in hypothesis testing
    Type I and Type II errors
  6.4 When the null hypothesis is not rejected
    Example 6.4: The genetics of mirror-image flowers
    The test
    Interpreting a non-significant result
  6.5 One-sided tests
  6.6 Hypothesis testing versus confidence intervals
  6.7 Summary
  Practice problems
  Assignment problems

INTERLEAF 3 Why statistical significance is not the same as biological importance

PART 2 PROPORTIONS AND FREQUENCIES

7. Analyzing proportions
  7.1 The binomial distribution
    Formula for the binomial distribution
    Number of successes in a random sample
    Sampling distribution of the proportion
  7.2 Testing a proportion: the binomial test
    Example 7.2: Sex and the X
    Approximations for the binomial test
  7.3 Estimating proportions
    Example 7.3: Radiologists’ missing sons
    Estimating the standard error of a proportion
    Confidence intervals for proportions—the Agresti–Coull method
    Confidence intervals for proportions—the Wald method
  7.4 Deriving the binomial distribution
  7.5 Summary
  7.6 Quick Formula Summary
  Practice problems
  Assignment problems

INTERLEAF 4 Correlation does not require causation

8. Fitting probability models to frequency data
  8.1 Example of a probability model: the proportional model
    Example 8.1: No weekend getaway
  8.2 χ2 goodness-of-fit test
    Null and alternative hypotheses
    Observed and expected frequencies
    The χ2 test statistic
    The sampling distribution of χ2 under the null hypothesis
    Calculating the P-value
    Critical values for the χ2 distribution
  8.3 Assumptions of the χ2 goodness-of-fit test
  8.4 Goodness-of-fit tests when there are only two categories
    Example 8.4: Gene content of the human X chromosome
  8.5 Fitting the binomial distribution
    Example 8.5: Designer two-child families?
  8.6 Random in space or time: the Poisson distribution
    Formula for the Poisson distribution
    Testing randomness with the Poisson distribution
    Example 8.6: Mass extinctions
    Comparing the variance to the mean
  8.7 Summary
  8.8 Quick Formula Summary
  Practice problems
  Assignment problems

INTERLEAF 5 Making a plan

9. Contingency analysis: associations between categorical variables
  9.1 Associating two categorical variables
  9.2 Estimating association in 2 × 2 tables: odds ratio
    Odds
    Example 9.2: Take two aspirin and call me in the morning?
    Odds ratio
    Standard error and confidence interval for odds ratio
  9.3 Estimating association in 2 × 2 tables: relative risk
    Odds ratio vs. relative risk
    Example 9.3: Your litter box and your brain
  9.4 The χ2 contingency test
    Example 9.4: The gnarly worm gets the bird
    Hypotheses
    Expected frequencies assuming independence
    The χ2 statistic
    Degrees of freedom
    P-value and conclusion
    A shortcut for calculating the expected frequencies
    The χ2 contingency test is a special case of the χ2 goodness-of-fit test
    Assumptions of the χ2 contingency test
    Correction for continuity
  9.5 Fisher’s exact test
    Example 9.5: The feeding habits of vampire bats
  9.6 G-tests
  9.7 Summary
  9.8 Quick Formula Summary
  Practice problems
  Assignment problems

Review Problems 1

PART 3 COMPARING NUMERICAL VALUES

10. The normal distribution
  10.1 Bell-shaped curves and the normal distribution
  10.2 The formula for the normal distribution
  10.3 Properties of the normal distribution
  10.4 The standard normal distribution and statistical tables
    Using the standard normal table
    Using the standard normal to describe any normal distribution
    Example 10.4: One small step for man?
  10.5 The normal distribution of sample means
    Calculating probabilities of sample means
  10.6 Central limit theorem
    Example 10.6: Young adults and the Spanish flu
  10.7 Normal approximation to the binomial distribution
    Example 10.7: The only good bug is a dead bug
  10.8 Summary
  10.9 Quick Formula Summary
  Practice problems
  Assignment problems

INTERLEAF 6 Controls in medical studies

11. Inference for a normal population
  11.1 The t-distribution for sample means
    Student’s t-distribution
    Finding critical values of the t-distribution
  11.2 The confidence interval for the mean of a normal distribution
    Example 11.2: Eye to eye
    The 95% confidence interval for the mean
    The 99% confidence interval for the mean
  11.3 The one-sample t-test
    Example 11.3: Human body temperature
    The effects of larger sample size—body temperature revisited
  11.4 Assumptions of the one-sample t-test
  11.5 Estimating the standard deviation and variance of a normal population
    Confidence limits for the variance
    Confidence limits for the standard deviation
    Assumptions
  11.6 Summary
  11.7 Quick Formula Summary
  Practice problems
  Assignment problems

12. Comparing two means
  12.1 Paired sample versus two independent samples
  12.2 Paired comparison of means
    Estimating mean difference from paired data
    Example 12.2: So macho it makes you sick?
    Paired t-test
    Assumptions
  12.3 Two-sample comparison of means
    Example 12.3: Spike or be spiked
    Confidence interval for the difference between two means
    Two-sample t-test
    Assumptions
    A two-sample t-test when standard deviations are unequal
  12.4 Using the correct sampling units
    Example 12.4: So long; thanks to all the fish
  12.5 The fallacy of indirect comparison
    Example 12.5: Mommy’s baby, Daddy’s maybe
  12.6 Interpreting overlap of confidence intervals
  12.7 Comparing variances
    The F-test of equal variances
    Levene’s test for homogeneity of variances
  12.8 Summary
  12.9 Quick Formula Summary
  Practice problems
  Assignment problems

INTERLEAF 7 Which test should I use?
13. Handling violations of assumptions
  13.1 Detecting deviations from normality
    Graphical methods
    Example 13.1: The benefits of marine reserves
    Formal test of normality
  13.2 When to ignore violations of assumptions
    Violations of normality
    Unequal standard deviations
  13.3 Data transformations
    Log transformation
    Arcsine transformation
    The square-root transformation
    Other transformations
    Confidence intervals with transformations
    A caveat: Avoid multiple testing with transformations
  13.4 Nonparametric alternatives to one-sample and paired t-tests
    Sign test
    Example 13.4: Sexual conflict and the origin of new species
    The Wilcoxon signed-rank test
  13.5 Comparing two groups: the Mann–Whitney U-test
    Example 13.5: Sexual cannibalism in sagebrush crickets
    Tied ranks
    Large samples and the normal approximation
  13.6 Assumptions of nonparametric tests
  13.7 Type I and Type II error rates of nonparametric methods
  13.8 Permutation tests
    Assumptions of permutation tests
  13.9 Summary
  13.10 Quick Formula Summary
  Practice problems
  Assignment problems

Review Problems 2

14. Designing experiments
  14.1 Why do experiments?
    Confounding variables
    Experimental artifacts
  14.2 Lessons from clinical trials
    Example 14.2: Reducing HIV transmission
    Design components
  14.3 How to reduce bias
    Simultaneous control group
    Randomization
    Blinding
  14.4 How to reduce the influence of sampling error
    Replication
    Balance
    Blocking
    Example 14.4A: Holey waters
    Extreme treatments
    Example 14.4B: Plastic hormones
  14.5 Experiments with more than one factor
    Example 14.5: Lethal combination
  14.6 What if you can’t do experiments?
    Match and adjust
  14.7 Choosing a sample size
    Plan for precision
    Plan for power
    Plan for data loss
  14.8 Summary
  14.9 Quick Formula Summary
  Practice problems
  Assignment problems

INTERLEAF 8 Data dredging

15. Comparing means of more than two groups
  15.1 The analysis of variance
    Example 15.1: The knees who say night
    Hypotheses
    ANOVA in a nutshell
    ANOVA tables
    Partitioning the sum of squares
    Calculating the mean squares
    The variance ratio, F
    Variation explained: R2
    ANOVA with two groups
  15.2 Assumptions and alternatives
    The robustness of ANOVA
    Data transformations
    Nonparametric alternatives to ANOVA
  15.3 Planned comparisons
    Planned comparison between two means
  15.4 Unplanned comparisons
    Example 15.4: Wood wide web
    Testing all pairs of means using the Tukey–Kramer method
    Assumptions
  15.5 Fixed and random effects
  15.6 ANOVA with randomly chosen groups
    Example 15.6: Walking stick limbs
    ANOVA calculations
    Variance components
    Repeatability
    Assumptions
  15.7 Summary
  15.8 Quick Formula Summary
  Practice problems
  Assignment problems

INTERLEAF 9 Experimental and statistical mistakes

PART 4 REGRESSION AND CORRELATION

16. Correlation between numerical variables
  16.1 Estimating a linear correlation coefficient
    The correlation coefficient
    Example 16.1: Flipping the bird
    Standard error
    Approximate confidence interval
  16.2 Testing the null hypothesis of zero correlation
    Example 16.2: What big inbreeding coefficients you have
  16.3 Assumptions
  16.4 The correlation coefficient depends on the range
  16.5 Spearman’s rank correlation
    Example 16.5: The miracles of memory
    Procedure for large n
    Assumptions of Spearman’s correlation
  16.6 The effects of measurement error on correlation
  16.7 Summary
  16.8 Quick Formula Summary
  Practice problems
  Assignment problems

INTERLEAF 10 Publication bias
17. Regression
  17.1 Linear regression
    Example 17.1: The lion’s nose
    The method of least squares
    Formula for the line
    Calculating the slope and intercept
    Populations and samples
    Predicted values
    Residuals
    Standard error of slope
    Confidence interval for the slope
  17.2 Confidence in predictions
    Confidence intervals for predictions
    Extrapolation
  17.3 Testing hypotheses about a slope
    Example 17.3: Prairie home campion
    The t-test of regression slope
    The ANOVA approach
    Using R2 to measure the fit of the line to data
  17.4 Regression toward the mean
  17.5 Assumptions of regression
    Outliers
    Detecting nonlinearity
    Detecting non-normality and unequal variance
  17.6 Transformations
  17.7 The effects of measurement error on regression
  17.8 Nonlinear regression
    A curve with an asymptote
    Quadratic curves
    Formula-free curve fitting
    Example 17.8: The incredible shrinking seal
  17.9 Logistic regression: fitting a binary response variable
  17.10 Summary
  17.11 Quick Formula Summary
  Practice problems
  Assignment problems

INTERLEAF 11 Using species as data points

Review Problems 3

PART 5 MODERN STATISTICAL METHODS

18. Multiple explanatory variables
  18.1 ANOVA and linear regression are linear models
    Modeling with linear regression
    Generalizing linear regression
    General linear models
  18.2 Analyzing experiments with blocking
    Analyzing data from a randomized block design
    Example 18.2: Zooplankton depredation
    Model formula
    Fitting the model to data
  18.3 Analyzing factorial designs
    Example 18.3: Interaction zone
    Model formula
    Testing the factors
    The importance of distinguishing fixed and random factors
  18.4 Adjusting for the effects of a covariate
    Example 18.4: Mole-rat layabouts
    Testing interaction
    Fitting a model without an interaction term
  18.5 Assumptions of general linear models
  18.6 Summary
  Practice problems
  Assignment problems

19. Computer-intensive methods
  19.1 Hypothesis testing using simulation
    Example 19.1: How did he know? The nonrandomness of haphazard choice
  19.2 Bootstrap standard errors and confidence intervals
    Example 19.2: The language center in chimps’ brains
    Bootstrap standard error
    Confidence intervals by bootstrapping
    Bootstrapping with multiple groups
    Assumptions and limitations of the bootstrap
  19.3 Summary
  Practice problems
  Assignment problems

20. Likelihood
  20.1 What is likelihood?
  20.2 Two uses of likelihood in biology
    Phylogeny estimation
    Gene mapping
  20.3 Maximum likelihood estimation
    Example 20.3: Unruly passengers
    Probability model
    The likelihood formula
    The maximum likelihood estimate
    Likelihood-based confidence intervals
  20.4 Versatility of maximum likelihood estimation
    Example 20.4: Conservation scoop
    Probability model
    The likelihood formula
    The maximum likelihood estimate
    Bias
  20.5 Log-likelihood ratio test
    Likelihood ratio test statistic
    Testing a population proportion
  20.6 Summary
  20.7 Quick Formula Summary
  Practice problems
  Assignment problems

21. Meta-analysis: combining information from multiple studies
  21.1 What is meta-analysis?
    Why repeat a study?
  21.2 The power of meta-analysis
    Example 21.2: Aspirin and myocardial infarction
  21.3 Meta-analysis can give a balanced view
    Example 21.3: The Transylvania effect
  21.4 The steps of a meta-analysis
    Define the question
    Example 21.4: Testosterone and aggression
    Review the literature
    Compute effect sizes
    Determine the average effect size
    Calculate confidence intervals and test hypotheses
    Look for effects of study quality
    Look for associations
  21.5 File-drawer problem
  21.6 How to make your paper accessible to meta-analysis
  21.7 Summary
  21.8 Quick Formula Summary
  Practice problems
  Assignment problems

Statistical tables
  Using statistical tables
  Statistical Table A: The χ2 distribution
  Statistical Table B: The standard normal (Z) distribution
  Statistical Table C: The Student t-distribution
  Statistical Table D: The F-distribution
  Statistical Table E: Mann–Whitney U-distribution
  Statistical Table F: Tukey–Kramer q-distribution
  Statistical Table G: Critical values for the Spearman’s rank correlation
Literature cited
Answers to practice problems
Photo credits
Index

Preface

Modern biologists need the powerful tools of data analysis. As a result, an increasing number of universities offer, or even require, a basic data analysis course for all their biology and premedical students. We have been teaching such a course at the University of British Columbia for the last two decades. Over this period, we have sought a textbook that covered the material we needed in an introductory course at just the right level. We found that most texts were too technical and encyclopedic, or else they didn’t go far enough, missing methods that were crucial to the practice of modern biology and medicine. We wanted a book that had a strong emphasis on intuitive understanding to convey meaning, rather than an overreliance on formulas. We wanted to teach by example, and the examples needed to be interesting. Most importantly, we needed a biology book, addressing topics important to biologists and health care providers handling real data. We couldn’t find the book that we needed, so we decided to write this one to fill the gap.

We include several unusual features that we have discovered to be helpful for effectively reaching our audience:

Interesting biology examples. Our teaching has shown us that biology students learn data analysis best in the context of interesting examples drawn from the medical and biological literature. Statistics is a means to an end, a tool to learn about the human and natural world. By emphasizing what we can learn about the science, the power and value of statistics become plain. Plus, it’s just more fun for everyone concerned. Every chapter has several biological or medical examples of key concepts, and each example is prefaced by a substantial description of the biological setting. The examples are illustrated with photos of the real organisms, so that students can look at what they’re learning about. The emphasis on real and interesting examples carries into the problem sets; for each chapter, there are dozens of questions based on real data about biological and medical issues. In this Second Edition, we have added approximately 200 new problems. These include a new type of problem, called Calculation Practice, that takes the student step-by-step through the important procedures described in the chapter. The corresponding answers are provided at the back of the book so that the students can check their success at each step.
Two or three such problems are at the start of most of the chapter problem sets. We’ve also added three new sections of review problems (after Chapters 9, 13, and 17) to allow cumulative review of the important concepts up to that point in the book.

Intuitive explanations of key concepts. Statistical reasoning requires a lot of new ways of thinking. Students can get lost in the barrage of new jargon and multitudinous tests. We have found that starting from an intuitive foundation, away from all the details, is extremely valuable. We take an intuitive approach to basic questions: What’s a good sample? What’s a confidence interval? Why do an experiment? The first several chapters establish this basic knowledge, and the rest of the book builds on it.

Practical data analysis. As its title suggests, this book focuses on data rather than the mathematical foundations of statistics. We teach how to make good graphical displays, and we emphasize that a good graph is the beginning point of any good data analysis. We give equal time to estimation and hypothesis testing, and we avoid treating the P-value as an end in itself. The book does not demand a knowledge of mathematics beyond simple algebra. We focus on practicality over nuance, on biological usefulness over theoretical hand wringing. We not only teach the “right” way of doing something but also highlight some of the pitfalls that might be encountered. We demonstrate how to carry out the calculations for most methods so that the steps are not mysterious to students. At the same time, we know that a computer will be available for most calculations. Hence, we focus on the concepts of biological data analysis and how statistics can help extract scientific insight from data. With the power of modern computers at hand, the challenge in analyzing data becomes knowing what method to use and why.1 We imagine and hope that every course using this book will have a component encouraging students to use computer statistical packages. We are also aware that the diversity of such packages is immense, and so we have not tied the book to any particular program. We have heard most often from instructors who use the R package with this book, and we now provide R scripts for all the examples in this book at the book’s web site (http://whitlockschluter.zoology.ubc.ca).

Practical experimental design. A biologist cannot do good statistics—or good science—without a practical understanding of experimental design. Unlike most books, our book covers basic topics in experimental design, such as controls, randomization, pseudoreplication, and blocking, and we do it in a practical, intuitive way.

Up to date on the basics. Believe it or not, the best confidence interval for the proportion is not the one you probably learned as an undergraduate. Nonparametric statistics do not effectively test for differences in means (or medians, for that matter) without some fairly strong assumptions that we normally hear little about. For these and many other topics, we have updated the coverage of basic, everyday topics in statistics.

Coverage of modern topics. Modern biology uses a larger toolkit than the one available a generation ago. In this book, we go beyond most introductory books by establishing the conceptual principles of important topics, such as likelihood, nonlinear regression, permutation tests, meta-analysis, and the bootstrap.

Useful summaries.
Near the end of each chapter is a short, clear summary of the key concepts, and most chapters end with Quick Formula Summaries that put most equations in one easy-to-find place.

Interleaves. Between chapters are short essays that we call interleaves. These interleaves cover a variety of conceptual topics in a nontechnical way—ideas that are nevertheless crucial for the interpretation of statistical results in scientific research. Several of them focus on common mistakes in the analysis and interpretation of biological data and how to avoid them. Although the interleaves are pressed between the chapters, they complement the material in the core chapters in important ways. We believe that the interleaves discuss some of the most important topics of the book, such as the meaning of statistical significance (Interleaf 3), the difference between correlation and causation (Interleaf 4), why control treatments are necessary (Interleaf 6), and the distortions caused by publication bias (Interleaf 10). Interleaf 7 summarizes what statistical test should be used and when, and it is a good place to start when reviewing for exams.

After five years of writing—and a couple more years of work on this new edition—the result is now in your hands. We think The Analysis of Biological Data provides a good background in data analysis for biologists and medical practitioners, covering a broad range of topics in a practical and intuitive way. It works for our classes; we hope that it works for yours, too.

Organization of the book

The Analysis of Biological Data is divided into five parts, each with a handful of chapters as indicated in the table of contents. We recommend starting with the first part, because it introduces many basic concepts that are used throughout the book, such as sampling, drawing a good graph, describing data, estimating, hypothesis testing, and concepts in probability. These early chapters are meant to be read in their entirety.

After the first block, most chapters progress from the most general topics at the start to more specialized topics by the end. Each chapter is structured so that a basic understanding of the topic may be obtained from the earliest sections. For example, in the chapter on analysis of variance (Chapter 15), the basics are taught in the first two sections; reading Sections 15.1 and 15.2 gives roughly the same material that most introductory statistics texts provide about this method. Sections 15.3–15.6 explain additional twists and other interesting applications.

The last block of chapters (Chapters 18–21) is mainly for the adventurous and the curious. These chapters introduce topics, such as likelihood, bootstrapping, and meta-analysis, that are commonly encountered in the biological and medical literature but not often mentioned in an introductory course. These chapters introduce the basic principles of each topic, discuss how the methods work, and point to where you might look to find out more.

A basic course could be taught by using only Chapters 1–17 and, within this subset of chapters, by stopping after Sections 5.8, 7.3, 8.4, 9.4, 12.6, 13.6, 15.2, 16.4, and 17.6 in their respective chapters. We suggest that all courses highlight specific topics covered only in the interleaves.

Each chapter ends with a series of problems that are designed to test students’ understanding of the concepts and the practical application of statistics. The problems are divided into Practice Problems and Assignment Problems.
Three cumulative review problem sets have been added to the Second Edition, placed approximately at locations convenient for midterm and final exam review. Short answers to all Practice Problems and Review Problems are provided in the back of the book; answers to the Assignment Problems are available to instructors only from the publisher. For a copy, contact Karen Carson at karen.carson@macmillan.com or 1-800-446-8923. Other teaching resources for the book are available online at http://whitlockschluter.zoology.ubc.ca.

A word about the data

The data used in this book are real, with a few well-marked exceptions. For the most part, these data were obtained directly from published papers. In some cases, we contacted the authors of articles, who generously provided the raw data for our use. Often, when raw data were not provided in the original paper, we resorted to obtaining data points by digitizing graphical depictions, such as scatter plots and histograms. Inevitably, the numbers we extracted differ slightly from the original numbers because of measurement error. In rare cases, we generated data by computer that matched the statistical summaries in the paper. In all cases, the results we present are consistent with the conclusions of the original papers. Most of the data sets are available online at http://whitlockschluter.zoology.ubc.ca.

1. “A computer lets you make more mistakes faster than any invention in human history—with the possible exceptions of handguns and tequila.” (Mitch Ratcliffe, in Technology Review, 1992).

Acknowledgments

This book would not have happened had the two of us been left to do it by ourselves. Many people contributed in substantial ways, and we are forever grateful. The clarity and accuracy of its contents were improved by the careful attention of a lot of generous readers, including Arianne Albert, Brad Anholt, Cécile Ané, Eric Baack, Arthur Berg, Chad Brassil, James Bryant, Martin Buntinas, Mark Butler, Carrie Case, C. Ray Chandler, Mark Clemens, Bradley Cosentino, Perry de Valpine, Flo Débarre, Abby Drake, Christiana Drake, Jonathan Dushoff, Steven George, Aleeza Gerstein, George Gilchrist, Brett Goodwin, Steven Green, Tim Hanson, Mike Hickerson, Lisa Hines, Darren Irwin, Nusrat Jahan, Philip Johns, Roger Johnson, Istvan Karsai, Robert Keen, John Kelly, Rex Kenner, Ben Kerr, Laura Kubatko, Joseph G. Kunkel, Bret Larget, Theo Light, Todd Livdahl, Heather Masonjones, Brian C. McCarthy, Kevin Middleton, Eli Minkoff, Robert Montgomerie, Spencer Muse, Courtney Murren, Claudia Neuhauser, Liam O’Brien, Maria Orive, Patrick C. Phillips, Jay Pitocchelli, Danielle Presgraves, James Robinson, Simon Robson, Michael Rosenberg, Noah Rosenberg, Nathan Rank, Bruce Rannala, Mark Rizzardi, Colin Rundel, Michael Russell, Ronald W. Russell, Andrew Schaffner, Andrea Schluter, James Scott, Joel Shore, John Soghigian, William Thomas, Michael Travisano, Thomas Valone, Bruce Walsh, Claus Wilke, Michael Wunder, Grace A. Wyngaard, and Sam Yeaman. Many of these good people read multiple chapters, and we thank all for their invaluable aid and considerate forbearance. Sally Otto, Allan Stewart-Oaten, and Maria Orive earned our undying gratitude by reading and commenting on the entire book. Of course, any errors that remain are our own fault; we didn’t always take everyone’s advice, even perhaps when we should have. If we have forgotten anyone, you have our thanks even if our memories are poor.
We owe a debt to the students of BIOL 300 at the University of British Columbia, who class-tested this book over the last several years. The book also benefited by class testing at several colleges and universities in courses by Helen Alexander, Brad Anholt, Eric Baack, Carol Baskauf, Peter Dunn, Matthias Foellmer, Marie-Josée Fortin, George Gilchrist, Michael Grant, Joe Hardin, Karen Harper, Scott Harrison, Mike Hickerson, Stephen Hudman, Nusrat Jahan, Laura Kapitula, Randy Larsen, Terri Lacourse, Susan Lehman, Todd Livdahl, Kelly McCoy, Jean Richardson, Simon Robson, Chris Sanford, Tom Short, Andrew Tierman, Steve Vamosi, Liette Vasseur, Jason Wilson, and Grace A. Wyngaard. George Gilchrist and his students gave us a very detailed and extremely helpful set of comments at a crucial stage of the book.

The following students and faculty from UBC and other institutions uncovered errors in earlier versions of the book: Jessica Beaubier, Chad Brassil, Edward Cheung, Lorena Cheung, Stephanie Cheung, Denise Choi, Peter Dunn, Maryam Garabedian, Samrad Ghavimi, Inderjit Grewal, Rutger Hermsen, Sarah Hamanishi, Anne Harris, Yunhyung Ku, Gurpreet Khaira, Joanne Kim, Jung Min Kim, Arleigh Lambert, Alexander Leung, Mira Li, Flora Liu, Dianna Louie, Johnston Mak, Giovanni Marchetti, Luke Miller, Diana Moanga, Sarah Neumann, Jarad Niemi, Tyler Ng, Ruth Ogbamichael, Jasmine Ono, Chris Parrish, Jason Pither, Marion Pearson, Jessica Pham, Trevor Schofield, Melissa Smith, Meredith Soon, Erin Stacey, Daniel Stoebel, Myriam Tremblay, Michelle Uzelac, John Wakeley, Hillary Ward, Chris Wong, Irene Yu, Fei Yuan, Anush Zakaryan, Qiong Zhang, Paul Zhou, and Jon-Paul Zacharias. We send special thanks to Nick Cox, who graciously read a previous printing of this book with extraordinary care.

A number of researchers freely sent us their original data, including Matt Arnegard, Angela Attwood, Audrey Barker-Plotkin, Cynthia Beall, Butch Brodie, Pamela Colosimo, Kimmo Eriksson, Kevin Fowler, Suzanne Gray, Chris Harley, Luke Harmon, Andrew Hendry, Peter Keightley, Fredrik Liljeros, Jean Thierry-Mieg, Jeffrey S. Mogil, Patrik Nosil, Margarita Ramos, Rick Relyea, Locke Rowe, Natarajan Singaravelan, Jake Socha, Jan Soumanan, Brian Starzomski, Richard Svanback, David Tilman, Andrew Trites, Neils van de Ven, Christoph von Beeren, Yitong Wang, Jason Weir, Jack Werren, and Martin Wikelski.

The book has benefited enormously from the efforts of a large team of able people: copyediting by John Murdzek (1st ed.), Gunder Hefta (1st ed.), and Kathi Townes (2nd ed.); photo research by Laura Roberts (1st ed.), Terri Wright, Austin MacRae, and Jen Simmons (2nd ed.); art by Tom Webster (1st ed.) and Lori Heckelman (2nd ed.); and design and composition by Mark Ong (1st ed.), Kathi Townes (2nd ed.), and Kristina Elliott (2nd ed.). Eric Baack has our special appreciation for slaving over the problem sets to create the answer keys, as does Holly Kindsvater, who carefully checked all the answer keys for the 2nd edition. Steven Green pointed out several improvements to the answer key and reviewed the answers to the review problem sets. Aleeza Gerstein corrected numerous errors with her careful proofreading. Finally, Ben Roberts deserves our greatest thanks, for all of his support and vision in making this book happen, and especially for Clause 24.
The book was started while MCW was supported by the Peter Wall Institute for Advanced Studies at UBC as a Distinguished Scholar in Residence, and the majority of the final stages of the book were written while he was a Sabbatical Scholar at the National Evolutionary Synthesis Center in North Carolina (NSF #EF-0423641). DS began working on the book while a visiting professor in Developmental Biology at Stanford University. The second edition was aided by a sabbatical leave at the University of Texas (MCW) and a Canada Council Senior Killam Fellowship (DS), which included a wonderful stay at La Selva Biological Station. The scholarly support and environment provided by each of these institutions were exceptional—and greatly appreciated.

Finally, we would like to give great thanks to all of the people who have taught us the most over the years. MCW would like to thank Dave McCauley, Mike Wade, Nick Barton, Ben Pierce, Kevin Fowler, Patrick Phillips, Sally Otto, and Betty Whitlock.

About the authors

Michael Whitlock is an evolutionary biologist and population geneticist known for his work on evolution in spatially structured populations. He is a Professor of Zoology at the University of British Columbia, where he has taught statistics to biology students since 1995. He is a fellow of the American Association for the Advancement of Science and of the American Academy of Arts and Sciences.

Dolph Schluter is Professor and Canada Research Chair in the Zoology Department and Biodiversity Research Center at the University of British Columbia. He is known for his research on the ecology and evolution of Galápagos finches and threespine stickleback. He is a fellow of the Royal Societies of Canada and London and a foreign member of the American Academy of Arts and Sciences.

1. Statistics and samples

[Chapter-opening photo: leafcutter ant]

1.1 What is statistics?

Biologists study the properties of living things. Measuring these properties is a challenge, though, because no two individuals from the same biological population are ever exactly alike. We can’t measure everyone in the population, either, so we are constrained by time and funding to limit our measurements to a sample of individuals drawn from the population. Sampling brings uncertainty to the project because, by chance, properties of the sample are not the same as the true values in the population. Thus, measurements made from a sample are affected by who happened to get sampled and who did not.

Statistics is the study of methods to describe and measure aspects of nature from samples.

Crucially, statistics gives us tools to quantify the uncertainty of these measures—that is, statistics makes it possible to determine the likely magnitude of their departure from the truth.

Statistics is about estimation, the process of inferring an unknown quantity of a target population using sample data. Properly applied, the tools for estimation allow us to approximate almost everything about populations using only samples. Examples range from the average flying speed of bumblebees, to the risks of exposure to cell phones, to the variation in beak size of finches on a remote Galápagos island. We can estimate the proportion of people with a particular disease who die per year and the fraction who recover when treated.

Estimation is the process of inferring an unknown quantity of a population using sample data.

Most importantly, we can assess differences between groups and relationships between variables.
For example, we can estimate the effects of different drugs on the probability of recovery, we can measure the association between the lengths of horns on male antelopes and their success at attracting mates, and we can determine by how much the survival of women and children during shipwrecks differs from that of men.

All of these quantities describing populations—namely, averages, proportions, measures of variation, and measures of relationship—are called parameters. Statistical methods tell us how best to estimate these parameters using our measurements of a sample. The parameter is the truth, and the estimate (also known as the statistic) is an approximation of the truth, subject to error. If we were able to measure every possible member of the population, we could know the parameter without error, but this is rarely possible. Instead, we use estimates based on incomplete data to approximate this true value. With the right statistical tools, we can determine just how good our approximations are.

A parameter is a quantity describing a population, whereas an estimate or statistic is a related quantity calculated from a sample.

Statistics is also about hypothesis testing. A statistical hypothesis is a specific claim regarding a population parameter. Hypothesis testing uses data to evaluate evidence for or against statistical hypotheses. Examples are “The mean effect of this new drug is not different from that of its predecessor,” and “Inhibition of the Tbx4 gene changes the rate of limb development in chick embryos.” Biological data usually get more interesting and informative if they can resolve competing claims about a population parameter.

Statistical methods have become essential in almost every area of biology—as indispensable as the PCR machine, calipers, binoculars, and the microscope. This book presents the ideas and methods needed to use statistics effectively, so that we can improve our understanding of nature. Chapter 1 begins with an overview of samples—how they should be gathered and the conclusions that can be drawn from them. We also discuss the types of variables that can be measured from samples, introducing terms that will be used throughout the book.

1.2 Sampling populations

Our ability to obtain reliable measures of population characteristics—and to assess the uncertainty of these measures—depends critically on how we sample populations. It is often at this early step in an investigation that the fate of a study is sealed, for better or worse, as Example 1.2 demonstrates.

EXAMPLE 1.2 Raining cats

In an article published in the Journal of the American Veterinary Medical Association, Whitney and Mehlhaff (1987) presented results on the injury rates of cats that had plummeted from buildings in New York City, according to the number of floors they had fallen. Fear not: no experimental scientist tossed cats from different altitudes to obtain the data for this study. Rather, the cats had fallen (or jumped) of their own accord. The researchers were merely recording the fates of the sample of cats that ended up at a veterinary hospital for repair. The damage caused by such falls was dubbed Feline High-Rise Syndrome, or FHRS.1

Not surprisingly, cats that fell five floors fared worse than those dropping only two, and those falling seven or eight floors tended to suffer even more (see Figure 1.2-1). But the astonishing result was that things got better after that. On average, the number of injuries was reduced in cats that fell more than nine floors.
This was true in every injury category. Their injury rates approached those of cats that had fallen only two floors! One cat fell 32 floors and walked away with only a chipped tooth.

FIGURE 1.2-1 A graph plotting the average number of injuries sustained per cat according to the number of stories fallen. Numbers in parentheses indicate number of cats. Modified from Diamond (1988).

This effect cannot be attributed to the ability of cats to right themselves so as to land on their feet—a cat needs less than one story to do that. The authors of the article put forth a more surprising explanation. They proposed that after a cat attains terminal velocity, which happens after it has dropped six or seven floors, the falling cat relaxes, and this change to its muscles cushions the impact when the cat finally meets the pavement.

Remarkable as these results seem, aspects of the sampling procedure raise questions. A clue to the problem is provided by the number of cats that fell a particular number of floors, indicated along the horizontal axis of Figure 1.2-1. No cats fell just one floor, and the number of cats falling increases with each floor from the second floor to the fifth. Yet, surely, every building in New York that has a fifth floor has a fourth floor, too, with open windows no less inviting. What can explain this curious trend?

To answer this, keep in mind that the data are a sample of cats. The study was not carried out on the whole population of cats that fell from New York buildings. Our strong suspicion is that the sample is biased. Not all fallen cats were taken to the vet, and the chance of a cat making it to the vet might have been affected by the number of stories it had fallen. Perhaps most cats that tumble out of a first- or second-floor window suffer only indignity, which is untreatable. Any cat appearing to suffer no physical damage from a fall of even a few stories may likewise skip a trip to the vet. At the other extreme, a cat fatally plunging 20 stories might also avoid a trip to the vet, heading to the nearest pet cemetery instead.

This example illustrates the kinds of questions of interpretation that arise if samples are biased. If the sample of cats delivered to the vet clinic is, as we suspect, a distorted subset of all the cats that fell, then the measures of injury rate and injury severity will also be distorted. We cannot say whether this bias is enough to cause the surprising downturn in injuries at a high number of stories fallen. At the very least, though, we can say that, if the chances of a cat making it to the vet depend on the number of stories fallen, the relationship between injury rate and number of floors fallen will be distorted.

Good samples are a foundation of good science. In the rest of this section, we give an overview of the concept of sampling, what we are trying to accomplish when we take a sample, and the inferences that are possible when researchers get it right.

Populations and samples

The first step in collecting any biological data is to decide on the target population. A population is the entire collection of individual units that a researcher is interested in. Ordinarily, a population is composed of a large number of individuals—so many that it is not possible to measure them all.
Examples of populations include
■ all cats that have fallen from buildings in New York City,
■ all the genes in the human genome,
■ all individuals of voting age in Australia,
■ all paradise flying snakes in Borneo, and
■ all children in Vancouver, Canada, suffering from asthma.

A sample is a much smaller set of individuals selected from the population.2 The researcher uses this sample to draw conclusions that, hopefully, apply to the whole population. Examples include
■ the fallen cats brought to one veterinary clinic in New York City,
■ a selection of 20 human genes,
■ all voters in an Australian pub,
■ eight paradise flying snakes caught by researchers in Borneo, and
■ a selection of 50 children in Vancouver, Canada, suffering from asthma.

A population is all the individual units of interest, whereas a sample is a subset of units taken from the population.

In most of the above examples, the basic unit of sampling is literally a single individual. However, in one example, the sampling unit was a single gene. Sometimes the basic unit of sampling is a group of individuals, in which case a sample consists of a set of such groups. Examples of units that are groups of individuals include a single family, a colony of microbes, a plot of ground in a field, an aquarium of fish, and a cage of mice. Scientists use several terms to indicate the sampling unit, such as unit, individual, subject, or replicate.

Properties of good samples

Estimates based on samples are doomed to depart somewhat from the true population characteristics simply by chance. This chance difference from the truth is called sampling error. The spread of estimates resulting from sampling error indicates the precision of an estimate. The lower the sampling error, the higher the precision. Larger samples are less affected by chance and so, all else being equal, larger samples will have lower sampling error and higher precision than smaller samples.

Sampling error is the difference between an estimate and the population parameter being estimated caused by chance.

Ideally, our estimate is accurate (or unbiased), meaning that the average of estimates that we might obtain is centered on the true population value. If a sample is not properly taken, measurements made on it might systematically underestimate (or overestimate) the population parameter. This is a second kind of error called bias.

Bias is a systematic discrepancy between the estimates we would obtain, if we could sample a population again and again, and the true population characteristic.

The major goal of sampling is to minimize sampling error and bias in estimates. Figure 1.2-2 illustrates these goals by analogy with shooting at a target. Each point represents an estimate of the population bull’s-eye (i.e., of the true characteristic). Multiple points represent different estimates that we might obtain if we could sample the population repeatedly. Ideally, all the estimates we might obtain are tightly grouped, indicating low sampling error, and they are centered on the bull’s-eye, indicating low bias. Estimates are precise if the values we might obtain are tightly grouped and highly repeatable, with different samples giving similar answers. Estimates are accurate if they are centered on the bull’s-eye. Estimates are imprecise, on the other hand, if they are spread out, and they are biased (inaccurate) if they are displaced systematically to one side of the bull’s-eye.
The shots (estimates) on the upper right-hand target in Figure 1.2-2 are widely spread out but centered on the bull’s-eye, so we say that the estimates are accurate but imprecise. The shots on the lower left-hand target are tightly grouped but not near the bull’s-eye, so we say that they are precise but inaccurate. Both precision and accuracy are important, because a lack of either means that an estimate is likely to differ greatly from the truth.

FIGURE 1.2-2 Analogy between estimation and target shooting. An accurate estimate is centered around the bull’s-eye, whereas a precise estimate has low spread.

With sampling, we also want to be able to quantify the precision of an estimate. There are several quantities available to measure precision, which we discuss in Chapter 4.

The sample of cats in Example 1.2 falls short in achieving some of these goals. If uninjured and dead cats do not make it to the pet hospital, then estimates of injury rate are biased. Injury rates for cats falling only two or three floors are likely to be overestimated, whereas injury rates for cats falling many stories might be underestimated.

Random sampling

The common requirement of the methods presented in this book is that the data come from a random sample. A random sample is a sample from a population that fulfills two criteria.

In a random sample, each member of a population has an equal and independent chance of being selected.

First, every unit in the population must have an equal chance of being included in the sample. This is not as easy as it sounds. A botanist estimating plant growth might be more likely to find the taller individual plants or to collect those closer to the road. Some members of animal or human populations may be difficult to collect because they are shy of traps, never answer the phone, ignore questionnaires, or live at greater depths or distances than other members. These hard-to-sample individuals might differ in their characteristics from those of the rest of the population, so underrepresenting them in samples would lead to bias.

Second, the selection of units must be independent. In other words, the selection of any one member of the population must in no way influence the selection of any other member.3 This, too, is not easy to ensure. Imagine, for example, that a sample of adults is chosen for a survey of consumer preferences. Because of the effort required to contact and visit each household to conduct an interview, the lazy researcher is tempted to record the preferences of multiple adults in each household and add their responses to those of other adults in the sample. This approach violates the criterion of independence, because the selection of one individual has increased the probability that another individual from the same household will also be selected. This will distort the sampling error in the data if individuals from the same household have preferences more similar to one another than would individuals randomly chosen from the population at large. With non-independent sampling, our sample size is effectively smaller than we think. This, in turn, will cause us to miscalculate the precision of the estimates.

In general, the surest way to minimize bias and allow sampling error to be quantified is to obtain a random sample.4

Random sampling minimizes bias and makes it possible to measure the amount of sampling error.

How to take a random sample

Obtaining a random sample is easy in principle but can be challenging in practice.
A random sample can be obtained by using the following procedure:
1. Create a list of every unit in the population of interest, and give each unit a number between one and the total population size.
2. Decide on the number of units to be sampled (call this number n).
3. Using a random-number generator,5 generate n random integers between one and the total number of units in the population.
4. Sample the units whose numbers match those produced by the random-number generator.

An example of this process is shown in Figure 1.2-3. In both panels of the figure, we’ve drawn the locations of all 5699 trees present in 2001 in a carefully mapped tract of Harvard Forest in Massachusetts, USA (Barker-Plotkin et al. 2006). Every tree in this population has a unique number between 1 and 5699 to identify it. We used a computerized random-number generator to pick n = 20 random integers between 1 and 5699, where 20 is the desired sample size. The 20 random integers, after sorting, are as follows:

156, 167, 232, 246, 826, 1106, 1476, 1968, 2084, 2222, 2223, 2284, 2790, 2898, 3103, 3739, 4315, 4978, 5258, 5500

FIGURE 1.2-3 The locations of all 5699 trees present in the Prospect Hill Tract of Harvard Forest in 2001 (green circles). The red dots in the left panel are a random sample of 20 trees. The squares in the right panel are a random sample of 20 quadrats (each 20 feet on a side).

These 20 randomly chosen trees are identified by red dots in the left panel of Figure 1.2-3.

How realistic are the requirements of a random sample? Creating a numbered list of every individual member of a population might be feasible for patients recorded in a hospital database, for children registered in an elementary-school system, or for some other populations for which a registry has been built. The feat is impractical for most plant populations, however, and unimaginable for most populations of animals or microbes. What can be done in such cases?

One answer is that the basic unit of sampling doesn’t have to be a single individual—it can be a group, instead. For example, it is easier to use a map to divide a forest tract into many equal-sized blocks or plots and then to create a numbered list of these plots than it is to produce a numbered list of every tree in the forest. To illustrate this second approach, we divided the Harvard Forest tract into 836 plots of 400 square feet each. With the aid of a random-number generator, we then identified a random sample of 20 plots, which are identified by the squares in the right panel of Figure 1.2-3.

The trees contained within a random sample of plots do not constitute a random sample of trees, for the same reason that all of the adults inhabiting a random sample of households do not constitute a random sample of adults. Trees in the same plot are not sampled independently; this can cause problems if trees growing next to one another in the same plot are more similar (or more different) in their traits than trees chosen randomly from the forest. The data in this case must be handled carefully. A simple technique is to take the average of the measurements of all of the individuals within a unit as the single independent observation for that unit.

Random numbers should always be generated with the aid of a computer. Haphazard numbers made up by the researcher are not likely to be random (see Example 19.1). Most spreadsheet programs and statistical software packages on the computer include random-number generators. The sketch below shows one way to carry out these steps with such a generator in R.
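As a minimal sketch of the four-step procedure in R (the language for which this book’s web site provides example scripts), the population and sample sizes below follow the Harvard Forest example, but the measurement vectors (tree_height, plot_id) are hypothetical placeholders rather than data from the study.

```r
# Steps 1-3: number the units 1 through N, then draw n of those numbers
# at random without replacement, so that no tree can be chosen twice.
set.seed(20)     # fixes the random draw so this example can be repeated
N <- 5699        # total number of mapped trees (the population)
n <- 20          # desired sample size
chosen <- sort(sample(1:N, size = n, replace = FALSE))
chosen           # the n random tree numbers, in sorted order

# Step 4: measure only the units whose numbers were drawn, e.g.,
# assuming tree_height is a vector of measurements for all N trees:
# sampled_heights <- tree_height[chosen]

# If the sampling unit is a plot (quadrat) rather than a single tree,
# average the trees within each sampled plot so that each plot
# contributes one independent observation:
# plot_means <- tapply(tree_height, plot_id, mean)
```

Note that each fresh draw would produce different tree numbers than the 20 listed above; set.seed() only makes a particular draw reproducible.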
The sample of convenience

One undesirable alternative to the random sample is the sample of convenience, a sample based on individuals that are easily available to the researcher. The researcher must assume that a sample of convenience is unbiased and independent like a random sample, but there is no way to guarantee it.

A sample of convenience is a collection of individuals that are easily available to the researcher.

The main problem with the sample of convenience is bias, as the following examples illustrate:

■ If the only cats measured are those brought to a veterinary clinic, then the injury rate of cats that have fallen from high-rise buildings is likely to be underestimated. Uninjured and fatally injured cats are less likely to make it to the vet and into the sample.

■ The spectacular collapse of the North Atlantic cod fishery in the last century was caused in part by overestimating cod densities in the sea, which led to excessive allowable catches by fishing boats (Walters and Maguire 1996). Density estimates were too high because they relied heavily on the rates at which the fishing boats were observed to capture cod. However, the fishing boats tended to concentrate in the few remaining areas where cod were still numerous, and they did not randomly sample the entire fishing area (Rose and Kulka 1999).

A sample of convenience might also violate the assumption of independence if individuals in the sample are more similar to one another in their characteristics than individuals chosen randomly from the whole population. This is likely if, for example, the sample includes a disproportionate number of individuals who are friends or who are related to one another.

Volunteer bias

Human studies in particular must deal with the possibility of volunteer bias, which is a bias resulting from a systematic difference between the pool of volunteers (the volunteer sample) and the population to which they belong. The problem arises when the behavior of the subjects affects their chance of being sampled. In a large experiment to test the benefits of a polio vaccine, for example, participating schoolchildren were randomly chosen to receive either the vaccine or a saline solution (serving as the control). The vaccine proved effective, but the rate at which children in the saline group contracted polio was found to be higher than in the general population. Perhaps parents of children who had not been exposed to polio prior to the study, and who therefore had no immunity, were more likely to volunteer their children for the study than parents of kids who had been exposed (Brownlee 1955, Bland 2000).

Compared with the rest of the population, volunteers might be

■ more health conscious and more proactive;
■ low-income (if volunteers are paid);
■ more ill, particularly if the therapy involves risk, because individuals who are dying anyway might try anything;
■ more likely to have time on their hands (e.g., retirees and the unemployed are more likely to answer telephone surveys);
■ more angry, because people who are upset are sometimes more likely to speak up; or
■ less prudish, because people with liberal opinions about sex are more likely to speak to surveyors about sex.

Such differences can cause substantial bias in the results of studies. Bias can be minimized by careful handling of the volunteer sample, but the resulting sample is still inferior to a random sample.
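A toy simulation shows how volunteer bias distorts an estimate. The population size, smoking proportion, and response probabilities below are invented for illustration: suppose we wish to estimate the proportion of smokers in a population in which smokers are only half as likely as nonsmokers to respond to a survey.

import random

random.seed(1)

# Hypothetical population of 100,000 people; True marks a smoker.
# The true proportion of smokers is 0.20.
population = [random.random() < 0.20 for _ in range(100_000)]

# A random sample: every individual has an equal chance of inclusion.
random_sample = random.sample(population, 1000)

# A volunteer sample: nonsmokers respond with probability 0.4,
# smokers with probability 0.2 (made-up numbers).
volunteer_sample = []
while len(volunteer_sample) < 1000:
    person = random.choice(population)
    if random.random() < (0.2 if person else 0.4):
        volunteer_sample.append(person)

print(sum(random_sample) / 1000)     # close to the true value, 0.20
print(sum(volunteer_sample) / 1000)  # systematically too low, near 0.11

No matter how large the volunteer sample is made, its estimate stays near 0.11 rather than 0.20. More data do not cure bias.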
Data in the real world

In this book we use real data, hard-won from observational or experimental studies in the lab and field and published in the literature. Do the samples on which the studies are based conform to the ideals outlined above? Alas, the answer is often no. Random samples are desired but often not achieved by researchers working in the trenches. Sometimes, the only data available are a sample of convenience or a volunteer sample, as the falling cats in Example 1.2 demonstrate. Scientists deal with this problem by taking every possible step to obtain random samples. If random sampling is impossible, then it is important to acknowledge that the problem exists and to point out where biases might arise in the study. Ultimately, further studies should be carried out that attempt to control for any sampling problems evident in earlier work.

Types of data and variables

With a sample in hand, we can begin to measure variables. A variable is any characteristic or measurement that differs from individual to individual. Examples include running speed, reproductive rate, and genotype. Estimates (e.g., average running speed of a random sample of 10 lizards) are also variables, because they differ by chance from sample to sample. Data are the measurements of one or more variables made on a sample of individuals.

Variables are characteristics that differ among individuals.

Categorical and numerical variables

Variables can be categorical or numerical. Categorical variables describe membership in a category or group. They describe qualitative characteristics of individuals that do not correspond to a degree of difference on a numerical scale. Categorical variables are also called attribute or qualitative variables.

Categorical data are qualitative characteristics of individuals that do not have magnitude on a numerical scale.

Examples of categorical variables include

■ survival (alive or dead),
■ sex chromosome genotype (e.g., XX, XY, XO, XXY, or XYY),
■ method of disease transmission (e.g., water, air, animal vector, or direct contact),
■ predominant language spoken (e.g., English, Mandarin, Spanish, Indonesian, etc.),
■ life stage (e.g., egg, larva, juvenile, subadult, or adult),
■ snakebite severity score (e.g., minimal severity, moderate severity, or very severe), and
■ size class (e.g., small, medium, or large).

A categorical variable is nominal if the different categories have no inherent order. Nominal means “name.” Sex chromosome genotype, method of disease transmission, and predominant language spoken are nominal variables. In contrast, the values of an ordinal categorical variable can be ordered, but unlike numerical data, the magnitude of the difference between consecutive values is not known. Ordinal means “having an order.” Life stage, snakebite severity score, and size class are ordinal categorical variables.

A variable is numerical when measurements of individuals are quantitative and have magnitude. These variables are numbers. Measurements that are counts, dimensions, angles, rates, and percentages are numerical. Examples of numerical variables include

■ core body temperature (e.g., degrees Celsius, °C),
■ territory size (e.g., hectares),
■ cigarette consumption rate (e.g., average number per day),
■ age at death (e.g., years),
■ number of mates, and
■ number of amino acids in a protein.

Numerical data are quantitative measurements that have magnitude on a numerical scale.

Numerical data are either continuous or discrete.
Continuous numerical data can take on any real-number value within some range. Between any two values of a continuous variable, an infinite number of other values are possible. In practice, continuous data are rounded to a predetermined number of digits, set for convenience or by the limitations of the instrument used to take the measurements. Core body temperature, territory size, and cigarette consumption rate are continuous variables. In contrast, discrete numerical data come in indivisible units. Number of amino acids in a protein and numerical rating of a statistics professor in a student evaluation are discrete numerical measurements. Number of cigarettes consumed on a specific day is a discrete variable, but the rate of cigarette consumption is a continuous variable when calculated as an average number per day over a large number of days. In practice, discrete numerical variables are often analyzed as though they were continuous, if there is a large number of possible values. Just because a variable is indexed by a number does not mean it is a numerical variable. Numbers might also be used to name categories (e.g., family 1, family 2, etc.). Numerical data can be reduced to categorical data by grouping, though the result contains less information (e.g., “above average” and “below average”). Explanatory and response variables One major use of statistics is to relate one variable to another, by examining associations between variables and differences between groups. Measuring an association is equivalent to measuring a difference, statistically speaking. Showing that “the proportion of survivors differs between treatment categories” is the same as showing that the variables “survival” and “treatment” are associated. Often when association between two variables is investigated, a goal is to assess how well one of the variables, deemed the explanatory variable, predicts or affects the other variable, called the response variable. When conducting an experiment, the treatment variable (the one manipulated by the researcher) is the explanatory variable, and the measured effect of the treatment is the response variable. For example, the administered dose of a toxin in a toxicology experiment would be the explanatory variable, and organism survival would be the response variable. When neither variable is manipulated by the researcher, their association might nevertheless be described by the “effect” of one of the variables (the explanatory) on the other (the response), even though the association itself is not direct evidence for causation. For example, when exploring the possibility that high blood pressure affects the risk of stroke in a sample of people, blood pressure is the explanatory variable and incidence of strokes is the response variable. When natural groups of organisms, such as populations or species, are compared in some measurement, such as body mass, the group variable (population or species) is typically the explanatory variable and the measurement is the response variable. In more complicated studies involving more than two variables, there may be more than one explanatory or response variable. Sometimes you will hear variables referred to as “independent” and “dependent.” These are the same as explanatory and response variables, respectively. Strictly speaking, if one variable depends on the other, then neither is independent, so we prefer to use explanatory and response throughout this book. 
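As noted above, a numerical variable can be reduced to a categorical one by grouping, at the cost of information. A minimal sketch in Python makes this concrete (the temperature cutoffs are invented for illustration):

# Reduce a numerical variable (core body temperature, in degrees C)
# to an ordinal categorical variable. Cutoffs are hypothetical.
temperatures = [36.2, 36.9, 37.4, 38.6, 39.1, 36.5]

def category(t):
    if t < 36.5:
        return "below normal"
    elif t < 37.5:
        return "normal"
    else:
        return "above normal"

print([category(t) for t in temperatures])
# ['below normal', 'normal', 'normal', 'above normal',
#  'above normal', 'normal']

The grouped variable is ordinal: its categories have a natural order, but the magnitude of the difference between, say, “normal” and “above normal” is no longer recorded.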
Frequency distributions and probability distributions

Different individuals in a sample will have different measurements. We can see this variability with a frequency distribution. The frequency of a specific measurement in a sample is the number of observations having a particular value of the measurement. The frequency distribution shows how often each value of the variable occurs in the sample.

The frequency distribution describes the number of times each value of a variable occurs in a sample.

Figure 1.4-1 shows the frequency distribution for the measured beak depths of a sample of 100 finches from a Galápagos island population.

FIGURE 1.4-1 The frequency distribution of beak depths in a sample of 100 finches from a Galápagos island population (Boag and Grant 1984). The vertical axis indicates the frequency, the number of observations in each 0.5-mm interval of beak depth. Photo: the large-beaked ground finch on the Galápagos Islands.

We use the frequency distribution of a sample to inform us about the distribution of the variable (beak depth) in the population from which it came. Looking at a frequency distribution gives us an intuitive understanding of the variable. For example, we can see which values of beak depth are common and which are rare, we can get an idea of the average beak depth, and we can start to understand how variable beak depth is among the finches living on the island.

The distribution of a variable in the whole population is called its probability distribution. The real probability distribution of a population in nature is almost never known. Researchers typically use theoretical probability distributions to approximate the real probability distribution. For a continuous variable like beak depth, the distribution in the population is often approximated by a theoretical probability distribution known as the normal distribution. The normal distribution drawn in Figure 1.4-2, for example, approximates the probability distribution of beak depths in the finch population from which the sample of 100 birds was drawn.

FIGURE 1.4-2 A normal distribution. This probability distribution is often used to approximate the distribution of a variable in the population from which a sample has been drawn.

The normal distribution is the familiar “bell curve.” It is the most important probability distribution in all of statistics. You’ll learn a lot more about it in Chapter 10. Most of the methods presented in this book for analyzing data depend on the normal distribution in some way.

Types of studies

Data in biology are obtained from either an experimental study or an observational study. In an experimental study, the researcher assigns different treatment groups or values of an explanatory variable randomly to the individual units of study. A classic example is the clinical trial, where different treatments are assigned randomly to patients in order to compare responses. In an observational study, on the other hand, the researcher has no control over which units fall into which groups.

A study is experimental if the researcher assigns treatments randomly to individuals, whereas a study is observational if the assignment of treatments is not made by the researcher.

Studies of the health consequences of cigarette smoking in humans are all observational studies, because it is ethically impossible to assign smoking and nonsmoking treatments to human beings to assess the effects of smoking. The individuals in the sample have made the decision themselves about whether to smoke.
The only experimental studies of the health consequences of smoking have been carried out on nonhuman subjects, such as mice, where researchers can assign smoking and nonsmoking treatments randomly to individuals.

The distinction between experimental studies and observational studies is that experimental studies can determine cause-and-effect relationships between variables, whereas observational studies can only point to associations. An association between smoking and lung cancer might be due to the effects of smoking per se, or perhaps to an underlying predisposition to lung cancer in those individuals prone to smoking. It is difficult to distinguish these alternatives with observational studies alone. For this reason, experimental studies of the health hazards of smoking in nonhuman animals have helped make the case that cigarette smoking is dangerous to human health.

But experimental studies are not always possible, even on animals. Smoking in humans, for example, is also associated with a higher suicide rate (Hemmingsson and Kriebel 2003). Is this association caused by the effects of smoking, or is it caused by the effects of some other variable?

Just because a study was carried out in the laboratory does not mean that the study is an experimental study in the sense described here. A complex laboratory apparatus and careful conditions may be necessary to obtain measurements of interest, but such a study is still observational if the assignment of treatments is out of the control of the researcher.

Summary

■ Statistics is the study of methods for measuring aspects of populations from samples and for quantifying the uncertainty of the measurements.
■ Much of statistics is about estimation, which infers an unknown quantity of a population using sample data.
■ Statistics also allows hypothesis testing, a method to determine how well hypotheses about a population parameter fit the sample data.
■ Sampling error is the chance difference between an estimate describing a sample and the corresponding parameter of the whole population. Bias is a systematic discrepancy between an estimate and the population quantity.
■ The goals of sampling are to increase the accuracy and precision of estimates and to ensure that it is possible to quantify precision.
■ In a random sample, every individual in a population has the same chance of being selected, and the selection of individuals is independent.
■ A sample of convenience is a collection of individuals easily available to a researcher, but it is not usually a random sample.
■ Volunteer bias is a systematic discrepancy in a quantity between the pool of volunteers and the population.
■ Variables are measurements that differ among individuals.
■ Variables are either categorical or numerical. A categorical variable describes which category an individual belongs to, whereas a numerical variable is expressed as a number.
■ The frequency distribution describes the number of times each value of a variable occurs in a sample. A probability distribution describes the number of times each value occurs in a population. Probability distributions in populations can often be approximated by a normal distribution.
■ In studies of association between two variables, one variable is typically used to predict the value of another variable and is designated as the explanatory variable. The other variable is designated as the response variable.
■ In experimental studies, the researcher is able to assign subjects randomly to different treatments or groups.
■ In observational studies, the assignment of individuals to treatments is not controlled by the researcher.

PRACTICE PROBLEMS

Answers to the practice problems are provided at the end of the book, starting on page 747.

1. Which of the following numerical variables are continuous? Which are discrete?
a. Number of injuries sustained in a fall
b. Fraction of birds in a large sample infected with avian flu virus
c. Number of crimes committed by a randomly sampled individual
d. Logarithm of body mass

2. The peppered moth (Biston betularia) occurs in two types: peppered (speckled black and white) and melanic (black). A researcher wished to measure the proportion of melanic individuals in the peppered moth population in England, to examine how this proportion changed from year to year in the past. To accomplish this, she photographed all the peppered moth specimens available in museums and large private collections and grouped them by the year in which they had been collected. Based on this sample, she calculated the proportion of melanic individuals in every year. The people who collected the specimens, she knew, would prefer to collect whichever type was rarest in any given year, since those would be the most valuable.
a. Can the specimens from any given year be considered a random sample from the moth population?
b. If not a random sample, what type of sample is it?
c. What type of error might be introduced by the sampling method when estimating the proportion of melanic moths?

3. What feature of an estimate—precision or accuracy—is most strongly affected when individuals differing in the variable of interest do not have an equal chance of being selected?

4. In a study of stress levels in U.S. army recruits stationed in Iraq, researchers obtained a complete list of the names of recruits in Iraq at the time of the study. They listed the recruits alphabetically and then numbered them consecutively. One hundred random numbers between one and the total number of recruits were then generated using a random-number generator on a computer. The 100 recruits whose numbers corresponded to those generated by the computer were interviewed for the study.
a. What is the population of interest in this study?
b. The 100 recruits interviewed were randomly sampled as described. Is the sample affected by sampling error? Explain.
c. What are the main advantages of random sampling in this example?
d. What effect would a larger sample size have had on sampling error?

5. An important quantity in conservation biology is the number of plant and animal species inhabiting a given area. To survey the community of small mammals inhabiting Kruger National Park in South Africa, a large series of live traps was placed randomly throughout the park for one week in the main dry season of 2004. Traps were set each evening and checked the following morning. Individuals caught were identified, tagged (so that new captures could be distinguished from recaptures), and released. At the end of the survey, the total number of small mammal species in the park was estimated by the total number of species captured in the survey.
a. What is the parameter being estimated in the survey?
b. Is the sample of individuals captured in the traps likely to be a random sample? Why or why not? In your answer, address the two criteria that define a sample as random.
c. Is the number of species in the sample likely to be an unbiased estimate of the total number of small mammal species in the park? Why or why not?
6. In a recent study, researchers took electrophysiological measurements from the brains of two rhesus macaques (monkeys). Forty neurons were tested in each monkey, yielding a total of 80 measurements.
a. Do the 80 neurons constitute a random sample? Why or why not?
b. If the 80 measurements were analyzed as though they constituted a random sample, what consequences would this have for the estimate of the measurement in the monkey population?

7. Identify which of the following variables are discrete and which are continuous:
a. Number of warts on a toad
b. Survival time after poisoning
c. Temperature of porridge
d. Number of bread crumbs in 10 meters of trail
e. Length of wolves’ canines

8. A study was carried out in women to determine whether the psychological consequences of having an abortion differ from those experienced by women who have lost their fetuses by other causes at the same stage of pregnancy.
a. Which is the explanatory variable in this study, and which is the response variable?
b. Was this an observational study or an experimental study? Explain.

9. For each of the following studies, say which is the explanatory variable and which is the response variable. Also, say whether the study is observational or experimental.
a. Forestry researchers wanted to compare the growth rates of trees growing at high altitude to those of trees growing at low altitude. They measured growth rates using the space between tree rings in a set of trees harvested from a natural forest.
b. Researchers randomly assigned diabetes patients to two groups. In the first group, the patients received a new drug, tasploglutide, whereas patients in the other group received standard treatment without the new drug. The researchers compared the rate of insulin release in the two groups.
c. Psychologists tested whether the frequency of illegal drug use differs between people suffering from schizophrenia and those not having the disease. They measured drug use in a group of schizophrenia patients and compared it with that in a similar-sized group of randomly chosen people.
d. Spinner Hansen et al. (2008) studied a species of spider whose females often eat males that are trying to mate with them. The researchers removed a leg from each male spider in one group (to make them weaker and more vulnerable to being eaten) and left the males in another group undamaged. They studied whether survival of males in the two groups differed during courtship.
e. Bowen et al. (2012) studied the effects of advanced communication therapy for patients whose communication skills had been affected by previous strokes. They randomly assigned two therapies to stroke patients. One group received advanced communication therapy and the other received only social visits without formal therapy. Both groups otherwise received normal, best-practice care. After six months, the communication ability (as measured by a standardized quantitative test score) was measured on all patients.

10. Each of the examples (a–e) in problem 9 involves estimating or testing an association between two variables. For each of the examples, list the two variables and state whether each is categorical or numerical.

11. A random sample of 500 households was identified in a major North American city using the municipal voter registration list. Five hundred questionnaires went out, directed at one adult in each household, and surveyed respondents about attitudes regarding the municipal recycling program. Eighty of the 500 surveys were filled out and returned to the researchers.
a. Can the 80 households that returned questionnaires be regarded as a random sample of households? Explain.
b. What type of bias might affect the survey outcome?

12. State whether the following represent cases of estimation or hypothesis testing.
a. A random sample of quadrats in Olympic National Forest is taken to determine the average density of Ensatina salamanders.
b. A study is carried out to determine whether there is extrasensory perception (ESP), by counting the number of cards guessed correctly by a subject isolated from a second person who is drawing cards randomly from a normal card deck. The number of correct guesses is compared with the number we would expect by chance if there were no such thing as ESP.
c. A trapping study measures the rate of fruit fall in forest clear-cuts.
d. An experiment is conducted to calculate the optimal number of calories per day to feed captive sugar gliders (Petaurus breviceps) to maintain normal body mass and good health.
e. A clinical trial is carried out to determine whether taking large doses of vitamin C benefits the health of advanced cancer patients.
f. An observational study is carried out to determine whether hospital emergency room admissions increase during nights with a full moon compared with other nights.

13. A researcher dissected the retinas of 20 randomly sampled fish belonging to each of two subspecies of guppy in Venezuela. Using a sophisticated laboratory apparatus, he measured the two groups of fish to find the wavelengths of visible light to which the cones of the retina were most sensitive. The goal was to explore whether fish from the two subspecies differed in the wavelength of light to which they were most sensitive.
a. What are the two variables being associated in this study?
b. Which is the explanatory variable and which is the response variable?
c. Is this an experimental study or an observational study? Why?

ASSIGNMENT PROBLEMS

14. Identify whether the following variables are numerical or categorical. If numerical, state whether the variable is discrete or continuous. If categorical, state whether the categories have a natural order (ordinal) or not (nominal).
a. Number of sexual partners in a year
b. Petal area of rose flowers
c. Heartbeats per minute of a Tour de France cyclist, averaged over the duration of the race
d. Birth weight
e. Stage of fruit ripeness (e.g., underripe, ripe, or overripe)
f. Angle of flower orientation relative to position of the sun
g. Tree species
h. Year of birth
i. Gender

15. In the vermilion flycatcher, the males are brightly colored and sing frequently and prominently. Females are more dull-colored and make less sound. In a field study of this bird, a researcher attempted to estimate the fraction of individuals of each sex in the population. She based her estimate on the number of individuals of each sex detected while walking through suitable habitat. Is her sample of birds detected likely to be a random sample? Why or why not?

16. Not all telephone polls carried out to estimate voter or consumer preferences make calls to cell phones. One reason is that in the USA, automated calls (“robocalls”) to cell phones are not permitted, and interviews conducted by humans are more costly.
a. How might the strategy of leaving out cell phones affect the goal of obtaining a random sample of voters or consumers?
b. Which criterion of random sampling is most likely to be violated by the problems you identified in part (a): equal chance of being selected, or the independence of the selection of individuals?
c. Which attribute of estimated consumer preference is most affected by the problem you identified in (a): accuracy or precision?

17. The average age of piñon pine trees in the coast ranges of California was investigated by placing 500 10-hectare plots randomly on a distribution map of the species in California using a computer. Researchers then found the location of each random plot in the field, where they measured the age of every piñon pine tree within each of the 10-hectare plots. The average age within the plot was used as the unit measurement. These unit measurements were then used to estimate the average age of California piñon pines.
a. What is the population of interest in this study?
b. Why did the researchers take an average of the ages of trees within each plot as their unit measurement, rather than combine into a single sample the ages of all the trees from all the plots?

18. Refer to problem 17.
a. Is the estimate of age based on 500 plots influenced by sampling error? Why?
b. How would the sampling error of the estimate of mean age change if the investigators had used a sample of only 100 plots?

19. In each of the following examples, indicate which variable is the explanatory variable and which is the response variable.
a. The anticoagulant warfarin is often used as a pesticide against house mice, Mus musculus. Some populations of the house mouse have acquired a mutation in the vkorc1 gene from hybridizing with the Algerian mouse, M. spretus (Song et al. 2011). In the Algerian mice, this gene confers resistance to warfarin. In a hypothetical follow-up study, researchers collected a sample of house mice to determine whether presence of the vkorc1 mutation is associated with warfarin resistance in house mice as well. They fed warfarin to all the mice in the sample and compared survival between the individuals possessing the mutation and those not possessing the mutation.
b. Cooley et al. (2009) randomly assigned either of two treatments, naturopathic care (diet counseling, breathing techniques, vitamins, and a herbal medicine) or standardized psychotherapy (psychotherapy with breathing techniques and a placebo added), to 81 individuals having moderate to severe anxiety. Anxiety scores decreased an average of 57% in the naturopathic group and 31% in the psychotherapy group.
c. Individuals highly sensitive to rewards tend to experience more food cravings and are more likely to be overweight or develop eating disorders than other people. Beaver et al. (2006) recruited 14 healthy volunteers and scored their reward sensitivity using a questionnaire (they were asked to answer yes or no to questions like: “I’m always willing to try something new if I think it will be fun”). The subjects were then presented with images of appetizing foods (e.g., chocolate cake, pizza) while activity of their fronto-striatal-amygdala-midbrain circuitry was measured using functional MRI. Reward sensitivity of subjects was found to correlate with brain activity in response to the images.
d. Endostatin is a naturally occurring protein in humans and mice that inhibits the growth of blood vessels. O’Reilly et al. (1997) investigated its effects on the growth of cancer tumors, whose growth and spread require blood vessel proliferation.
Mice having lung carcinoma tumors were randomly divided into groups that were treated with doses of 0, 2.5, 10, or 20 mg/kg of endostatin injected once daily. They found that higher doses of endostatin led to greater inhibition of tumor growth.

20. For each of the studies presented in problem 19, indicate whether the study is an experimental or observational study.

21. In a study of heart rate in ocean-diving birds, researchers harnessed 10 randomly sampled, wild-caught cormorants to a laboratory contraption that monitored vital signs. Each cormorant was subjected to six artificial “dives” over the following week (one dive per day). A dive consisted of rapidly immersing the head of the bird in water by tipping the harness. In this way, a sample of 60 measurements of heart rate in diving birds was obtained. Do these 60 measurements represent a random sample? Why or why not?

22. Researchers sent out a survey to U.S. war veterans that asked a series of questions, including whether individuals surveyed were smokers or nonsmokers (Seltzer et al. 1974). They found that nonsmokers were 27% more likely than smokers to respond to the survey within 30 days (based on the much larger number of smokers and nonsmokers who eventually responded). Hypothetically, if the study had ended after 30 days, what effect would this have had on the estimate of the proportion of veterans who smoke? (Use terminology you learned in this chapter to describe the effect.)

23. During World War II, the British Royal Air Force estimated the density of bullet holes on different sections of planes returning to base from aerial sorties. Their goal was to use this information to determine which plane sections most needed additional protective shields. (It was not possible to reinforce the whole plane, because it would weigh too much.) They found that the density of holes was highest on the wings and lowest on the engines and near the cockpit, where the pilot sits (their initial conclusion, that therefore the wings should be reinforced, was later shown to be mistaken). What is the main problem with the sample: bias or large sampling error? What part of the plane should have been reinforced?

24. In a study of diet preferences in leafcutter ants, a researcher presented 20 randomly chosen ant colonies with leaves from the two most common tree species in the surrounding forest. The leaves were placed in piles of 100, one pile for each tree species, close to colony entrances. Leaves were cut so that each was small enough to be carried by a single ant. After 24 hours, the researcher returned and counted the number of leaves remaining of the original 100 of each species. Some of the results are shown in the following table.

Tree species          Number of leaves removed
Spondius mombin       1561
Sapium thelocarpum     851
Total                 2412

Using these results, the researcher estimated the proportion of Spondius leaves taken as 0.65 and concluded that the ants have a preference for leaves of this species.
a. Identify the two variables whose association is displayed in the table. Which is the explanatory variable and which is the response variable? Are they numerical or categorical?
b. Why do the 2412 leaves used in the calculation of the proportion not represent a random sample?
c. Would treating the 2412 leaves as a random sample most likely affect the accuracy of the estimate of diet preference or the precision of the estimate?
d. If not the leaves, what units were randomly sampled in the study?
25. Garaci et al. (2012) examined a sample of people with and without multiple sclerosis (MS) to test the controversial idea that the disease is caused by blood flow restriction resulting from a vein condition known as chronic cerebrospinal venous insufficiency (CCSVI). Of 39 randomly sampled patients with MS, 25 were found to have CCSVI and 14 were not. Of 26 healthy control subjects, 14 were found to have CCSVI and 12 were not. The researchers found an association between CCSVI and MS.
a. What is the explanatory variable and what is the response variable?
b. Is this an experimental study or an observational study?
c. Where might hypothesis testing have been used in the study?

INTERLEAF 1
Biology and the history of statistics

Sir Ronald Fisher

The formal study of probability began in the mid-17th century when Blaise Pascal and Pierre de Fermat started to describe the mathematical rules for determining the best gambling strategies. Gambling and insurance continued to motivate the development of probability for the next couple of centuries. The application of probability to data did not happen until much later.

The importance of variation in the natural world, and, by extension, in samples from that world, was made obvious by Charles Darwin’s theory of natural selection. Darwin highlighted the importance of variation in biological systems as the raw material of evolution. Early followers of Darwin saw the need for quantitative descriptions of variation and the need to incorporate the effects of sampling error in biology. This led to the development of modern statistics. In many ways, therefore, modern statistics was an offshoot of evolutionary biology.

One of the first pioneers in statistical data analysis was Francis Galton, who began to apply probability thinking to the study of all sorts of data. Galton was a real polymath, thinking about nearly everything and collecting and analyzing data at every chance. He said, “Whenever you can, count.” He invented the study of fingerprints, he tested whether prayer increased the life span of preachers compared with others of the middle class, and he quantified the heritable nature of many important traits. He once recorded his idea of the attractiveness of women seen from the window of a train headed from London to Glasgow, finding that “attractiveness” (at least according to Galton) declined as a function of distance from London. Galton is best known, though, for his twin interests in data analysis and evolution (he was a cousin of Darwin). He invented the idea of regression, which we will learn more about in Chapter 17. Galton was also responsible for establishing a lab that brought more researchers into the study of both statistics and biology, including Karl Pearson and Ronald Fisher.

Karl Pearson, like Galton, was interested in many spheres of knowledge. Pearson was motivated by biological data—in particular, by heredity and evolution. Pearson was responsible for our most often-used measure of the correlation between numerical variables. In fact, the correlation coefficient that we will learn about in Chapter 16 is often referred to as Pearson’s correlation coefficient. He also made many contributions to the study of regression analysis and invented the χ2 contingency test (Chapter 9). He also coined the term standard deviation (Chapter 3).

Last, but definitely not least, Ronald Fisher was one of the great geniuses of the 20th century.
Fisher is well known in evolutionary biology as one of the three founders of the field of theoretical population genetics. Among his many contributions are the demonstration that Mendelian inheritance is compatible with the continuous variation we see in many traits, the accepted theory for why most animals conceive equal numbers of male and female offspring, and a great deal of the mathematical machinery we use to describe the process of evolution. But his contributions did not end there. He is probably also the most important figure in the history of statistics. He developed the analysis of variance (Chapter 15), likelihood (Chapter 20), the P-value (Chapter 6), randomized experiments (Chapter 14), multiple regression, and many other tools used in data analysis. His mathematical knowledge was made practical by a lifelong association with biologists, particularly agricultural scientists at the Rothamsted Experimental Station in England. Fisher solved problems associated with the analysis of real data as he encountered them, and he generalized the solutions for application to many other related questions. Moreover, Fisher developed experimental designs that would give more information than might otherwise be possible.

What this short hagiography is intended to demonstrate is that the early history of statistics is tightly bound up with biology. Biological questions motivated the development of most of the basic statistical tools, and biologists helped to develop those tools. Biology and statistics naturally go hand in hand.

2 Displaying data

The human eye is a natural pattern detector, adept at spotting trends and exceptions in visual displays. For this reason, biologists spend hours creating and examining visual summaries of their data—graphs and, to a lesser extent, tables. Effective graphs enable visual comparisons of measurements between groups, and they expose relationships between different variables. They are also the principal means of communicating results to a wider audience.

Florence Nightingale (1858) was one of the first people to put graphs to good use. In her famous wedge diagrams, redrawn in the figure above, she visualized the causes of death of British troops during the Crimean War. The number of cases is indicated by the area of a wedge, and the cause of death by color. The diagrams showed convincingly that disease was the main cause of soldier deaths during the war, not wounds or other causes. With these vivid graphs, she successfully campaigned for military and public health measures that saved many lives.

Effective graphs are a prerequisite for good data analysis, revealing general patterns in the data that bare numbers cannot. Therefore, the first step in any data analysis or statistical procedure is to graph the data and look at it. Humans are a visual species, with brains evolved to process visual information. Take advantage of millions of years of evolution, and look at visual representations of your data before doing anything else. We’ll follow this prescription throughout the book.

In this chapter, we explain how to produce effective graphical displays of data and how to avoid common pitfalls. We’ll then review which types of graphs best show the data. The choice will depend on the type of data, numerical or categorical, and whether the goal is to show measurements of one variable or the association between two variables. There is often more than one way to show the same pattern in data, and we will compare and evaluate successful and unsuccessful approaches.
We will also mention a few tips for constructing tables, which should also be laid out to show patterns in data.

Guidelines for effective graphs

Graphs are vital tools for analyzing data. They are also used to communicate patterns in data to a wider audience in the form of reports, slide shows, and web content. The two purposes, analysis and presentation, are largely coincident, because the most revealing displays will be the best both for identifying patterns in the data and for communicating these patterns to others. Both purposes require displays that are clear, honest, and efficient. To motivate principles of effective graphs, let’s highlight some common ways in which researchers might get it wrong.

How to draw a bad graph

Figure 2.1-1 shows the results of an experiment in which maize plants were grown in pots under three nitrogen regimes and two soil water contents. Height of bars represents the average maize yield (dry weight per plant) at the end of the experiment under the six combinations of water and nitrogen. The data are real (Quaye et al. 2009), but we made the bad graph intentionally to highlight four common defects. Examine the graph before reading further and try to recognize some of them. Many graphics packages on the computer make it easy to produce flawed graphs like this one, which is probably why we still encounter them so often.

FIGURE 2.1-1 An example of a defective graph showing mean plant yield of maize grown in pots under different nitrogen and water treatments.

Mistake #1: Where are the data? Each bar in Figure 2.1-1 represents average yield of four plant pots assigned to that nitrogen and water treatment. The data points—yields of all the experimental units (pots)—are nowhere to be seen. This means we are unable to see the variation in yield between pots and compare it with the magnitude of differences between treatments. It means that any unusual observations that might distort the calculation of average yield remain hidden. It would be a challenge to add the data points to this particular graph because the bars are in the way. We’ll say more later about when bars are appropriate and when they are not.

Mistake #2: Patterns in the data are difficult to see. The three dimensions and angled perspective make it difficult to judge bar height by eye, which means that average plant growth is difficult to compare between treatments. In his classic book on information graphics, Tufte (1983) referred to 3-D and other visual embellishments as “chartjunk.” Chartjunk adds clutter that dilutes information and interferes with the ability of the eye and brain to “see” patterns in data.

Mistake #3: Magnitudes are distorted. The vertical axis on the graph, plant yield, ranges from 2 to 9 g/plant, rather than 0 to 9, which means that bar height is out of proportion to actual magnitudes.

Mistake #4: Graphical elements are unclear. Text and other figure elements are too small to read easily.

How to draw a good graph

A few straightforward principles will help to make sure that your graphs do not end up with the kind of problems illustrated in Figure 2.1-1. We follow these four rules ourselves in the remainder of the book.

■ Show the data.
■ Make patterns in the data easy to see.
■ Represent magnitudes honestly.
■ Draw graphical elements clearly.

Show the data, first and foremost (Tufte 1983). A good graph allows you to visualize the measurements and helps the eye detect patterns in the data.
Showing the data makes it possible to evaluate the shape of the distribution of data points and to compare measurements between groups. It helps you to spot potential problems, such as extreme observations, which will be useful as you decide the next step of your data analysis.

Figure 2.1-2 gives an example of what it means to show data. The study examined the role of the neurotransmitter serotonin in bringing about a transition in social behavior, from solitary to gregarious, in a desert locust (Anstey et al. 2009). This behavior change is a critical point in the production of huge locust swarms that blacken skies and ravage crops in many parts of the world. Each data point is the serotonin level of one of 30 locusts experimentally caged at high density for 0, 1, or 2 hours, with 0 representing the control. The panel on the left of Figure 2.1-2 shows the data (this type of graph is called a strip chart or dot plot). The panel on the right of Figure 2.1-2 hides the data, using bars to show only treatment averages.

FIGURE 2.1-2 A graph that shows the data (left) and a graph that hides the data (right). Data points are serotonin levels in the central nervous system of desert locusts, Schistocerca gregaria, that were experimentally crowded for 0 (the control group), 1, and 2 hours. The data points in the left panel were perturbed a small amount to the left or right to minimize overlap and make each point easier to see. The horizontal bars in the left panel indicate the mean (average) serotonin level in each group. The graph on the right shows only the mean serotonin level in each treatment (indicated by bar height). Note that the vertical axis does not have the same scale in the two graphs.

In the left panel of Figure 2.1-2, we can see lots of scatter in the data in each treatment group and plenty of overlap between groups. We see that most points fall below the treatment average, and that each group has a few extreme observations. Nevertheless, we can see a clear shift in serotonin levels of locusts between treatments. All this information is missing from the right panel of Figure 2.1-2, which uses more ink yet shows only the averages of each treatment group.

Make patterns easy to see. Try displaying your data in different ways, possibly with different types of graphs, to find the best way to communicate the findings. Is the main pattern in the data recognizable right away? If not, try again with a different method. Stay away from 3-D effects and elaborate chartjunk that obscure the patterns in the data. In the rest of the chapter we’ll compare alternative ways of graphing the same data sets and discuss their effectiveness.

Avoid putting too much information into one graph. Remember the purpose of a graph: to communicate essential patterns to eyes and brains. The purpose is not to cram as much data as possible into each graph. Think about getting the main point across with one or two key graphs in the main body of your presentation. Put the remaining graphs into an appendix or online supplement if it is important to show them to a subset of your audience.

Represent magnitudes honestly. This sounds easy, but misleading graphics are common in the scientific literature. One of the most important decisions concerns the smallest value on the vertical axis of a graph (the “baseline”). A bar graph must always have a baseline at zero, because the eye instinctively reads bar height and area as proportional to magnitude.
The upper bar graph in Figure 2.1-3 shows an example, depicting government spending on education each year since 1998 in British Columbia. The area of each bar is not proportional to the magnitude of the value displayed. As a result, the graph exaggerates the differences. The figure falsely suggests that spending increased twenty-fold over time, but the real increase is less than 20%. It is more honest to plot the bars with a baseline of zero, as in the lower graph in Figure 2.1-3 (the revised graph also removed the 3-D effects and the numbers above bars to make the pattern easier to see).

FIGURE 2.1-3 Upper graph: A bar graph, based on data from a British Columbia government brochure, indicating spending per student in different years. Lower graph: A revised presentation of the same data, in which the magnitude of the spending is proportional to the height and area of bars. This revision also removed the 3-D effects and the numbers above bars to make the pattern easier to see. The upper graph is modified from British Columbia Ministry of Education (2004).

Other types of graphs, such as strip charts, don’t always need a zero baseline if the main goal is to show differences between treatments rather than proportional magnitudes.

Draw graphical elements clearly. Clearly label the axes and choose unadorned, simple typefaces and colors. Text should be legible even after the graph is shrunk to fit the final document. Always provide the units of measurement in the axis label. Use clearly distinguishable graphical symbols if you plot with more than one kind. Don’t always accept the default output of statistical or spreadsheet programs. Up to a tenth of your male audience is red-green color-blind, so choose colors that differ in intensity and apply redundant coding to distinguish groups (for example, use distinctive shapes or patterns as well as different colors).

A good graph is like a good paragraph. It conveys information clearly, concisely, and without distortion. A good graph requires careful editing. Just as in writing, the first draft is rarely as good as the final product.

Showing data for one variable

To examine data for a single variable, we show its frequency distribution. Recall from Chapter 1 that the frequency of occurrence of a specific measurement in a sample is the number of observations having that particular measurement. The frequency distribution of a variable is the number of occurrences of all values in the data. Relative frequency is the proportion of observations having a given measurement, calculated as the frequency divided by the total number of observations. The relative frequency distribution gives the proportion of occurrences of each value in the data set.

The relative frequency distribution describes the fraction of occurrences of each value of a variable.

Showing categorical data: frequency table and bar graph

Let’s start with displays for a categorical variable. A frequency table is a text display of the number of occurrences of each category in the data set. A bar graph uses the height of rectangular bars to visualize the frequency (or relative frequency) of occurrence of each category.

A bar graph uses the height of rectangular bars to display the frequency distribution (or relative frequency distribution) of a categorical variable.

Example 2.2A illustrates both kinds of displays.

EXAMPLE 2.2A Crouching tiger

Conflict between humans and tigers threatens tiger populations, kills people, and reduces public support for conservation.
Gurung et al. (2008) investigated causes of human deaths by tigers near the protected area of Chitwan National Park, Nepal. Eighty-eight people were killed by 36 individual tigers between 1979 and 2006, mainly within 1 km of the park edge. Table 2.2-1 lists the main activities of people at the time they were killed. Such information may be helpful to identify activities that increase vulnerability to attack.

TABLE 2.2-1 Frequency table showing the activities of 88 people at the time they were attacked and killed by tigers near Chitwan National Park, Nepal, from 1979 to 2006.

Activity                                    Frequency (number of people)
Collecting grass or fodder for livestock    44
Collecting non-timber forest products       11
Fishing                                      8
Herding livestock                            7
Disturbing tiger at its kill                 5
Collecting fuel wood or timber               5
Sleeping in a house                          5
Walking in forest                            3
Using an outside toilet                      2
Total                                       88

Table 2.2-1 is a frequency table showing the number of deaths associated with each activity. Here, alternative values of the variable “activity” are listed in a single column, and frequencies of occurrence are listed next to them in a second column. The categories have no intrinsic order, but comparing the frequencies of each activity is made easier by arranging the categories in order of their importance, from the most frequent at the top to the least frequent at the bottom. The table shows that more people were killed while collecting grass and fodder for their livestock than while doing any other activity. The number of deaths under this activity was four times that of the next category of activity (collecting non-timber forest products) and is related to the amount of time people spend carrying out these activities.

The differences in frequency stand out even more vividly in the bar graph shown in Figure 2.2-1. In a bar graph, frequency is depicted by the height of rectangular bars. Unlike a frequency table, a bar graph does not usually present the actual numbers. Instead, the graph gives a clear picture of how steeply the numbers drop between categories. Some activities are much more common than others, and we don’t need the actual numbers to see this.

FIGURE 2.2-1 Bar graph showing the activities of people at the time they were attacked and killed by tigers near Chitwan National Park, Nepal, between 1979 and 2006. Total number of deaths: n = 88. The frequencies are taken from Table 2.2-1, which also gives more detailed labels of activities.

Making a good bar graph

The top edge of each bar conveys all the information about frequency, but the eye also compares the areas of the bars, which must therefore be of equal width. It is crucial that the baseline of the y-axis is at zero—otherwise, the area and height of the bars are out of proportion with actual magnitudes and so are misleading. When the categorical variable is nominal, as in Figure 2.2-1 and Table 2.2-1, the best way to arrange categories is by frequency of occurrence. The most frequent category goes first, the next most frequent category goes second, and so on. This aids in the visual presentation of the information. For an ordinal categorical variable, such as snakebite severity score, the values should be in the natural order (e.g., minimally severe, moderately severe, and very severe). Bars should stand apart, not be fused together. It is a good habit to provide the total number of observations (n) in the figure legend.

A bar graph is usually better than a pie chart

The pie chart is another type of graph often used to display frequencies of a categorical variable.
This method uses colored wedges around the circumference of a circle to represent frequency or relative frequency. Figure 2.2-2 shows the tiger data again, this time in a pie chart. This graphical method is reminiscent of Florence Nightingale’s wedge diagram shown at the beginning of this chapter.

FIGURE 2.2-2 Pie chart of the activities of people at the time they were attacked and killed by tigers near Chitwan National Park, Nepal. The frequencies are taken from Table 2.2-1. Total number of deaths: n = 88.

The pie chart has received a lot of criticism from experts in information graphics. One reason is that while it is straightforward to visualize the frequency of deaths in the first and most frequent category (Collecting grass/fodder), it is more difficult to compare frequencies in the remaining categories by eye. This problem worsens as the number of categories increases. Another reason is that it is very difficult to compare frequencies between two or more pie charts side by side, especially when there are many categories. To compensate, pie charts are often drawn with the frequencies added as text around the circle perimeter. The result is not better than a table. The shape of a frequency distribution is more readily perceived in a bar graph than in a pie chart, and it is easier to compare frequencies between two or more bar graphs than between pie charts. Use the bar graph instead of the pie chart for showing frequencies in categorical data.

Showing numerical data: frequency table and histogram

A frequency distribution for a numerical variable can be displayed either in a frequency table or in a histogram. A histogram uses the area of rectangular bars to display frequency. The data values are split into consecutive intervals, or “bins,” usually of equal width, and the frequency of observations falling into each bin is displayed.

A histogram uses the area of rectangular bars to display the frequency distribution (or relative frequency distribution) of a numerical variable.

We discuss how histograms are made in greater detail using the data in Example 2.2B.

EXAMPLE 2.2B Abundance of desert bird species

How many species are common in nature and how many are rare? One way to address this question is to construct a frequency distribution of species abundance. The data in Table 2.2-2 are from a survey of the breeding birds of Organ Pipe Cactus National Monument in southern Arizona, USA. The measurements were extracted from the North American Breeding Bird Survey, a continent-wide data set of estimated bird numbers (Sauer et al. 2003).

TABLE 2.2-2 Data on the abundance of each species of bird encountered during four surveys in Organ Pipe Cactus National Monument.
Species                     Abundance
Greater roadrunner          1
Black-chinned hummingbird   1
Western kingbird            1
Great-tailed grackle        1
Bronzed cowbird             1
Great horned owl            2
Costa's hummingbird         2
Canyon wren                 2
Canyon towhee               2
Harris's hawk               3
Loggerhead shrike           3
Hooded oriole               4
Northern mockingbird        5
American kestrel            7
Rock dove                   7
Bell's vireo                10
Common raven                12
Northern cardinal           13
House sparrow               14
Ladder-backed woodpecker    15
Red-tailed hawk             16
Phainopepla                 18
Turkey vulture              23
Violet-green swallow        23
Lesser nighthawk            25
Scott's oriole              28
Purple martin               33
Black-throated sparrow      33
Brown-headed cowbird        59
Black vulture               64
Lucy's warbler              67
Gilded flicker              77
Brown-crested flycatcher    128
Mourning dove               135
Gambel's quail              148
Black-tailed gnatcatcher    152
Ash-throated flycatcher     173
Curve-billed thrasher       173
Cactus wren                 230
Verdin                      282
House finch                 297
Gila woodpecker             300
White-winged dove           625

We treated each bird species in the survey as the unit of interest and the abundance of a species in the survey as its measurement. The range of abundance values was divided into 13 intervals of equal width (0–50, 50–100, and so on), and the number of species falling into each abundance interval was counted and presented in a frequency table to help see patterns (Table 2.2-3).

TABLE 2.2-3 Frequency distribution of bird species abundance at Organ Pipe Cactus National Monument.

Abundance   Frequency (number of species)
0–50        28
50–100      4
100–150     3
150–200     3
200–250     1
250–300     2
300–350     1
350–400     0
400–450     0
450–500     0
500–550     0
550–600     0
600–650     1
Total       43

Source: Data are from Table 2.2-2.

Although the table shows the numbers, the shape of the frequency distribution is more obvious in a histogram of these same data (Figure 2.2-3). Here, frequency (number of species) in each abundance interval is perceived as bar area.

FIGURE 2.2-3 Histogram illustrating the frequency distribution of bird species abundance at Organ Pipe Cactus National Monument. Total number of bird species: n = 43.

The frequency table and histogram of the bird abundance data reveal that the majority of bird species have low abundance. Frequency falls steeply with increasing abundance.3 The white-winged dove (pictured in Example 2.2B) is exceptionally abundant at Organ Pipe Cactus National Monument, accounting for a large fraction of all individual birds encountered in the survey.

Describing the shape of a histogram

The histogram reveals the shape of a frequency distribution. Some of the most common shapes are displayed in Figure 2.2-4. Any interval of the frequency distribution that is noticeably more frequent than surrounding intervals is called a peak. The mode is the interval corresponding to the highest peak. For example, a bell-shaped frequency distribution has a single peak (the mode) in the center of the range of observations. A frequency distribution having two distinct peaks is said to be bimodal.

The mode is the interval corresponding to the highest peak in the frequency distribution.

FIGURE 2.2-4 Some possible shapes of frequency distributions.

A frequency distribution is symmetric if the pattern of frequencies on the left half of the histogram is the mirror image of the pattern on the right half. The uniform distribution and the bell-shaped distribution in Figure 2.2-4 are symmetric. If a frequency distribution is not symmetric, we say that it is skewed. The distribution in Figure 2.2-4 labeled "Asymmetric" has left or negative skew: it has a long tail extending to the left.
The distribution in Figure 2.2-4 labeled "Bimodal" is also asymmetric but is positively skewed: its long tail is to the right.4 The abundance data for desert bird species also have positive skew (Figure 2.2-3), which means they have a long tail extending to the right.

Skew refers to asymmetry in the shape of a frequency distribution for a numerical variable.

Extreme data points lying well away from the rest of the data are called outliers. The histogram of desert bird abundance (Figure 2.2-3) includes one extreme observation (the white-winged dove) that falls well outside the range of abundance of the other bird species. The white-winged dove, therefore, is an outlier. Outliers are common in biological data. They can result from mistakes in recording the data or, as in the case of the white-winged dove, they may represent real features of nature. Outliers should always be investigated. They should be removed from the data only if they are found to be errors.

An outlier is an observation well outside the range of values of other observations in the data set.

How to draw a good histogram

When drawing a histogram, the choice of interval width must be made carefully because it can affect the conclusions. For example, Figure 2.2-5 shows three different histograms that depict the body mass of 228 female sockeye salmon (Oncorhynchus nerka) from Pick Creek, Alaska, in 1996 (Hendry et al. 1999). The leftmost histogram of Figure 2.2-5 was drawn using a narrow interval width. The result is a somewhat bumpy frequency distribution that suggests the existence of two or even more peaks. The rightmost histogram uses a wide interval. The result is a smoother frequency distribution that masks the second of the two dominant peaks. The middle histogram uses an intermediate interval that shows two distinct body-size groups; the fluctuations from interval to interval within size groups are less noticeable.

FIGURE 2.2-5 Body mass of 228 female sockeye salmon sampled from Pick Creek in Alaska (Hendry et al. 1999). The same data are shown in each case, but the interval widths are different: 0.1 kg (left), 0.3 kg (middle), and 0.5 kg (right).

To choose the ideal interval width, we must decide whether the two distinct body-size groups are likely to be "real," in which case the histogram should show both, or whether a bimodal shape is an artifact produced by too few observations.5

When you draw a histogram, each bar must rise from a baseline of zero, so that the area of each bar is proportional to frequency. Unlike the bars of a bar graph, adjacent histogram bars are contiguous, with no spaces between them. This reinforces the perception of a numerical scale, with bars grading from one into the next. In this book we follow convention by placing an observation whose value falls exactly at the boundary of two successive intervals into the higher interval. For example, the Gila woodpecker, with a total of 300 individuals observed (Table 2.2-2), is recorded in the interval 300–350, not in the interval 250–300.

There are no strict rules about the number of intervals that should be used in frequency tables and histograms. Some computer programs use Sturges's rule of thumb, in which the number of intervals is 1 + ln(n)/ln(2), where n is the number of observations and ln is the natural logarithm. The resulting number is then rounded up to the next integer (Venables and Ripley 2002). Many regard this rule as overly conservative, and in this book we tend to use a few more intervals than Sturges.
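To make the trade-off concrete, the following Python sketch computes Sturges's suggestion and redraws one data set at the three interval widths used in Figure 2.2-5. The salmon masses here are simulated stand-ins with two overlapping size groups, not the Pick Creek measurements.

```python
# Trying several interval widths, as recommended above.
import math
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Simulated stand-in for the 228 salmon body masses (kg), two size groups
masses = np.concatenate([rng.normal(1.5, 0.15, 100),
                         rng.normal(2.3, 0.20, 128)])

n = len(masses)
sturges = math.ceil(1 + math.log(n) / math.log(2))  # 1 + ln(n)/ln(2), rounded up
print(f"n = {n}; Sturges's rule suggests {sturges} intervals")

fig, axes = plt.subplots(1, 3, sharey=True)
for ax, width in zip(axes, [0.1, 0.3, 0.5]):        # widths as in Figure 2.2-5
    bins = np.arange(0.5, 3.5 + width, width)       # readable breakpoints
    ax.hist(masses, bins=bins, edgecolor="white")
    ax.set_title(f"width = {width} kg")
    ax.set_xlabel("Body mass (kg)")
axes[0].set_ylabel("Frequency")
plt.show()
```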
The number of intervals should be chosen to best show patterns and exceptions in the data, and this requires good judgment rather than strict rules. Computers allow you to try several alternatives to help you determine the best option. When breaking the data into intervals for the histogram, use readable numbers for the breakpoints—for example, break at 0.5 rather than at 0.483. Finally, it is a good idea to provide the total number of individuals in the accompanying legend.

Other graphs for numerical data

The histogram is recommended for showing the frequency distribution of a single numerical variable. The box plot and the strip chart are alternatives, but most often these are used to show differences when there are data from two or more groups. We describe these graphs in the next section. Another type of graph, the cumulative frequency distribution, is explained in Chapter 3.

2.3 Showing association between two variables

Here we illustrate how to show data for two variables simultaneously, rather than one at a time. The goal is to create an image that visualizes association or correlation between the two variables and differences between groups. The most suitable type of graph depends on whether both variables are categorical, both are numerical, or one is of each data type.

Showing association between categorical variables

If two categorical variables are associated, the relative frequencies for one variable will differ among categories of the other variable. To reveal such association, show the frequencies using a contingency table, a mosaic plot, or a grouped bar graph. Here's an example.

EXAMPLE 2.3A Reproductive effort and avian malaria

Is reproduction hazardous to health? If not, then it is difficult to explain why adults in many organisms seem to hold back on the number of offspring they raise. Oppliger et al. (1996) investigated the impact of reproductive effort on susceptibility to malaria6 in wild great tits (Parus major) breeding in nest boxes. They divided 65 nesting females into two treatment groups. In one group of 30 females, each bird had two eggs stolen from her nest, causing the female to lay an additional egg. The extra effort required might increase stress on these females. The remaining 35 females were left alone, establishing the control group. A blood sample was taken from each female 14 days after her eggs hatched to test for infection by avian malaria.

The association between experimental treatment and the incidence of malaria is displayed in Table 2.3-1. This table is known as a contingency table, a frequency table for two (or more) categorical variables. It is called a contingency table because it shows how the frequencies of the categories in a response variable (the incidence of malaria, in this case) are contingent upon the value of an explanatory variable (the experimental treatment group).

A contingency table gives the frequency of occurrence of all combinations of two (or more) categorical variables.

TABLE 2.3-1 Contingency table showing the incidence of malaria in female great tits in relation to experimental treatment.

                 Experimental treatment group
                 Control group   Egg-removal group   Row total
Malaria               7                15                22
No malaria           28                15                43
Column total         35                30                65

Each experimental unit (bird) is counted exactly once in the four "cells" of Table 2.3-1, and so the total count (65) is the number of birds in the study. A cell is one combination of categories of the row and column variables in the table.
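A contingency table like Table 2.3-1 is easy to assemble from raw observations, one row per sampled unit. Here is a sketch with the pandas library (our choice of tool; the individual records below are reconstructed only so that the counts match the table):

```python
# Building a contingency table from one-row-per-bird data.
import pandas as pd

birds = pd.DataFrame({
    "treatment": ["Control"] * 35 + ["Egg removal"] * 30,
    "malaria":   ["Malaria"] * 7  + ["No malaria"] * 28    # 35 control birds
               + ["Malaria"] * 15 + ["No malaria"] * 15,   # 30 egg-removal birds
})

# margins=True adds the row and column totals
table = pd.crosstab(birds["malaria"], birds["treatment"], margins=True)
print(table)
```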
The explanatory variable (experimental treatment) is displayed in the columns, whereas the response variable, the variable being predicted (incidence of malaria), is displayed in the rows. The frequency of subjects in each treatment group is given in the column totals, and the frequency of subjects with and without malaria is given in the row totals. According to Table 2.3-1, malaria was detected in 15 of the 30 birds subjected to egg removal, but in only seven of the 35 control birds. This difference between treatments suggests that the stress of egg removal, or the effort involved in producing one extra egg, increases female susceptibility to avian malaria.

Table 2.3-1 is an example of a 2 × 2 ("two-by-two") contingency table, because it displays the frequency of occurrence of all combinations of two variables, each having exactly two categories. Larger contingency tables are possible if the variables have more than two categories.

Two types of graph work best for displaying the relationship between a pair of categorical variables. The grouped bar graph uses the heights of rectangles to graph the frequency of occurrence of all combinations of two (or more) categorical variables. Figure 2.3-1 shows the grouped bar graph for the avian malaria experiment. Grouped bar graphs are like bar graphs for single variables, except that different categories of the response variable (e.g., malaria and no malaria) are indicated by different colors or shades. Bars are grouped by the categories of the explanatory variable, treatment (control and egg removal); make sure that the spaces between bars from different groups are wider than the spaces between bars for categories of the response variable within a group. We can see from the grouped bar graph in Figure 2.3-1 that the incidence of malaria is associated with treatment, because the relative heights of the bars for malaria and no malaria differ between treatments. Most birds in the control group had no malaria (the gold bar is much taller than the red bar), whereas in the experimental group, the frequencies of subjects with and without malaria were equal.

A grouped bar graph uses the height of rectangular bars to display the frequency distributions (or relative frequency distributions) of two or more categorical variables.

FIGURE 2.3-1 Grouped bar graph for reproductive effort and avian malaria in great tits. The data are from Table 2.3-1, where n = 65 birds.

A mosaic plot is similar to a grouped bar graph except that bars within treatment groups are stacked on top of one another (Figure 2.3-2). Within a stack, bar area and height indicate the relative frequencies (i.e., the proportions) of the responses. This makes it easy to see the association between treatment and response variables: if an association is present in the data, then the vertical position at which the colors meet will differ between stacks. If no association is present, then the meeting point between the colors will be at the same vertical position in both stacks. In Figure 2.3-2, for example, few individuals in the control group were infected with malaria, so the red bar (malaria) meets the gold bar (no malaria) at a higher vertical position than in the egg-removal stack, where the incidence of malaria was greater.

The mosaic plot uses the area of rectangles to display the relative frequency of occurrence of all combinations of two categorical variables.

FIGURE 2.3-2 Mosaic plot for reproductive effort and avian malaria in great tits. Red indicates birds with malaria, whereas gold indicates birds free of malaria. The data are from Table 2.3-1, where n = 65 birds.
Another feature of the mosaic plot is that the width of each vertical stack is proportional to the number of observations in that group. In Figure 2.3-2, the wider stack for the control group reflects the greater total number of individuals in this treatment (35) compared with the number in the egg-removal treatment (30). As a result, the total area of each box is proportional to the relative frequency of that combination of variables in the whole data set. A mosaic plot provides only relative frequencies, not the absolute frequency of occurrence of each combination of variables. This might be considered a drawback, but keep in mind that the most important goal of graphs is to depict the pattern in the data rather than exact figures. Here, the pattern is the association between treatment and response variables: the difference in the relative frequencies of diseased birds in the two treatments.

Of the three methods for presenting the same data—the contingency table, the mosaic plot, and the grouped bar graph—which is best? The answer depends on the circumstances, and it is a good idea to try all three to evaluate their effectiveness for any data set. It is usually easier to see differences in relative frequency between groups when the data are visualized in a grouped bar graph or mosaic plot than in a contingency table. On the other hand, a contingency table might work best if one of the response categories is vastly more frequent than the other, making it difficult to see the bars corresponding to rare categories in a graph, or if the explanatory and response variables have many categories, increasing the complexity of the graph. We find that association, or lack of association, is easier to see in a mosaic plot than in a grouped bar graph, but this will not always be the case. Deciding which type of display is most effective in a given circumstance is best done by trying several methods and choosing among them on the basis of information, clarity, and simplicity.

Showing association between numerical variables: scatter plot

Use a scatter plot to show the association between two numerical variables. Position along the horizontal axis (the x-axis) indicates the measurement of the explanatory variable. Position along the vertical axis (the y-axis) indicates the measurement of the response variable. The pattern in the resulting cloud of points indicates whether an association between the two variables is positive (in which case the points tend to run from the lower left to the upper right of the graph), negative (the points run from the upper left to the lower right), or absent (no discernible pattern). Example 2.3B shows an example.

EXAMPLE 2.3B Sins of the father

The bright colors and elaborate courtship displays of the males of many species posed a problem for Charles Darwin: how can such elaborate traits evolve? His answer was that they evolved because females are attracted to them when choosing a mate. But why would females choose those kinds of males? One possible answer: females that choose fancy males have attractive sons as well. A laboratory study examined how attractive traits in guppies are inherited from father to son (Brooks 2000).
The attractiveness of sons (a score representing the rate of visits by females to corralled males, relative to a standard) was compared with their fathers' ornamentation (a composite index of several aspects of male color and brightness). The father's ornamentation is the explanatory variable in the resulting scatter plot of these data (Figure 2.3-3).

FIGURE 2.3-3 Scatter plot showing the relationship between the ornamentation of male guppies and the average attractiveness of their sons. Total number of families: n = 36.

Each dot in the scatter plot is a father-son pair. The father's ornamentation is the explanatory variable, and the son's attractiveness is the response variable. The plot shows a positive association between these variables (note how the points tend to run from the lower left to the upper right of the graph). Thus, the sexiest sons come from the most gloriously ornamented fathers, whereas unadorned fathers produce less attractive sons on average.

A scatter plot is a graphical display of two numerical variables in which each observation is represented as a point on a graph with two axes.

Showing association between a numerical and a categorical variable

There are several good methods to show an association between a numerical variable and a categorical variable. Three that we recommend are the strip chart (which we first saw in Figure 2.1-2), the box plot, and the multiple-histograms method. Here we compare these methods with an example. We recommend against the common practice of using a bar graph, because the bars make it difficult to show the data (bar graphs are ideal for frequency data). Showing an association between a numerical and a categorical variable is the same as showing a difference in the numerical variable between groups.

EXAMPLE 2.3C Blood responses to high elevation

The amount of oxygen obtained in each breath at high altitude can be as low as one-third that obtained at sea level. Studies have begun to examine whether indigenous people living at high elevations have physiological attributes that compensate for the reduced availability of oxygen. A reasonable expectation is that they should have more hemoglobin, the molecule that binds and transports oxygen in the blood. To test this, researchers sampled blood from males in three high-altitude human populations: the high Andes, high-elevation Ethiopia, and Tibet, along with a sea-level population from the USA (Beall et al. 2002). Results are shown in Figures 2.3-4 and 2.3-5.

FIGURE 2.3-4 Strip chart (left) and box plot (right) showing hemoglobin concentration in males living at high altitude in three different parts of the world: the Andes (71), Ethiopia (128), and Tibet (59). A fourth population of 1704 males living at sea level (USA) is included as a control. Sample sizes are given in parentheses.

FIGURE 2.3-5 Multiple histograms showing the hemoglobin concentration in males of the four populations. The number of measurements in each group is given in Figure 2.3-4.

The left panel of Figure 2.3-4 shows the hemoglobin data in a strip chart (sometimes also called a dot plot). In a strip chart, each observation is represented as a dot on a graph showing its numerical measurement on one axis (here, the vertical or y-axis) and the category (group) to which it belongs on the other (here, the horizontal or x-axis). A strip chart is like a scatter plot except that the explanatory variable is categorical rather than numerical.
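A strip chart takes only a few lines of code. The sketch below (Python with matplotlib; the data are simulated placeholders, not the Beall et al. measurements) jitters the points horizontally, a trick explained in the next paragraph:

```python
# A jittered strip chart for two hypothetical groups.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
groups = {"Sea level": rng.normal(15.5, 1.0, 50),     # invented values
          "High altitude": rng.normal(19.0, 1.5, 50)}

fig, ax = plt.subplots()
for i, (name, values) in enumerate(groups.items()):
    x = i + rng.uniform(-0.15, 0.15, len(values))     # horizontal jitter
    ax.plot(x, values, "o", alpha=0.5)
ax.set_xticks(range(len(groups)))
ax.set_xticklabels(list(groups))
ax.set_ylabel("Hemoglobin concentration (g/dL)")
plt.show()
```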
It is usually necessary to spread, or "jitter," the points along the horizontal axis to reduce the overlap of points so that they can be seen more easily. The strip chart method worked well in Figure 2.1-2, where there were few data points in each group. However, several of the male populations in Example 2.3C have so many observations that the points overlap too much in the strip chart, making it difficult to see the individual dots and their distribution (left panel of Figure 2.3-4).

The strip chart is a graphical display of a numerical variable and a categorical variable in which each observation is represented as a dot.

An alternative method to "show the data" is the box plot, which uses lines and rectangles to display a compact summary of the frequency distribution of the data (right panel of Figure 2.3-4). A box plot doesn't show most of the data points but instead uses lines and boxes to indicate where the bulk of the observations lie. The scale of the vertical axis is the same in both panels of Figure 2.3-4 so that you can see the correspondence between boxes and data points.

A box plot is a graph that uses lines and a rectangular box to display the median, quartiles, range, and extreme measurements of the data.

The line inside each box (right panel of Figure 2.3-4) is the median, the middle measurement of the group of observations. Half the observations lie above the median, and half lie below. The lower and upper edges of each box are the first and third quartiles. One-fourth of the observations lie below the first quartile (three-fourths lie above). Conversely, three-fourths of the observations lie below the third quartile (one-fourth lie above). Two lines, called whiskers, extend outward from a box at each end. The whiskers stop at the smallest and largest "non-extreme" values in the data. Extreme values are plotted as isolated dots past the ends of the whiskers. We explain these quantities in more detail, and how to calculate them, in Chapter 3.

The box plot in Figure 2.3-4 shows key features of the four frequency distributions using just a few graphical elements. We can clearly see in this graph that only men from the high Andes had elevated hemoglobin concentrations, whereas men from high-elevation Ethiopia and Tibet were not noticeably different in hemoglobin concentration from the sea-level group.7 We can also see that the shapes of the distributions are relatively similar in the four groups of males: the span of the boxes is similar in the four groups, although greatest in the Andean males and least in the USA males. The lengths of the whiskers are also fairly similar. USA males have the most extreme observations, but the shape of the distribution, as indicated by the box and whiskers, is similar to that in the other groups.

The third method uses multiple histograms, one for each category, to show the data, as in Figure 2.3-5. It is important that the histograms be stacked above one another, as shown, so that the position and spread of the data are most easily compared. Side-by-side histograms lose most of the advantages of the multiple-histogram method for visualizing association, because differences in the position of bars between groups are difficult to see. Use the same scale along the horizontal axis to allow comparison.

Of the three methods for showing association between a numerical and a categorical variable (a difference between groups), which is best? The strip chart shows all the data points, which is ideal when there are only a few observations in each category.
The box plot picks out a few of the most important features of the frequency distribution and is more suitable when the number of observations is large. The multiple-histogram plot shows more features of the frequency distribution but takes up more space than the other two options; it works best when there are only a few categories. As usual, the best strategy is to try all three methods on your data and judge which method shows the association most clearly in that situation.

2.4 Showing trends in time and space

Often a variable of interest represents a summary measurement taken at consecutive points in time or space. In this case, line graphs and maps are excellent visual tools.

Line graph

A line graph is a powerful tool for displaying trends in time or other ordered series. Typically, one y-measurement is displayed for each point in time or space, which is displayed along the x-axis. Adjacent points along the x-axis are connected by a line segment. Example 2.4A illustrates the line graph.

A line graph uses dots connected by line segments to display trends in a measurement over time or other ordered series.

EXAMPLE 2.4A Bad science can be deadly

Since the introduction of a measles vaccine, the number of cases in the U.K. has dropped dramatically. A disease that once killed hundreds of people per year in the U.K. became a negligible risk as most of the population was immunized. However, recent declines in the fraction of people vaccinated, in part from unfounded fears concerning the safety of the vaccine,8 have caused a resurgence in the number of cases of measles (Jansen et al. 2003). The number of cases in each quarter-year between 1995 and 2011 is shown in a line graph in Figure 2.4-1 (data from Health Protection Agency 2012).

FIGURE 2.4-1 Confirmed cases of measles in England and Wales from 1995 to 2011. The four numbers in each year refer to new cases in each quarter.

The trends in the number of measles cases over time are made more visible by the lines connecting the points in Figure 2.4-1. The steepness of the line segments reflects the speed of change in the number of cases from one quarter-year to the next. Notice, for example, how steeply the number of cases rises when an outbreak begins, and then how cases decline just as quickly afterward, as immunity spreads. When the baseline for the vertical axis is zero, as in the present example, the area under the curve between two time points is proportional to the total number of new cases in that period.

Maps

A map is the spatial equivalent of the line graph, using a color gradient to display a numerical response variable at multiple locations on a surface. The explanatory variable is location in space. One measurement is displayed for each point or interval of the surface, as shown in Example 2.4B.

EXAMPLE 2.4B Biodiversity hotspots

South America is renowned for its extraordinary numbers of species. We tend to think of the vast expanse of lowland Amazon rainforest as the seat of this abundance. The data shown in Figure 2.4-2 are the numbers of plant species recorded at many points on a fine grid covering the northern part of South America. Points are colored such that "hotter" colors represent more plant species at each point. The image shows that peak diversities actually occur at the northwest coast, the nearby inland areas where the Andes mountains meet the lowland rainforest, and along the southeast coast of Brazil.

FIGURE 2.4-2 Map displaying numbers of plant species in northern South America.
Colors reflect the numbers of species estimated at many points in a fine grid, with each point consisting of an area 100 km × 100 km. The color scale is on the right, with hotter colors reflecting more species. The horizontal gray line is the equator. Modified from Bass et al. (2010).

The map in Figure 2.4-2 contains an enormous amount of data, yet the pattern is easy to see. The regions of peak diversity, and those of relatively low diversity, are clearly evident.

Maps can be used for measurements at points on any two-dimensional surface, including a spatial grid (as in the plant species richness example) or political or geological boundaries on a map. They can also be used to indicate measurements at locations on the surface of two- or three-dimensional objects, such as the brain or the body. For example, a visual representation of an MRI scan is also a map.

2.5 How to make good tables

Tables have two functions: to communicate patterns and to record quantitative data summaries for further use. When the main function of a table is to display the patterns in the data—a "display table"—numerical detail is less important than the effective communication of results. This is the kind of table that would appear in the main body of a report or publication. Compact frequency tables are examples of display tables; for example, look at Table 2.2-3, which shows the number of bird species in a sequence of abundance categories. In this section we summarize strategies for making good display tables.

The purpose of the second kind of table is to store raw data and numerical summaries for reference purposes. Such "data tables" are often large and are not ideal for recognizing patterns in data. They are inappropriate for communicating general findings to a wider audience, but they are nevertheless often invaluable. Table 2.2-2, which lists the abundances of every bird species in a survey, is an example. Data tables aren't usually included in the main body of a report. When published, they usually appear in appendices or online supplements, where specialized readers interested in more details can find them.

Follow similar principles for display tables

Producing clear, honest, and efficient display tables follows many of the same principles discussed already for graphs. In particular,

■ Make patterns in the data easy to see.
■ Represent magnitudes honestly.
■ Draw table elements clearly.

Make patterns easy to see. Make the table compact, and present as few significant digits as are necessary to communicate the pattern. Avoid putting too much data into one table. Arrange the rows and columns of numbers to facilitate pattern detection. For example, a series of numbers listed above one another in a single column is easier to compare than the same numbers listed side by side in different columns. Our earlier recommendations for frequency tables apply here (Section 2.2). For example, list unordered categorical (nominal) data in order of importance (frequency) rather than alphabetically or otherwise. If the categorical variables have a natural order (such as life stages: zygote, fetus, newborn, adolescent, adult), they should be listed in that order.

Represent magnitudes honestly. For example, when combining numbers into bins in frequency tables, use intervals of equal width so that the numbers can be more accurately compared.

Draw table elements clearly. Clearly label row and column headers, and always provide units of measurement. Choose unadorned, simple fonts.
Let's look at an example of a display table and then consider how it might be improved. The data in Table 2.5-1 were put together by Alvarez et al. (2009) to investigate the idea that a strong preference for consanguineous marriages (inbreeding) within the line of Spanish Habsburg kings, which ruled Spain from 1516 to 1700, contributed to its downfall. The quantity F is a measure of inbreeding in the offspring. F is zero if king and queen were unrelated, and F is 0.25 if they were brother and sister whose own parents were unrelated. Values may be lower or higher if there was inbreeding further generations back.

TABLE 2.5-1 Inbreeding coefficient (F) of Spanish Habsburg kings and queens and survival of their progeny.

King/Queen                F      Pregnancies   Miscarriages &   Neonatal   Later    Survivors   Survival   Survival
                                               stillbirths      deaths     deaths   to age 10   (total)    (postnatal)
Ferdinand of Aragon
Elizabeth of Castile     0.039        7              2              0          0         5         0.714      1.000

Philip I
Joanna I                 0.037        6              0              0          0         6         1.000      1.000

Charles I
Isabella of Portugal     0.123        7              1              1          2         3         0.429      0.600

Philip II
Elizabeth of Valois      0.008        4              1              1          0         2         0.500      1.000
Anna of Austria          0.218        6              1              0          4         1         0.167      0.200

Philip III
Margaret of Austria      0.115        8              0              0          3         5         0.625      0.625

Philip IV
Elizabeth of Bourbon     0.050        7              0              3          2         2         0.286      0.500
Mariana of Austria       0.254        6              0              1          3         2         0.333      0.400

Source: Data are from Alvarez et al. (2009).

There is a tendency for less related kings and queens to produce a higher proportion of surviving offspring, but it is not so easy to see this in Table 2.5-1. Before reading any further, examine the table and make a note of any deficiencies. How might these deficiencies be overcome by modifying the table?

Let's apply the principles of effective display to improve this table. Consider that the main goal of producing the table should be to show a pattern, in this case a possible association between F and offspring survival. Here is a list of features of Table 2.5-1 that we felt made it difficult to see this pattern.

■ King/queen pairs are not ordered in such a way as to make it easy for the eye to see any association.
■ The main variables of interest, F and survival, are separated by intervening columns.
■ Blank lines are inserted for every new king listed, fragmenting any pattern.
■ The number of decimal places is overly large, making it difficult to read the numbers.

To overcome these problems, we have extracted the most crucial columns and reorganized them in Table 2.5-2. In this revised table, king and queen pairs are ordered by the F value of the offspring, and survival has been placed in the adjacent column. Blank lines have been eliminated along with unnecessary columns, and decimals have been rounded to two places.

TABLE 2.5-2 Inbreeding coefficient (F) of Spanish kings and queens and survival of their progeny. These data are extracted and reorganized from Table 2.5-1.

King/Queen                        F     Survival      Survival   Number of
                                        (postnatal)   (total)    pregnancies
Philip II/Elizabeth of Valois    0.01      1.00          0.50         4
Philip I/Joanna I                0.04      1.00          1.00         6
Ferdinand/Elizabeth of Castile   0.04      1.00          0.71         7
Philip IV/Elizabeth of Bourbon   0.05      0.50          0.29         7
Philip III/Margaret of Austria   0.12      0.63          0.63         8
Charles I/Isabella of Portugal   0.12      0.60          0.43         7
Philip II/Anna of Austria        0.22      0.20          0.17         6
Philip IV/Mariana of Austria     0.25      0.40          0.33         6

The revised Table 2.5-2 suggests that survival of more inbred progeny tends to be lower, at least when measured as postnatal survival. The trend appears weaker for total survival, which includes prenatal and neonatal survival.
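The same reorganization can be scripted. Below is a sketch with the pandas library (our tool choice; the column names are invented for illustration) that keeps only the key columns, orders the rows by F, and rounds to two decimals, reproducing the logic behind Table 2.5-2:

```python
# Reorganizing a data table into a display table.
import pandas as pd

habsburg = pd.DataFrame({
    "pair": ["Ferdinand/Elizabeth of Castile", "Philip I/Joanna I",
             "Charles I/Isabella of Portugal", "Philip II/Elizabeth of Valois",
             "Philip II/Anna of Austria", "Philip III/Margaret of Austria",
             "Philip IV/Elizabeth of Bourbon", "Philip IV/Mariana of Austria"],
    "F": [0.039, 0.037, 0.123, 0.008, 0.218, 0.115, 0.050, 0.254],
    "survival_postnatal": [1.000, 1.000, 0.600, 1.000, 0.200, 0.625, 0.500, 0.400],
    "survival_total": [0.714, 1.000, 0.429, 0.500, 0.167, 0.625, 0.286, 0.333],
    "pregnancies": [7, 6, 7, 4, 6, 8, 7, 6],
})

display_table = (habsburg
                 .sort_values("F")   # order rows to reveal the trend
                 .round(2))          # two decimal places are enough here
print(display_table.to_string(index=False))
```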
Just as for a graph, a good table must convey information clearly, concisely, and without distortion. A good table requires careful editing. See Ehrenberg (1977) for further insights into how to draw tables.

2.6 Summary

■ Graphical displays must be clear, honest, and efficient.
■ Strive to show the data, to make patterns in the data easy to see, to represent magnitudes honestly, and to draw graphical elements clearly.
■ Follow the same rules when constructing tables to reveal patterns in the data.
■ A frequency table is used to display a frequency distribution for categorical or numerical data.
■ Bar graphs and histograms are recommended graphical methods for displaying frequency distributions of categorical and numerical variables:

Type of data        Graphical method
Categorical data    Bar graph
Numerical data      Histogram

■ Contingency tables describe the association between two (or more) categorical variables by displaying the frequencies of all combinations of categories.
■ Recommended graphical methods for displaying associations between variables and differences between groups include the following:

Types of data                 Graphical method
Two numerical variables       Scatter plot
                              Line plot (space or time)
                              Map (space)
Two categorical variables     Grouped bar graph
                              Mosaic plot
One numerical variable and    Strip chart
one categorical variable      Box plot
                              Multiple histograms
                              Cumulative frequency distributions (Chapter 3)

PRACTICE PROBLEMS

1. Estimate by eye the relative frequency of the shaded areas in each of the following histograms.

2. Using a graphical method from this chapter, draw three frequency distributions: one that is symmetric, one that is skewed, and one that is bimodal.
a. Identify the mode in each of your frequency distributions.
b. Does your skewed distribution have negative or positive skew?
c. Is your bimodal distribution skewed or symmetric?

3. In the southern elephant seal, males defend harems that may contain hundreds of reproductively active females. Modig (1996) recorded the numbers of females in harems in a population on South Georgia Island. The histograms of the data (below, drawn from data in Modig 1996) are unusual because the rarer, larger harems have been divided into wider intervals. In the upper histogram, bar height indicates the relative frequency of harems in the interval. In the lower histogram, bar height is adjusted such that bar area indicates relative frequency. Which histogram is correct? Why?

4. Draw scatter plots for invented data that illustrate the following patterns:
a. Two numerical variables that are positively associated
b. Two numerical variables that are negatively associated
c. Two numerical variables whose relationship is nonlinear

5. A study by Miller et al. (2004) compared the survival of two kinds of Lake Superior rainbow trout fry (babies). Four thousand fry were from a government hatchery on the lake, whereas 4000 more fry came from wild trout. All 8000 fry were released into a stream flowing into the lake, where they remained for one year. After one year, the researchers found 78 survivors. Of these, 27 were hatchery fish and 51 were wild. Display these results in the most appropriate table. Identify the type of table you used.

6. The following data are the occurrences in 2012 of the different taxa in the list of endangered and threatened species under the U.S. Endangered Species Act (U.S. Fish and Wildlife Service 2012). The taxa are listed in no particular order in the table.
Taxon              Number of species
Birds              93
Clams              83
Reptiles           36
Fish               152
Crustaceans        22
Mammals            85
Snails             40
Flowering plants   782
Amphibians         26
Insects            66
Arachnids          12

a. Rewrite the table, but list the taxa in a more revealing order. Explain the reasons behind the ordering you chose.
b. What kind of table did you construct in part (a)?
c. Choosing the most appropriate graphical method, display the number of species in each taxon. What kind of graph did you choose? Why?
d. Should the baseline for the number of species in your graph in part (c) be 0 or 12, the smallest number in the data set? Why?
e. Create a version of this table that shows the relative frequency of endangered species by taxon.

7. Can environmental factors influence the incidence of schizophrenia? A recent project measured the incidence of the disease among children born in a region of eastern China: 192 of 13,748 babies born in the midst of a severe famine in the region in 1960 later developed schizophrenia. This compared with 483 schizophrenics out of 59,088 births in 1956, before the famine, and 695 out of 83,536 births in 1965, after the famine (St. Clair et al. 2005).
a. What two variables are compared in this example?
b. Are the variables numerical or categorical? If numerical, are they continuous or discrete? If categorical, are they nominal or ordinal?
c. Effectively display the findings in a table. What kind of table did you use?
d. In each of the three years, calculate the relative frequency (proportion) of children born who later developed schizophrenia. Plot these proportions in a line graph. What pattern is revealed?

8. Human diseases differ in their virulence, which is defined as their ability to cause harm. Scientists are interested in determining what features of different diseases make some more dangerous to their hosts than others. The graph below depicts the frequency distribution of virulence measurements, on a log base-10 scale, of a sample of human diseases (modified from Ewald 1993). Diseases that spread from one victim to another by direct contact between people are shown in the upper graph. Those transmitted from person to person by insect vectors are shown in the lower graph.
a. Identify the type of graph displayed.
b. What are the two groups being compared in this graph?
c. What variable is being compared between the two groups? Is it numerical or categorical?
d. Explain the units on the vertical (y) axis.
e. What is the main result depicted by this graph?

9. Examine the figure below, which indicates the date of first occurrence of rabies in raccoons in the townships of Connecticut, measured by the number of months following March 1, 1991 (modified from Smith et al. 2002).
a. Identify the type of graph shown.
b. What is the response variable?
c. What is the explanatory variable?
d. What was the direction of spread of the disease (from where, to where, approximately)?

10. The following graph is taken from a study of married women who had been raised by adoptive parents (Bereczkei et al. 2004). It shows the facial resemblance between the women and their husbands (first bar), between their husbands and the women's adoptive fathers (second bar), and between their husbands and the women's adoptive mothers (third bar). Facial resemblance of a given woman to her husband was scored by 242 "judges," each of whom was given a photograph of the woman, photos of three other women, and a photo of her husband.
The judges were asked to decide from the photos which of the four women was the wife of the husband, based on facial similarity. Her resemblance was scored as the percentage of judges who chose correctly. If there was no resemblance between a given woman and her husband, then by chance only one in four judges (25%) should have chosen correctly. Resemblance of the woman's husband to the wife's adoptive father and to the wife's adoptive mother was measured in the same way.
a. Describe the essential findings displayed in the figure.
b. Which two principles of good graph design are violated in this figure?

11. Each of the following graphs illustrates an association between two variables. For each graph, identify (1) the type of graph, (2) the explanatory and response variables, and (3) the type of data (whether numerical or categorical) for each variable.
a. Observed fruiting of individual plants in a population of Campanula americana according to the number of fruits produced previously (Richardson and Stephenson 1991):
b. The maximum density of wood produced at the end of the growing season in white spruce trees in Alaska in different years (modified from Barber et al. 2000):
c. Relative expression levels of Neuropeptide Y (NPY), a gene whose activity correlates with anxiety and is induced by stress, in the brains of people differing in their genotypes at the locus (Zhou et al. 2008).

12. The following data are from the Cambridge Study in Delinquent Development (see Problem 22). They examine the relationship between the occurrence of convictions by the end of the study and the family income of each boy when growing up. Three categories described income level: inadequate, adequate, and comfortable.

Income level   No convictions   Convicted
Inadequate          47              43
Adequate           128              57
Comfortable         90              30

a. What type of table is this?
b. Display these same data in a mosaic plot.
c. What type of variable is "income level"? How should this affect the arrangement of groups in your mosaic plot in part (b)?
d. By viewing the table above and the graph in part (b), describe any apparent association between family income and later convictions.
e. In answering part (d), which method (the table or the graph) better revealed the association between conviction status and income level? Explain.

13. Each of the following graphs illustrates an association between two variables. For each graph, identify (1) the type of graph, (2) the explanatory and response variables, and (3) the type of data (whether numerical or categorical) for each variable.
a. Taste sensitivity to phenylthiocarbamide (PTC) in a sample of human subjects grouped according to their genotype at the PTC gene—namely, AA, Aa, or aa (Kim et al. 2003):
b. Migratory activity (hours of nighttime restlessness) of young captive blackcaps (Sylvia atricapilla) compared with the migratory activity of their parents (Berthold and Pulido 1994):
c. Sizes of the second appendage (middle leg) of embryos of water striders, in 10 control embryos and 10 embryos dosed with RNAi for the developmental gene Ultrabithorax (Khila et al. 2009).
d. The frequency of injection-heroin users that share or do not share needles, according to their known HIV infection status (Wood et al. 2001):

14. Spot the flaw. In an experimental study of gender and wages, Moss-Racusin et al. (2012) presented professors from research-intensive universities each with a job application for a laboratory manager position.
The application was randomly assigned a male or female name, and the professors were asked to state the starting salary they would offer the candidate if hired. The average starting salary reported is compared in the following figure between applications with male names and female names. (The vertical lines at the top edge of each bar are "standard error bars"; we'll learn about them in Chapter 4.)
a. Identify at least two of the four principles of good graph design that are violated.
b. What alternative graph type is ideal for these data?
c. Identify the main pattern in the data (interestingly, this pattern was similar when male professors and female professors were examined separately).

15. How do the insects that pollinate flowers distinguish individual flowers with nectar from empty flowers? One possibility is that they can detect the slightly higher humidity of the air—produced by evaporation—in flowers that contain nectar. von Arx et al. (2012) recently tested this idea by manipulating the humidity of air emitted from artificial flowers that were otherwise identical. The following graph summarizes the number of visits to the two types of flowers by hawk moths (Hyles lineata).
a. What type of graph is this?
b. What does the horizontal line in the center of each rectangle represent?
c. What do the top and bottom edges of each rectangle represent?
d. What are the vertical lines extending above and below each rectangle?
e. Is an association apparent between the variables plotted? Explain.

16. For each of the graphs shown below, based on hypothetical data, identify the type of graph and say whether or not the two variables exhibit an association. Explain your answer in each case.

FIGURE FOR PROBLEM 16

17. "Animal personality" has been defined as the presence of consistent differences between individuals in behaviors that persist over time. Do sea anemones have it? To investigate, Briffa and Greenaway (2011) measured the consistency of the startle response of individuals of wild beadlet anemones, Actinia equina, in tide pools in the U.K. When disturbed, such as with a mild jet of water (the method used in this study), the anemones retract their feeding tentacles to cover the oral disc, opening them again some time later. The accompanying table records the duration of the startle response (time to reopen, in seconds) of 12 individual anemones. Each anemone was measured twice, 14 days apart.

TABLE FOR PROBLEM 17

Anemone:       1     2    3    4    5    6    7    8    9    10   11   12
Occasion one   1065  248  436  350  378  410  232  201  267  687  688  980
Occasion two   939   268  460  261  368  467  303  188  401  690  711  571

a. Choose the best method, and make a graph to show the association between the first and second measurements of the startle response.
b. Is a strong association present? In other words, does the beadlet anemone have animal personality?

18. Refer to the previous question.
a. Draw a frequency distribution of the startle durations measured on the first occasion.
b. Describe the shape of the frequency distribution: is it skewed or symmetric? If skewed, say whether the skew is positive or negative.

ASSIGNMENT PROBLEMS

19. Male fireflies of the species Photinus ignitus attract females with pulses of light. Flashes of longer duration seem to attract the most females. During mating, the male transfers a spermatophore to the female. Besides containing sperm, the spermatophore is rich in protein that is distributed by the female to her fertilized eggs.
The data below are measurements of spermatophore mass (in mg) of 35 males (Cratsley and Lewis 2003).

0.047, 0.037, 0.041, 0.045, 0.039, 0.064, 0.064, 0.065, 0.079, 0.070, 0.066, 0.059, 0.075, 0.079, 0.090, 0.069, 0.066, 0.078, 0.066, 0.066, 0.055, 0.046, 0.056, 0.067, 0.075, 0.048, 0.077, 0.081, 0.066, 0.172, 0.080, 0.078, 0.048, 0.096, 0.097

a. Create a graph depicting the frequency distribution of the 35 mass measurements.
b. What type of graph did you choose in part (a)? Why?
c. Describe the shape of the frequency distribution. What are its main features?
d. What term would be used to describe the largest measurement in the frequency distribution?

20. The accompanying graph depicts the frequency distribution of beak widths of 1017 black-bellied seedcrackers, Pyrenestes ostrinus, a finch from West Africa (Smith 1993).
a. What is the mode of the frequency distribution?
b. Estimate by eye the fraction of birds whose measurements are in the interval representing the mode.
c. There is a hint of a second peak in the frequency distribution between 15 and 16 mm. What strategy would you recommend to explore more fully the possibility of a second peak?
d. What name is given to a frequency distribution having two distinct peaks?

21. When its numbers increase following favorable environmental conditions, the desert locust, Schistocerca gregaria, undergoes a dramatic transformation from a solitary, cryptic form into a gregarious form that swarms by the billions. The transition is triggered by mechanical stimulation—locusts bumping into one another. The accompanying figure shows the results of a laboratory study investigating the degree of gregariousness resulting from mechanical stimulation of different parts of the body (modified from Simpson et al. 2001).
a. Identify the type of graph displayed.
b. Identify the explanatory and response variables.

22. The Cambridge Study in Delinquent Development was undertaken in north London (U.K.) to investigate the links between criminal behavior in young men and the socioeconomic factors of their upbringing (Farrington 1994). A cohort of 395 boys was followed for about 20 years, starting at the age of 8 or 9. All of the boys attended six schools located near the research office. The following table shows the total number of criminal convictions by the boys between the start and end of the study.

Number of convictions   Frequency
0                       265
1                       49
2                       21
3                       19
4                       10
5                       10
6                       2
7                       2
8                       4
9                       2
10                      1
11                      4
12                      3
13                      1
14                      2
Total                   395

a. What type of table is this?
b. How many variables are presented in this table?
c. How many boys had exactly two convictions by the end of the study?
d. What fraction of boys had no convictions?
e. Display the frequency distribution in a graph. Which type of graph is most appropriate? Why?
f. Describe the shape of the frequency distribution. Is it skewed or is it symmetric? Is it unimodal or bimodal? Where is the mode in the number of criminal convictions? Are there outliers in the number of convictions?
g. Does the sample of boys used in this study represent a random sample of British boys? Why or why not?

23. Swordfish have a unique "heater organ" that maintains elevated eye and brain temperatures when hunting in deep, cold water. The following graph illustrates the results of a study by Fritsches et al. (2005) that measured how the ability of swordfish retinas to detect rapid motion, measured by the flicker fusion frequency, changes with eye temperature.
a. What types of variables are displayed?
b. What type of graph is this?
c. Describe the association between the two variables. Is the relationship between flicker fusion frequency and temperature positive or negative? Is the relationship linear or nonlinear?
d. The 20 points in the graph were obtained from measurements of six swordfish. Can we treat the 20 measurements as a random sample? Why or why not?

24. The following graph displays the net number of species listed under the U.S. Endangered Species Act between 1980 and 2002 (U.S. Fish and Wildlife Service 2001):
a. What type of graph is this?
b. What does the steepness of each line segment indicate?
c. Explain what the graph tells us about the relationship between the number of species listed and time.

25. Spot the flaw. Examine the following figure, which displays the frequency distribution of similarity values (the percentage of amino acids that are the same) between equivalent (homologous) proteins in humans and pufferfish of the genus Fugu (modified from Aparicio et al. 2002).
a. What type of graph is this?
b. Identify the main flaw in the construction of this figure.
c. What are the main results displayed in the figure?
d. Describe the shape of the frequency distribution shown.
e. What is the mode of the frequency distribution?

26. The following data give the photosynthetic capacity of nine individual females of the neotropical tree Ocotea tenera, according to the number of fruits produced in the previous reproductive season (Wheelwright and Logan 2004). The goal of the study was to investigate how reproductive effort in females of these trees impacts subsequent growth and photosynthesis.

Number of fruits       Photosynthetic capacity
produced previously    (µmol O2/m2/s)
10                     13.0
14                     11.9
5                      11.5
24                     10.6
50                     11.1
37                     9.4
89                     9.3
162                    9.1
149                    7.3

a. Graph the association between these two variables using the most appropriate method. Identify the type of graph you used.
b. Which variable is the explanatory variable in your graph? Why?
c. Describe the association between the two variables in words, as revealed by your graph.

27. Examine the accompanying figure, which displays the percentage of adults over 18 with a "body mass index" greater than 25 in different years (modified from The Economist 2005, with permission). Body mass index is a measure of weight relative to height.
a. What is the main result displayed in this figure?
b. Which of the four principles for drawing good graphs are violated here? How are they violated?
c. Redraw the figure using the most appropriate method discussed in this chapter. What type of graph did you use?

28. When a courting male of the small Indonesian fish Telmatherina sarasinorum spawns with a female, other males sometimes sneak in and release sperm, too. The result is that not all of the female's eggs are fertilized by the courting male. Gray et al. (2007) noticed that courting males occasionally cannibalize fertilized eggs immediately after spawning. Egg eating took place by 61 of 450 courting males who fathered the entire batch; the remaining 389 males did not cannibalize eggs. In contrast, 18 of 35 courting males ate eggs when a single sneaking male also participated in the spawning event. Finally, 16 of 20 males ate eggs when two or more sneaking males were present.
a. Display these results in a table that best shows the association between cannibalism and the number of sneaking males. Identify the type of table you used.
b. Illustrate the same results using a graphical technique instead. Identify the type of graph you used.
29. The graph at the top of page 62, in red, shows the number of new cases of influenza in New York, Pennsylvania, and New Jersey, according to data from the Centers for Disease Control (Ginsberg et al. 2009). The black line shows predictions based on the number of Google searches of words like "flu" or "influenza."
a. What type of graph is this?
b. Describe some of the scientific conclusions you might draw from looking at this graph.
c. Can you suggest an improvement to the axis labels?

FIGURE FOR PROBLEM 29 Source: Jeremy Ginsberg et al., "Detecting Influenza Epidemics Using Search Engine Query Data," Nature 457 (2009): 1012–1014.

30. The following graph was drawn using a very popular spreadsheet program in an attempt to show the frequencies of observations in four hypothetical groups. Before reading further, estimate by eye the frequencies in each of the four groups.
a. Identify two features of this graph that cause it to violate the principle "Make patterns in the data easy to see."
b. Identify at least two other features of the graph that make it difficult to interpret.
c. The actual frequencies are 10, 20, 30, and 40. Draw a graph that overcomes the problems identified above.

31. In Poland, students are required to achieve a score of 21 or higher on the high-school Polish language "maturity exam" to be eligible for university. The following graph shows the frequency distribution of scores (Freakonomics 2011).
a. Examine the graph and identify the most conspicuous pattern in these data.
b. Generate a hypothesis to explain the pattern.

32. More than 10% of people carry the parasite Toxoplasma gondii. The following table gives data from Prague on 15- to 29-year-old drivers who had been involved in an accident. The table gives the number of drivers who were infected with Toxoplasma gondii and who were uninfected. These numbers are compared with a control sample of 249 drivers of the same age living in the same area who had not been in an accident.

             Drivers with accidents   Controls
Infected              21                 38
Uninfected            38                211

a. What type of table is this?
b. What are the two variables being compared? Which is the explanatory variable and which is the response?
c. Depict the data in a graph. Use the results to answer the question: are the two variables associated in this data set?

33. The cutoff birth date for school entry in British Columbia, Canada, is December 31. As a result, children born in December tend to be the youngest in their grade, whereas those born in January tend to be the oldest. Morrow et al. (2012) examined how this relative age difference influenced the diagnosis and treatment of attention deficit/hyperactivity disorder (ADHD). A total of 39,136 boys aged 6 to 12 years and registered in school in 1997–1998 had January birth dates. Of these, 2219 were diagnosed with ADHD in that year. A total of 38,977 boys had December birth dates, of which 2870 were diagnosed with ADHD in that year. Display the association between birth month and ADHD diagnosis using a table or graphical method from this chapter. Is there an association?

34. Examine the following figure, which displays hypothetical measurements of a sample of individuals from several groups.
a. What type of graph is this?
b. In which of the groups is the frequency distribution of measurements approximately symmetric?
c. Which of the frequency distributions show positive skew?
d. Which of the frequency distributions show negative skew?
e. Which group has the largest value for the upper quartile?
f. Which group has the smallest value for the median?
g. Which group has the most extreme observation?

35. The following data are from Mattison et al. (2012), who carried out an experiment with rhesus monkeys to test whether a reduction in food intake extends life span (as measured in years). The data are the life spans of 19 male and 15 female monkeys who were randomly assigned a normal nutritious diet or a similar diet reduced in amount by 30%. All monkeys were adults at the start of the study.
Females—reduced: 16.5, 18.9, 22.6, 27.8, 30.2, 30.7, 35.9
Females—control: 23.7, 24.5, 24.7, 26.1, 28.1, 33.4, 33.7, 35.2
Males—reduced: 23.7, 28.1, 29.8, 31.1, 36.3, 37.7, 39.9, 39.9, 40.2, 40.2
Males—control: 24.9, 25.2, 29.6, 33.2, 34.1, 35.4, 38.1, 38.8, 40.7
a. Graph the results, using the most appropriate method and following the four principles of good graph design.
b. According to your graph, which difference in life span is greater: that between the sexes, or that between diet groups?

36. The accompanying graph indicates the amount of time (latency) that female subjects were willing to leave their hand in icy water while they were swearing ("words you might use after hitting yourself on the thumb with a hammer") or while not swearing, using other words instead ("words to describe a table"). The data are from Stephens et al. (2009).
a. Identify the type of graph.
b. Is any association apparent between the variables? Explain.
c. What do the "whiskers" indicate in this graph?
d. List two other types of graphs that would also be appropriate for showing these results.

37. Following is a list of all the named hurricanes in the Atlantic between 2001 and 2010, along with their category on the Saffir-Simpson Hurricane Scale, which categorizes each hurricane by a label from 1 to 5 depending on its power.
2001: Erin, 3; Felix, 3; Gabrielle, 1; Humberto, 2; Iris, 4; Karen, 1; Michelle, 4; Noel, 1; Olga, 1.
2002: Gustav, 2; Isidore, 3; Kyle, 1; Lili, 4.
2003: Claudette, 1; Danny, 1; Erika, 1; Fabian, 4; Isabel, 5; Juan, 2; Kate, 3.
2004: Alex, 3; Charley, 4; Danielle, 2; Frances, 4; Gaston, 1; Ivan, 5; Jeanne, 3; Karl, 4; Lisa, 1.
2005: Cindy, 1; Dennis, 4; Emily, 5; Irene, 2; Katrina, 5; Maria, 3; Nate, 1; Ophelia, 1; Philippe, 1; Rita, 5; Stan, 1; Vince, 1; Wilma, 5; Beta, 3; Epsilon, 1.
2006: Ernesto, 1; Florence, 1; Gordon, 3; Helene, 3; Isaac, 1.
2007: Dean, 5; Felix, 5; Humberto, 1; Karen, 1; Lorenzo, 1; Noel, 1.
2008: Bertha, 3; Dolly, 2; Gustav, 4; Hanna, 1; Ike, 4; Kyle, 1; Omar, 4; Paloma, 4.
2009: Bill, 4; Fred, 3; Ida, 2.
2010: Alex, 2; Danielle, 4; Earl, 4; Igor, 4; Julia, 4; Karl, 3; Lisa, 1; Otto, 1; Paula, 2; Richard, 2; Shary, 1; Tomas, 2.
a. Make a frequency table showing the frequency of hurricanes in each severity category during the decade.
b. Make a frequency table that shows the frequency of hurricanes in each year.
c. Explain how you chose to order the categories in your tables.

3 Describing data

[Chapter-opening photograph: a saiga antelope]

Descriptive statistics, or summary statistics, are quantities that capture important features of frequency distributions. Whereas graphs reveal shapes and patterns in the data, descriptive statistics provide hard numbers. The most important descriptive statistics for numerical data are those measuring the location of a frequency distribution and its spread. The location tells us something about the average or typical individual—where the observations are centered.
The spread tells us how variable the measurements are from individual to individual—how widely scattered the observations are around the center. The proportion is the most important descriptive statistic for a categorical variable, measuring the fraction of observations in a given category.

The importance of calculating the location of a distribution seems obvious. How else do we address questions like "Which species is larger?" or "Which drug yielded the greatest response?" The importance of describing distribution spread is less obvious but no less crucial, at least in biology. In some fields of science, variability around a central value is instrument noise or measurement error, but in biology much of the variability signifies real differences among individuals. Different individuals respond differently to treatments, and this variability begs measurement. Measuring variability also gives us perspective. We can ask, "How large are the differences between groups compared with variations within groups?" Biologists also appreciate variation as the stuff of evolution—we wouldn't be here without variation.

In this chapter, we review the most common statistics to measure the location and spread of a frequency distribution and to calculate a proportion. We introduce the use of mathematical symbols to represent values of a variable, and we show formulas to calculate each summary statistic.

3.1 Arithmetic mean and standard deviation

The arithmetic mean is the most common metric to describe the location of a frequency distribution. It is the average of a set of measurements. The standard deviation is the most commonly used measure of distribution spread. Example 3.1 illustrates the basic calculations for means and standard deviations.

EXAMPLE 3.1 Gliding snakes

When a paradise tree snake (Chrysopelea paradisi) flings itself from a treetop, it flattens its body everywhere except for the region around the heart. As it gains downward speed, the snake forms a tight horizontal S shape and then begins to undulate widely from side to side. This generates lift, causing the snake to glide away from the source tree. By orienting the head and anterior part of the body, the snake can change direction during a glide to avoid trees, reach a preferred landing site, and even chase aerial prey. To better understand how lift is generated, Socha (2002) videotaped the glides of eight snakes leaping from a 10-m tower. Among the measurements taken was the rate of side-to-side undulation of each snake. Undulation rates of the eight snakes, measured in hertz (cycles per second), were as follows:

0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6

A histogram of these data is shown in Figure 3.1-1. The frequency distribution has a single peak between 1.2 and 1.4 Hz.

FIGURE 3.1-1 A histogram of the undulation rate of gliding paradise tree snakes. n = 8 snakes.

The sample mean

The sample mean is the average of the measurements in the sample, the sum of all the observations divided by the number of observations. To show its calculation, we use the symbol $Y$ to refer to the variable and $Y_i$ to represent the measurement of individual $i$. For the gliding snake data, $i$ takes on values between 1 and 8, because there are eight snakes. Thus, $Y_1 = 0.9$, $Y_2 = 1.4$, $Y_3 = 1.2$, $Y_4 = 1.2$, and so on.

The sample mean is the sum of all the observations in a sample divided by $n$, the number of observations.

The sample mean, symbolized as $\bar{Y}$ (and pronounced "Y-bar"), is calculated as
$$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n},$$
where $n$ is the number of observations.
The symbol $\Sigma$ (uppercase Greek letter sigma) indicates a sum. The "$i = 1$" under the $\Sigma$ and the "$n$" over it indicate that we are summing over all values of $i$ between 1 and $n$, inclusive:
$$\sum_{i=1}^{n} Y_i = Y_1 + Y_2 + Y_3 + \ldots + Y_n.$$
When it is clear that $i$ refers to individuals $1, 2, 3, \ldots, n$, the formula is often written more succinctly as
$$\bar{Y} = \frac{\sum Y_i}{n}.$$
Applying this formula to the snake data yields the mean undulation rate:
$$\bar{Y} = \frac{0.9 + 1.2 + 1.2 + 2.0 + 1.6 + 1.3 + 1.4 + 1.4}{8} = 1.375 \text{ Hz}.$$
Based on the histogram in Figure 3.1-1, we see that the value of the sample mean is close to the middle of the distribution. Note that the sample mean has the same units as the observations used to calculate it. Later in this section, we review how the sample mean is affected when the units of the observations are changed, such as by adding a constant or multiplying by a constant.

Variance and standard deviation

The standard deviation is a commonly used measure of the spread of a distribution. It measures how far from the mean the observations typically are. The standard deviation is large if most observations are far from the mean, and it is small if most measurements lie close to the mean.

The standard deviation is a common measure of the spread of a distribution. It indicates just how different measurements typically are from the mean.

The standard deviation is calculated from the variance, another measure of spread. The standard deviation is simply the square root of the variance. The standard deviation is a more intuitive measure of the spread of a distribution (in part because it has the same units as the variable itself), but the variance has mathematical properties that make it useful sometimes as well. The standard deviation from a sample is usually represented by the symbol $s$, and the sample variance is written as $s^2$.

To calculate the variance from a sample of data, we must first compute the deviations. A deviation from the mean is the difference between a measurement and the mean, $(Y_i - \bar{Y})$. Deviations for the measurements of snake undulation rate are listed in Table 3.1-1.

TABLE 3.1-1 Quantities needed to calculate the standard deviation and variance of snake undulation rate ($\bar{Y} = 1.375$ Hz).

Observations    Deviations    Squared deviations
(Yᵢ)            (Yᵢ − Ȳ)      (Yᵢ − Ȳ)²
0.9             −0.475        0.225625
1.2             −0.175        0.030625
1.2             −0.175        0.030625
1.3             −0.075        0.005625
1.4              0.025        0.000625
1.4              0.025        0.000625
1.6              0.225        0.050625
2.0              0.625        0.390625
Sum              0.000        0.735

The best measure of the spread of this distribution isn't just the average of the deviations $(Y_i - \bar{Y})$, because this average is always zero (the negative deviations cancel the positive deviations). Instead, we need to average the squared deviations (the third column in Table 3.1-1) to find the variance:
$$s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}.$$
By squaring each number, deviations above and below the mean contribute equally to the variance. The summation in the numerator (top part) of the formula, $\sum (Y_i - \bar{Y})^2$, is called the sum of squares of $Y$. Note that the denominator (bottom part) is $n - 1$ instead of $n$, the total number of observations. Dividing by $n - 1$ gives a more accurate estimate of the population variance. We provide a shortcut formula for the variance in the Quick Formula Summary (Section 3.7).

For the snake undulation data, the variance (rounded to hundredths) is
$$s^2 = \frac{0.735}{7} = 0.11 \text{ Hz}^2.$$
The variance has units equal to the square of the units of the original data. To obtain the standard deviation, we take the square root of the variance:
$$s = \sqrt{\frac{\sum (Y_i - \bar{Y})^2}{n - 1}}.$$
For the snake undulation data,
$$s = \sqrt{\frac{0.735}{7}} = 0.324037 \text{ Hz}.$$
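These calculations are easy to reproduce by computer. The following short Python sketch (ours, for illustration only) computes the mean, sum of squares, variance, and standard deviation of the snake undulation data from first principles:

# Mean, variance, and standard deviation of the snake undulation data
# (Example 3.1), computed from first principles.
undulation = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]  # Hz

n = len(undulation)
mean = sum(undulation) / n                     # Y-bar = 1.375 Hz
ss = sum((y - mean) ** 2 for y in undulation)  # sum of squares = 0.735
variance = ss / (n - 1)                        # s^2 = 0.105 Hz^2
sd = variance ** 0.5                           # s = 0.3240 Hz

print(mean, variance, sd)

Running it prints 1.375, 0.105, and 0.3240..., matching the results above (the variance 0.105 rounds to 0.11 Hz²).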
The standard deviation is never negative and has the same units as the observations from which it was calculated. The standard deviation has a straightforward connection to the frequency distribution. If the frequency distribution is bell shaped, like the example in Figure 2.2-4, then about two-thirds of the observations will lie within one standard deviation of the mean, and about 95% will lie within two standard deviations. In other words, about 67% of the data will fall between $\bar{Y} - s$ and $\bar{Y} + s$, and about 95% will fall between $\bar{Y} - 2s$ and $\bar{Y} + 2s$. For an in-depth discussion of the standard deviation, see Chapter 10.

This straightforward connection between the standard deviation and the frequency distribution diminishes when the frequency distribution deviates from the bell-shaped (normal) distribution. In such cases, the standard deviation is less informative about where the data lie in relation to the mean. This point is explored in greater detail in Section 3.3.

Rounding means, standard deviations, and other quantities

To avoid rounding errors when carrying out calculations of means, standard deviations, and other descriptive statistics, always retain as many significant digits as your calculator or computer can provide. Intermediate results written down on a page should also retain as many digits as feasible. Final results, however, should be rounded before being presented.

There are no strict rules on the number of significant digits that should be retained when rounding. A common strategy, which we adopt here, is to round descriptive statistics to one decimal place more than the measurements themselves. For example, the undulation rates in snakes were measured to a single decimal place (tenths). We therefore present descriptive statistics with two decimals (hundredths). The mean rate of undulation for the eight snakes, calculated as 1.375 Hz, would be communicated as $\bar{Y} = 1.38$ Hz. Similarly, the standard deviation, calculated as 0.324037 Hz, would be reported as $s = 0.32$ Hz. Note that even though we report the rounded value of the mean as $\bar{Y} = 1.38$, we used the more exact value, $\bar{Y} = 1.375$, in the calculation of $s$ to avoid rounding errors.

Coefficient of variation

For many traits, standard deviation and mean change together when organisms of different sizes are compared. Elephants have greater mass than mice and also more variability in mass. For many purposes, we care more about the relative variation among individuals. A gain of 10 g for an elephant is inconsequential, but it would double the mass of a mouse. On the other hand, an elephant that is 10% larger than the elephant mean may have something in common with a mouse that is 10% larger than the mouse mean. For these reasons, it is sometimes useful to express the standard deviation relative to the mean. The coefficient of variation (CV) calculates the standard deviation as a percentage of the mean:
$$\mathrm{CV} = \frac{s}{\bar{Y}} \times 100\%.$$

The coefficient of variation is the standard deviation expressed as a percentage of the mean.

A higher CV means that there is more variability, whereas a lower CV means that individuals are more consistently the same. For the snake undulation data, the coefficient of variation is
$$\mathrm{CV} = \frac{0.324}{1.375} \times 100\% = 24\%.$$
The coefficient of variation makes sense only when all of the measurements are greater than or equal to zero.
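A small Python sketch (again ours, for illustration) wraps this calculation in a reusable function:

# Coefficient of variation: the standard deviation expressed as a
# percentage of the mean.
def cv(values):
    n = len(values)
    mean = sum(values) / n
    sd = (sum((y - mean) ** 2 for y in values) / (n - 1)) ** 0.5
    return sd / mean * 100

undulation = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]  # Hz
print(cv(undulation))  # about 23.6, reported as 24%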
The coefficient of variation can also be used to compare the variability of traits that do not have the same units. If we wanted to ask, "What is more variable in elephants, body mass or life span?" then the standard deviation is not very informative, because mass is measured in kilograms and life span is measured in years. The coefficient of variation would allow us to make this comparison.

Calculating mean and standard deviation from a frequency table

Sometimes the data include many tied observations and are given in a frequency table. The frequency table in Table 3.1-2, for example, lists the number of criminal convictions of a cohort of 395 boys (Farrington 1994; see Assignment Problem 22 in Chapter 2).

TABLE 3.1-2 Number of criminal convictions of a cohort of 395 boys.

Number of convictions    Frequency
 0                       265
 1                        49
 2                        21
 3                        19
 4                        10
 5                        10
 6                         2
 7                         2
 8                         4
 9                         2
10                         1
11                         4
12                         3
13                         1
14                         2
Total                    395

To calculate the mean and standard deviation of the number of convictions, notice first that the sample size is not 15, the number of rows in Table 3.1-2, but 395, the frequency total:
$$n = 265 + 49 + 21 + 19 + \ldots + 2 = 395.$$
Calculating the mean thus requires that the measurement "0" be represented 265 times, the number "1" be represented 49 times, and so on. The sum of the measurements is thus
$$\sum Y_i = (265 \times 0) + (49 \times 1) + (21 \times 2) + (19 \times 3) + \ldots + (2 \times 14) = 445.$$
The mean of these data is then
$$\bar{Y} = \frac{445}{395} = 1.126582,$$
which we round to $\bar{Y} = 1.1$ when presenting the results.

The calculation of the standard deviation must also take into account the number of individuals with each value. The sum of the squared deviations is
$$\sum (Y_i - \bar{Y})^2 = 265(0 - \bar{Y})^2 + 49(1 - \bar{Y})^2 + 21(2 - \bar{Y})^2 + \ldots + 2(14 - \bar{Y})^2 = 2377.671.$$
The standard deviation for these data is therefore
$$s = \sqrt{\frac{2377.671}{395 - 1}} = 2.4566,$$
which we present as $s = 2.5$.

These calculations assume that all the data are presented in the table. This approach would not work, however, for frequency tables in which the data are grouped into intervals, such as Table 2.2-3.
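The weighting by frequency is natural to express in code. Here is an illustrative Python sketch of the same calculation for Table 3.1-2:

# Mean and standard deviation from a frequency table (Table 3.1-2):
# each value is weighted by its frequency.
convictions = {0: 265, 1: 49, 2: 21, 3: 19, 4: 10, 5: 10, 6: 2, 7: 2,
               8: 4, 9: 2, 10: 1, 11: 4, 12: 3, 13: 1, 14: 2}

n = sum(convictions.values())                                  # 395
mean = sum(y * f for y, f in convictions.items()) / n          # 1.126582
ss = sum(f * (y - mean) ** 2 for y, f in convictions.items())  # 2377.671
sd = (ss / (n - 1)) ** 0.5                                     # 2.4566

print(n, mean, sd)  # presented rounded as 1.1 and 2.5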
Effect of changing measurement scale

Results may need to be converted to a different scale than the one in which they were originally measured. For example, if temperature measurements were made in °F, it may be necessary to convert results to °C. The snake data were measured in hertz (cycles per second), but in some cases hertz must be converted to angular velocity (radians per second) instead. The good news is that we don't need to start over by converting the raw data. Instead, we can convert the descriptive statistics directly, as follows.

Briefly, here are the rules (we summarize them in the Quick Formula Summary at the end of this chapter). If converting data to a new scale, $Y'$, involves multiplying the data, $Y$, by a constant, $c$,
$$Y' = cY,$$
then multiply the original mean $\bar{Y}$ by the same constant to obtain the new mean, and multiply the original standard deviation $s$ by the absolute value of $c$ to get the new standard deviation:
$$\bar{Y}' = c\bar{Y}, \qquad s' = |c|\,s.$$
However, the variance $s^2$ is converted by multiplying by $c^2$:
$$s'^2 = c^2 s^2.$$
If converting data to a new scale, $Y'$, involves adding a constant, $c$, then the mean is converted by adding the same constant,
$$\bar{Y}' = \bar{Y} + c,$$
whereas the standard deviation and variance are unchanged:
$$s' = s, \qquad s'^2 = s^2.$$
This makes sense. Adding a constant to the data changes the location of the frequency distribution by the same amount but does not alter its spread.

For example, converting degrees Fahrenheit to degrees Celsius uses the transformation
$$^\circ\mathrm{C} = (5/9)\,^\circ\mathrm{F} - 17.8.$$
Therefore, if the mean temperature in a data set is $\bar{Y} = 80°\mathrm{F}$, with a standard deviation of $s = 3°\mathrm{F}$, then the new mean temperature is $\bar{Y}' = (5/9)(80) - 17.8 = 26.6°\mathrm{C}$ and the new standard deviation is $s' = (5/9)(3) = 1.7°\mathrm{C}$. The new variance is $s'^2 = (5/9)^2(3)^2 = 2.8°\mathrm{C}^2$.
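These rules are easy to verify numerically. The sketch below uses three made-up Fahrenheit measurements (not data from the text) and confirms that transforming the raw data gives the same mean and standard deviation as transforming the summary statistics directly:

# Checking the change-of-scale rules: convert a hypothetical sample
# from Fahrenheit to Celsius, then compare with transforming the
# summary statistics themselves.
fahrenheit = [77, 80, 83]  # made-up measurements with mean 80, sd 3

def mean_sd(values):
    n = len(values)
    m = sum(values) / n
    s = (sum((y - m) ** 2 for y in values) / (n - 1)) ** 0.5
    return m, s

celsius = [(5 / 9) * (f - 32) for f in fahrenheit]

m_f, s_f = mean_sd(fahrenheit)       # 80.0, 3.0
m_c, s_c = mean_sd(celsius)

# The rules predict: new mean = (5/9)*mean - 17.8 (more exactly, 160/9),
# and new sd = (5/9)*sd.
print(m_c, (5 / 9) * m_f - 160 / 9)  # both about 26.7
print(s_c, (5 / 9) * s_f)            # both about 1.7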
3.2 Median and interquartile range

After the sample mean, the median is the next most common metric used to describe the location of a frequency distribution. As we showed in Chapter 2, the median is often displayed in a box plot alongside the span between the first and third quartiles, or interquartile range, another measure of the spread of the distribution. We define and demonstrate these concepts with the help of Example 3.2.

EXAMPLE 3.2 I'd give my right arm for a female

Male spiders in the genus Tidarren are tiny, weighing only about 1% as much as females. They also have disproportionately large pedipalps, copulatory organs that make up about 10% of a male's mass. (See the adjacent photo; the pedipalps are indicated by arrows.) Males load the pedipalps with sperm and then search for females to inseminate. Astonishingly, male Tidarren spiders voluntarily amputate one of their two organs, right or left, just before sexual maturity. Why do they do this? Perhaps speed is important to males searching for females, and amputation increases running performance. To test this hypothesis, Ramos et al. (2004) used video to measure the running speed of males on strands of spider silk. The data are presented in Table 3.2-1.

TABLE 3.2-1 Running speed (cm/s) of male Tidarren spiders before and after voluntary amputation of a pedipalp.

Spider    Speed before    Speed after
 1        1.25            2.40
 2        2.94            3.50
 3        2.38            4.49
 4        3.09            3.17
 5        3.41            5.26
 6        3.00            3.22
 7        2.31            2.32
 8        2.93            3.31
 9        2.98            3.70
10        3.55            4.70
11        2.84            4.94
12        1.64            5.06
13        3.22            3.22
14        2.87            3.52
15        2.37            5.45
16        1.91            3.40

The median

The median is the middle observation in a set of data, the measurement that partitions the ordered measurements into two halves. To calculate the median, first sort the sample observations from smallest to largest. The sorted measurements of running speed of male spiders before amputation (Table 3.2-1) are

1.25, 1.64, 1.91, 2.31, 2.37, 2.38, 2.84, 2.87, 2.93, 2.94, 2.98, 3.00, 3.09, 3.22, 3.41, 3.55

in cm/s. Let $Y_{(i)}$ refer to the $i$th sorted observation, so $Y_{(1)}$ is 1.25, $Y_{(2)}$ is 1.64, $Y_{(3)}$ is 1.91, and so on.

If the number of observations ($n$) is odd, then the median is the middle observation:
$$\text{Median} = Y_{([n+1]/2)}.$$

The median is the middle measurement of a set of observations.

If the number of observations is even, as in the spider data, then the median is the average of the middle pair:
$$\text{Median} = [Y_{(n/2)} + Y_{(n/2+1)}]/2.$$
Thus, $n/2 = 8$, $Y_{(8)} = 2.87$, and $Y_{(9)} = 2.93$ for the spider data (before amputation). The median is the average of these two numbers:
$$\text{Median} = (2.87 + 2.93)/2 = 2.90 \text{ cm/s}.$$

The interquartile range

Quartiles are values that partition the data into quarters. The first quartile is the middle value of the measurements lying below the median. The second quartile is the median. The third quartile is the middle value of the measurements larger than the median. The interquartile range (IQR) is the span of the middle half of the data, from the first quartile to the third quartile:
$$\text{Interquartile range} = \text{third quartile} - \text{first quartile}.$$

The interquartile range is the difference between the third and first quartiles of the data. It is the span of the middle 50% of the data.

Figure 3.2-1 shows the meaning of the median, first quartile, third quartile, and interquartile range for the spider data set (before amputation).

FIGURE 3.2-1 The first quartile, median, and third quartile break the data set into four equal portions. The median is the middle value, and the first and third quartiles are the middles of the first and second halves of the data. The interquartile range is the span of the middle half of the data.

The first step in calculating the interquartile range is to compute the first and third quartiles, as follows. For the first quartile, calculate
$$j = 0.25\,n,$$
where $n$ is the number of observations. If $j$ is an integer, then the first quartile is the average of $Y_{(j)}$ and $Y_{(j+1)}$:
$$\text{First quartile} = (Y_{(j)} + Y_{(j+1)})/2,$$
where $Y_{(j)}$ is the $j$th sorted observation. For the sorted spider data, $j = (0.25)(16) = 4$, which is an integer. Therefore, the first quartile is the average of $Y_{(4)}$ and $Y_{(5)}$:
$$\text{First quartile} = (2.31 + 2.37)/2 = 2.34.$$
If $j$ is not an integer, then convert $j$ to an integer by replacing it with the next integer that exceeds it (i.e., round $j$ up to the nearest integer). The first quartile is then
$$\text{First quartile} = Y_{(j)},$$
where $j$ is now the integer you rounded to.

The third quartile is computed similarly. Calculate
$$k = 0.75\,n.$$
If $k$ is an integer, then the third quartile is the average of $Y_{(k)}$ and $Y_{(k+1)}$:
$$\text{Third quartile} = (Y_{(k)} + Y_{(k+1)})/2,$$
where $Y_{(k)}$ is the $k$th sorted observation. For the sorted spider data, $k = (0.75)(16) = 12$, which is an integer. Therefore, the third quartile is the average of $Y_{(12)}$ and $Y_{(13)}$:
$$\text{Third quartile} = (3.00 + 3.09)/2 = 3.045.$$
If $k$ is not an integer, then convert $k$ to an integer by replacing it with the next integer that exceeds it (i.e., round $k$ up to the nearest integer). The third quartile is then
$$\text{Third quartile} = Y_{(k)},$$
where $k$ is the integer you rounded to.

The interquartile range is then
$$\text{Interquartile range} = 3.045 - 2.34 = 0.705 \text{ cm/s}.$$
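The whole procedure translates directly into code. The Python sketch below (ours) implements the quartile rule described above and reproduces the spider values; note that statistical software uses several slightly different quartile rules, so built-in functions may not match these numbers exactly:

# Median and quartiles computed with the rule described in the text.
import math

def median(values):
    y = sorted(values)
    n = len(y)
    if n % 2 == 1:
        return y[(n + 1) // 2 - 1]           # middle observation
    return (y[n // 2 - 1] + y[n // 2]) / 2   # average of middle pair

def quartile(values, q):                     # q = 0.25 or 0.75
    y = sorted(values)
    j = q * len(y)
    if j == int(j):                          # j is an integer
        j = int(j)
        return (y[j - 1] + y[j]) / 2         # average of Y(j) and Y(j+1)
    return y[math.ceil(j) - 1]               # otherwise round j up

speed_before = [1.25, 2.94, 2.38, 3.09, 3.41, 3.00, 2.31, 2.93,
                2.98, 3.55, 2.84, 1.64, 3.22, 2.87, 2.37, 1.91]

q1 = quartile(speed_before, 0.25)
q3 = quartile(speed_before, 0.75)
print(median(speed_before), q1, q3, q3 - q1)  # 2.90, 2.34, 3.045, 0.705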
The box plot

A box plot displays the median and interquartile range, along with other quantities of the frequency distribution. We introduced the box plot in Chapter 2. Figure 3.2-2 shows a box plot for the spider running speeds, with data before and after amputation plotted separately. The lower and upper edges of the box are the first and third quartiles. Thus, the interquartile range is visualized by the span of the box. The horizontal line dividing each box is the median. The whiskers extend outward from the box at each end, stopping at the smallest and largest "nonextreme" values in the data. "Extreme" values are defined as those lying farther from the box edge than 1.5 times the interquartile range. Extreme values are plotted as isolated dots past the ends of the whiskers. There is one extreme value in the box plots shown in Figure 3.2-2, the smallest measurement of running speed before amputation.

FIGURE 3.2-2 Box plot of the running speeds of 16 male spiders before and after self-amputation of a pedipalp.

3.3 How measures of location and spread compare

Which measure of location, the sample mean or the median, is most revealing about the center of a distribution of measurements? And which measure of spread, the standard deviation or the interquartile range, best describes how widely the observations are scattered about the center? The answer depends on the shape of the frequency distribution. These alternative measures of location and of spread yield similar information when the frequency distribution is symmetric and unimodal. The mean and standard deviation become less informative than the median and interquartile range when the data are strongly skewed or include extreme observations. We compare these measures using Example 3.3.

EXAMPLE 3.3 Disarming fish

The marine threespine stickleback is a small coastal fish named for its defensive armor. It has three sharp spines down its back, two pelvic spines under the belly, and a series of lateral bony plates down both sides. The armor seems to reduce mortality from predatory fish and diving birds. In contrast, in lakes and streams, where predators are fewer, stickleback populations have reduced armor. (See the photo at the right for examples of different types. Bony tissue has been stained red to make it more visible.) Colosimo et al. (2004) measured the grandchildren of a cross made between a marine and a freshwater stickleback. The study found that much of the difference in number of plates is caused by a single gene, Ectodysplasin. Fish inheriting two copies of the gene from the marine grandparent, called MM fish, had many plates (the top histogram in Figure 3.3-1). Fish inheriting both copies of the gene from the freshwater grandparent (mm) had few plates (the bottom histogram in Figure 3.3-1). Fish having one copy from each grandparent (Mm) had any of a wide range of plate numbers (the middle histogram in Figure 3.3-1).

FIGURE 3.3-1 Frequency distributions of lateral plate number in three genotypes of stickleback, MM, Mm, and mm, descended from a cross between marine and freshwater grandparents. Plates are counted as the total number down the left and right sides of the fish. The total number of fish: 82 (MM), 174 (Mm), and 88 (mm).

Mean versus median

The mean and median of the three distributions in Figure 3.3-1 are compared in Table 3.3-1. The two measures of location give similar values in the case of the MM and mm genotypes, whose distributions are fairly symmetric, although one or two outliers are present. The mean is smaller than the median in the case of the Mm fish, whose distribution is strongly asymmetric.

TABLE 3.3-1 Descriptive statistics for the number of lateral plates of the three genotypes of threespine sticklebacks discussed in Example 3.3.

Genotype    n      Mean    Median    Standard deviation    Interquartile range
MM           82    62.8    63         3.4                   2
Mm          174    50.4    59        15.1                  21
mm           88    11.7    11         3.6                   3

Why are the median and mean different from one another when the distribution is asymmetric? The answer, shown in Figure 3.3-2, is that the median is the middle measurement of a distribution, whereas the mean is the "center of gravity."

FIGURE 3.3-2 Comparison between the median and the mean using the frequency distribution for the Mm genotype (middle panel of Figure 3.3-1). The median is the middle measurement of the distribution (different colors represent the two halves of the distribution). The mean is the center of gravity, the point at which the frequency distribution would be balanced (if observations had weight).

The balancing act illustrated in Figure 3.3-2 suggests that the mean is sensitive to extreme observations. To demonstrate, imagine taking the four smallest observations of the MM genotype (top panel in Figure 3.3-1) and moving them far to the left. The median would be completely unaffected, but the mean would shift leftward to a point near the edge of the range of most observations (Figure 3.3-3).

FIGURE 3.3-3 Sensitivity of the mean to extreme observations using the frequency distribution of the MM genotypes (see the upper panel in Figure 3.3-1). The two different colors represent the two halves of the distribution. When the four smallest observations of the MM genotype are shifted far to the left (lower panel), the mean is displaced downward, to the edge of the range of the bulk of the observations. The median, on the other hand, which is located where the two colors meet, is unaffected by the shift.

Median and mean measure different aspects of the location of a distribution. The median is the middle value of the data, whereas the mean is its center of gravity.

Thus, the mean is displaced from the location of the "typical" measurement when the frequency distribution is strongly skewed, particularly when there are extreme observations. The mean is still useful as a description of the data as a whole, but it no longer indicates where most of the observations are located. The median is less sensitive to extreme observations, and hence the median is the more informative descriptor of the typical observation in such instances. However, the mean has better mathematical properties, and it is easier to calculate measures of the reliability of estimates of the mean.
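This sensitivity is easy to demonstrate numerically. In the small hypothetical sample below (ours, not the stickleback data), dragging the smallest value far to the left shifts the mean substantially but leaves the median untouched:

# Sensitivity of the mean (but not the median) to extreme observations,
# illustrated with a small hypothetical sample of plate counts.
from statistics import mean, median

plates = [60, 61, 62, 63, 64, 65, 66]
print(mean(plates), median(plates))  # 63, 63

plates_shifted = [0, 61, 62, 63, 64, 65, 66]  # smallest value moved far left
print(mean(plates_shifted), median(plates_shifted))  # about 54.4, still 63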
Standard deviation versus interquartile range

Because it is calculated from the square of the deviations, the standard deviation is even more sensitive to extreme observations than is the mean. When the four smallest observations of the MM genotype are shifted far to the left, such that the smallest is set to zero (Figure 3.3-3), the standard deviation jumps from 3.4 to 12.0, whereas the interquartile range is not affected. For this reason, the interquartile range is a better indicator of the spread of the main part of a distribution than the standard deviation when the data are strongly skewed to one side or the other, especially when there are extreme observations. On the other hand, the standard deviation reflects the variation among all of the data points.

3.4 Cumulative frequency distribution

The median and quartiles are examples of percentiles, or quantiles, of the frequency distribution for a numerical variable. Plotting all the quantiles using the cumulative frequency distribution is another way to compare the shapes and positions of two or more frequency distributions.

Percentiles and quantiles

The Xth percentile of a sample is the value below which X percent of the individuals lie. For example, the median, the measurement that splits a frequency distribution into equal halves, is the 50th percentile. Ten percent of the observations lie below the 10th percentile, and the other 90% of observations exceed it. The first and third quartiles are the 25th and 75th percentiles, respectively.

The percentile of a measurement specifies the percentage of observations less than or equal to it; the remaining observations exceed it. The quantile of a measurement specifies the fraction of observations less than or equal to it.

The same information in a percentile is sometimes represented as a quantile. This only means that the proportion less than or equal to the given value is represented as a decimal rather than as a percentage. For example, the 10th percentile is the 0.10 quantile, and the median is the 0.50 quantile. Be careful not to mix up the words quantile and quartile (note the difference in the fourth letters). The first and third quartiles are the 0.25 and 0.75 quantiles.
Displaying cumulative relative frequencies

All the quantiles of a numerical variable can be displayed by graphing the cumulative frequency distribution. The cumulative relative frequency at a given measurement is the fraction of observations less than or equal to that measurement. Figure 3.4-1 shows the cumulative frequency distribution of the running speeds of male spiders before amputation. The raw data are from Table 3.2-1.

To make this graph, all the measurements of running speed (before amputation) were sorted from smallest to largest. Next, the fraction of observations less than or equal to each data value was calculated. This fraction, which is called the cumulative relative frequency, is indicated by the height of the curve in Figure 3.4-1 at the corresponding data value. Finally, these points were connected with straight lines to form an ascending curve. The result is an irregular sequence of "steps" from the smallest data value to the largest data value. Each step is flat, but the curve jumps up by $1/n$ at every observed measurement, where $n$ is the total number of observations (here, 16 spiders), to a maximum of 1. There may be multiple jumps at one measurement if multiple data points have the same measurement.

FIGURE 3.4-1 The cumulative frequency distribution of male spiders before amputation (solid curve). Horizontal dotted lines indicate the cumulative relative frequencies 0.25 (lower) and 0.75 (upper); vertical lines indicate corresponding 0.25 and 0.75 quantiles of running speed (2.34 and 3.045). The data are from Table 3.2-1. n = 16 spiders.

The curve in Figure 3.4-1 shows a lot of information because all the data points are represented. We can see that one-fourth of the observations (corresponding to a cumulative relative frequency of 0.25) had running speeds below 2.34, which is the value of the first quartile calculated earlier. Three-fourths of all observations lie below 3.045, which is the value of the third quartile calculated earlier. Both these values are indicated in Figure 3.4-1 with the dashed lines. Because of their simplicity and ease of interpretation, the histogram and box plot are usually superior to the cumulative frequency distribution for showing the data. However, with practice, the cumulative frequency distribution can be very useful, especially to compare frequency distributions of multiple groups.
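Computing cumulative relative frequencies requires only sorting and counting, as in this illustrative Python sketch for the spider data:

# Cumulative relative frequencies for the spider running speeds:
# the fraction of observations less than or equal to each value.
speed_before = [1.25, 2.94, 2.38, 3.09, 3.41, 3.00, 2.31, 2.93,
                2.98, 3.55, 2.84, 1.64, 3.22, 2.87, 2.37, 1.91]

n = len(speed_before)
for i, y in enumerate(sorted(speed_before), start=1):
    print(y, i / n)  # the curve steps up by 1/n at each observation

Plotting these pairs, with flat steps between them, yields the step curve shown in Figure 3.4-1.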
3.5 Proportions

The proportion is the most important descriptive statistic for a categorical variable.

Calculating a proportion

The proportion of observations in a given category, symbolized $\hat{p}$, is calculated as
$$\hat{p} = \frac{\text{number in category}}{n},$$
where the numerator is the number of observations in the category of interest, and $n$ is the total number of observations in all categories combined. For example, of the 344 individual sticklebacks in Example 3.3, 82 had genotype MM, 88 were mm, and 174 were Mm (Table 3.3-1). The proportion of MM fish is
$$\hat{p} = \frac{82}{344} = 0.238.$$
The other proportions are calculated similarly, and all three proportions are listed in Table 3.5-1.

TABLE 3.5-1 The number of fish of each genotype from a cross between a marine stickleback and a freshwater stickleback (Example 3.3). As written, the sum of the proportions does not add precisely to one because of rounding.

Genotype    Frequency    Proportion
MM           82          0.24
Mm          174          0.51
mm           88          0.26
Total       344          1.00

The proportion is like a sample mean

The proportion $\hat{p}$ has properties in common with the arithmetic mean. To see this, let's create a new numerical variable $Y$ for the stickleback study. Give individual fish $i$ the value $Y_i = 1$ if it has the MM genotype, and give it the value $Y_i = 0$ otherwise. The sum of all the ones and zeroes, $\sum Y_i$, is the frequency of fish having genotype MM. The mean of the ones and zeroes is
$$\bar{Y} = \frac{\sum Y_i}{n} = \frac{82}{344} = 0.238,$$
which is just $\hat{p}$, the proportion of observations in the first category. If we imagine the $Y$ measurements to have weight, then the proportion is their center of gravity (Figure 3.5-1).

FIGURE 3.5-1 The distribution of Y, where Y = 1 if a stickleback is genotype MM and 0 otherwise. The mean of Y is the proportion of MM individuals in the sample (0.238).

3.6 Summary

■ The location of a distribution for a numerical variable can be measured by its mean or by its median. The mean gives the center of gravity of the distribution and is calculated as the sum of all measurements divided by the number of measurements. The median gives the middle value.
■ The standard deviation measures the spread of a distribution for a numerical variable. It is a measure of the typical distance between observations and the mean. The variance is the square of the standard deviation.
■ The quartiles break the ordered observations into four equal parts. The interquartile range, the difference between the first and third quartiles, is another measure of the spread of a frequency distribution.
■ The mean and median yield similar information when the frequency distribution of the measurements is symmetric and unimodal. The mean and standard deviation become less informative about the location and spread of typical observations than the median and interquartile range when the data include extreme observations.
■ The percentile of a measurement specifies the percentage of observations less than or equal to it. The quantile of a measurement specifies the fraction of observations less than or equal to it.
■ All the quantiles of a sample of data can be shown using a graph of the cumulative frequency distribution.
■ The proportion is the most important descriptive statistic for a categorical variable. It is calculated by dividing the number of observations in the category of interest by $n$, the total number of observations in all categories combined.

3.7 Quick Formula Summary

Table of formulas for descriptive statistics

Quantity                    Formula
Sample size                 $n$
Mean                        $\bar{Y} = \frac{\sum Y_i}{n}$
Variance                    $s^2 = \frac{\sum (Y_i - \bar{Y})^2}{n-1}$; shortcut formula: $s^2 = \frac{\sum Y_i^2 - n\bar{Y}^2}{n-1}$
Standard deviation          $s = \sqrt{\frac{\sum (Y_i - \bar{Y})^2}{n-1}}$; shortcut formula: $s = \sqrt{\frac{\sum Y_i^2 - n\bar{Y}^2}{n-1}}$
Sum of squares              $\sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - n\bar{Y}^2$
Coefficient of variation    $\mathrm{CV} = \frac{s}{\bar{Y}} \times 100\%$
Median                      $Y_{([n+1]/2)}$ if $n$ is odd; $[Y_{(n/2)} + Y_{(n/2+1)}]/2$ if $n$ is even, where $Y_{(1)}, Y_{(2)}, \ldots, Y_{(n)}$ are the ordered observations
Proportion                  $\hat{p} = \frac{\text{number in category}}{n}$

Effect of arithmetic operations on descriptive statistics

The table below lists the effect on the descriptive statistics of adding or multiplying all the measurements by a constant. The rules listed in the table are useful when converting measurements from one system of units to another, such as English to metric or degrees Fahrenheit to degrees Celsius.

Statistic              Value    Adding a constant c to all      Multiplying all the measurements
                                the measurements, Y′ = Y + c    by a constant c, Y′ = cY
Mean                   Ȳ        Ȳ′ = Ȳ + c                      Ȳ′ = cȲ
Standard deviation     s        s′ = s                          s′ = |c|s
Variance               s²       s′² = s²                        s′² = c²s²
Median                 M        M′ = M + c                      M′ = cM
Interquartile range    IQR      IQR′ = IQR                      IQR′ = |c|IQR

PRACTICE PROBLEMS

1. Calculation practice: Basic descriptive stats. Systolic blood pressure was measured (in units of mm Hg) during preventative health examinations on people in Dallas, Texas. Here are the measurements for a subset of these patients.
112, 128, 108, 129, 125, 153, 155, 132, 137
a. How many individuals are in the sample (i.e., what is the sample size, n)?
b. What is the sum of all of the observations?
c. What is the mean of this sample? Here and forever after, provide units with your answer.
d. What is the sum of the squares of the measurements?
e. What is the variance of this sample?
f. What is the standard deviation of this sample?
g. What is the coefficient of variation for this sample?

2. Calculation practice: Box plots. Here is another sample of systolic blood pressure (in units of mm Hg), this time with all 101 data points. The mean is 122.73 and the standard deviation is 13.83.
88, 88, 92, 96, 96, 100, 102, 102, 104, 104, 105, 105, 105, 107, 107, 108, 110, 110, 110, 111, 111, 112, 113, 114, 114, 115, 115, 116, 116, 117, 117, 117, 117, 117, 117, 119, 119, 120, 121, 121, 121, 121, 121, 121, 122, 122, 123, 123, 123, 123, 123, 124, 124, 124, 124, 125, 125, 125, 126, 126, 126, 126, 126, 127, 127, 128, 128, 128, 128, 129, 129, 130, 131, 131, 131, 131, 131, 131, 133, 133, 133, 134, 135, 136, 136, 136, 138, 138, 139, 139, 141, 142, 142, 142, 143, 144, 146, 146, 147, 155, 156
a. What is the median of this sample?
b. What is the upper (third) quartile (or 75th percentile)?
c. What is the lower (first) quartile (or 25th percentile)?
d. What is the interquartile range (IQR)?
e. Calculate the upper quartile plus 1.5 times the IQR. Is this greater than the largest value in the data set?
f. Calculate the lower quartile minus 1.5 times the IQR. Is this less than the smallest value in the data set?
g. Plot the data in a box plot. (A rough sketch by hand is appropriate, as long as the correct values are shown for each critical point.)

3. A review of the performance of hospital gynecologists in two regions of England measured the outcomes of patient admissions under each doctor's care (Harley et al. 2005). One measurement taken was the percentage of patient admissions made up of women under 25 years old who were sterilized. We are interested in describing what constitutes a typical rate of sterilization, so that the behavior of atypical doctors can be better scrutinized. The frequency distribution of this measurement for all doctors is plotted in the following graph.
a. Explain what the vertical axis measures.
b. What would be the best choice to describe the location of this frequency distribution, the mean or the median, if our goal was to describe the typical individual? Why?
c. Do you see any evidence that might lead to further investigation of any of the doctors?

4. The data displayed in the plot below are from a nearly complete record of body masses of the world's native mammals (in grams, then converted to log base 10; Smith et al. 2003). The data were divided into three groups: those surviving from the last ice age to the present day (n = 4061), those who went extinct around the end of the last ice age (n = 227), and those driven extinct within the last 300 years (recent; n = 44).
a. What type of graph is this?
b. What does the horizontal line in the center of each rectangle represent?
c. What are the horizontal lines at the top and bottom edges of each rectangle supposed to represent?
d. What are the data points (indicated by "—") lying outside the rectangle?
e. What are the vertical lines extending above and below each rectangle?
f. Compare the locations of the three body-size distributions. How do they differ?
g. Compare the shapes of the three frequency distributions. Which are relatively symmetric and which are asymmetric? Explain your reasoning.
h. Which group's frequency distribution has the lowest spread? Explain your reasoning.
i. What has been the likely effect of ice-age and recent extinctions on the median body size of mammals?

5. Mehl et al.
(2007) wired 396 volunteers with electronically activated recorders that allowed the researchers to calculate the number of words each individual spoke, on average, per 17-hour waking day. They found that the mean number of words spoken was only slightly higher for the 210 women (16,215 words) than for the 186 men (15,669) in the sample. The frequency distribution of the number of words spoken by all individuals of each sex is shown in the accompanying graphs (modified from Mehl et al. 2007).
a. What type of graph is shown?
b. What are the explanatory and response variables in the figure?
c. What is the mode of the frequency distribution of each sex?
d. Which sex likely has the higher median number of words spoken per day?
e. Which sex had the highest variance in number of words spoken per day?

6. The following data are measurements of body mass, in grams, of finches captured in mist nets during a survey of various habitats in Kenya, East Africa (Schluter 1988).
Crimson-rumped waxbill: 8, 8, 8, 8, 8, 8, 8, 6, 7, 7, 7, 8, 8, 8, 7, 7, 7
Cutthroat finch: 16, 16, 16, 12, 16, 15, 15, 17, 15, 16, 15, 16
White-browed sparrow weaver: 40, 43, 37, 38, 43, 33, 35, 37, 36, 42, 36, 36, 39, 37, 34, 41
a. Calculate the mean body mass of each of these three finch species. Which species is largest, and which is smallest?
b. Which species has the greatest standard deviation in body mass? Which has the least?
c. Calculate the coefficient of variation (CV) in mass for each finch species. How different are the coefficients between the species? Compare the difference in CVs with the differences in standard deviation calculated in part (b).
d. The following measurements are of another trait, beak length, in mm, of the 16 white-browed sparrow weavers. Which measurement is more variable in this species (relative to the mean), body mass or beak length?
10.6, 10.8, 10.9, 11.3, 10.9, 10.1, 10.7, 10.7, 10.9, 11.4, 10.8, 11.2, 10.7, 10.0, 10.1, 10.7

7. The spider data in Example 3.2 consist of pairs of measurements made on the same subjects. One measurement is running speed before amputation and the second is running speed after amputation. Calculate a new variable called "change in speed," defined as the speed of each spider after amputation minus its speed before amputation.
a. What are the units of the new variable?
b. Draw a box plot for the change in running speed. Use the method outlined in Section 3.2 to calculate the quartiles.
c. Based on your drawing in part (b), is the frequency distribution of the change in running speed symmetric or asymmetric? Explain how you decided this.
d. What is the quantity measured by the span of the box in part (b)?
e. Calculate the mean change in running speed. Is it the same as the median? Why or why not?
f. Calculate the variance of the change in running speed.
g. What fraction of observations fall within one standard deviation above and below the mean?

8. Refer to the previous problem. If you were to convert all of the observations of change in running speed from cm/s into mm/s, how would this change
a. the mean?
b. the standard deviation?
c. the median?
d. the interquartile range?
e. the coefficient of variation?
f. the variance?

9. Niderkorn's (1872; from Pounder 1995) measurements on 114 human corpses provided the first quantitative study on the development of rigor mortis. The data in the following table give the number of bodies achieving rigor mortis in each hour after death, recorded in one-hour intervals.
Hours    Number of bodies
 1         0
 2         2
 3        14
 4        31
 5        14
 6        20
 7        11
 8         7
 9         4
10         7
11         1
12         1
13         2
Total    114

a. Calculate the mean number of hours after death that it took for rigor mortis to set in.
b. Calculate the standard deviation in the number of hours until rigor mortis.
c. What fraction of observations lie within one standard deviation of the mean (i.e., between the value $\bar{Y} - s$ and the value $\bar{Y} + s$)?
d. Calculate the median number of hours until rigor mortis sets in. What is the likely explanation for the difference between the median and the mean?

10. The following graph shows the population growth rates of the 204 countries recognized by the United Nations. Growth rate is measured as the average annual percent change in the total human population between 2000 and 2004 (United Nations Statistics Division 2004).
a. Identify the type of graph depicted.
b. Explain the quantity along the y-axis.
c. Approximately what percentage of countries had a negative change in population?
d. Identify by eye the 0.10, 0.50, and 0.90 quantiles of change in population size.
e. Identify by eye the 60th percentile of change in population size.

11. Refer to the previous problem.
a. Draw a box plot using the information provided in the graph in that problem.
b. Label three features of this box plot.

12. Spot the flaw. The accompanying table shows means and standard deviations for the length of migration on a microgel of 20 lymphocyte cells exposed to X-irradiation. The length of migration is an indication of DNA damage suffered by the cells. The data are from Singh et al. (1988).
a. Identify the main flaw in the construction of this table.
b. Redraw the table following the principles recommended in this chapter and Chapter 2.

TABLE FOR PROBLEM 12

X-ray dose    Mean     Standard deviation
Control        3.70    1.10
25 rads        5.27    1.19
50 rads       12.37    4.69
100 rads      23.30    3.27
200 rads      29.80    2.99

13. The following graph illustrates an association between two variables. It shows percent changes in the range sizes of different species of native butterflies (red), birds (blue), and plants (black) of Britain over the past two to four decades (modified from Thomas et al. 2004). Identify (a) the type of graph, (b) the explanatory and response variables, and (c) the type of data (whether numerical or categorical) for each variable.

ASSIGNMENT PROBLEMS

14. The gene for the vasopressin receptor V1a is expressed at higher levels in the forebrain of monogamous vole species than in promiscuous vole species. Can expression of this gene influence monogamy? To test this, Lim et al. (2004) experimentally enhanced V1a expression in the forebrain of 11 males of the meadow vole, a solitary promiscuous species. The percentage of time each male spent huddling with the female provided to him (an index of monogamy) was recorded. The same measurements were taken in 20 control males left untreated.
Control males: 98, 96, 94, 88, 86, 82, 77, 74, 70, 60, 59, 52, 50, 47, 40, 35, 29, 13, 6, 5
V1a-enhanced males: 100, 97, 96, 97, 93, 89, 88, 84, 77, 67, 61
a. Display these data in a graph. Explain your choice of graph.
b. Which group has the higher mean percentage of time spent huddling with females?
c. Which group has the higher standard deviation in percentage of time spent huddling with females?

15. The data in the accompanying table are from an ecological study of the entire rainforest community at El Verde in Puerto Rico (Waide and Reagan 1996). Diet breadth is the number of types of food eaten by an animal species.
The number of animal species having each diet breadth is shown in the second column. The total number of species listed is n = 127.

Diet breadth (number of prey types eaten)    Frequency (number of species)
 1                                            21
 2                                             8
 3                                             9
 4                                            10
 5                                             8
 6                                             3
 7                                             4
 8                                             8
 9                                             4
10                                             4
11                                             4
12                                             2
13                                             5
14                                             2
15                                             1
16                                             1
17                                             2
18                                             1
19                                             3
20                                             2
>20                                           25
Total                                        127

a. Calculate the median number of prey types consumed by animal species in the community.
b. What is the interquartile range in the number of prey types? Use the method outlined in Section 3.2 to calculate the quartiles.
c. Can you calculate the mean number of prey types in the diet? Explain.

16. Francis Galton (1894) presented the following data on the flight speeds of 3207 "old" homing pigeons traveling at least 90 miles.
a. What type of graph is this?
b. Examine the graph and visually determine the approximate value of the mean (to the nearest 100 yards per minute). Explain how you obtained your estimate.
c. Examine the graph and visually determine the approximate value of the median (to the nearest 100 yards per minute). Explain how you obtained your estimate.
d. Examine the graph and visually determine the approximate value of the mode (to the nearest 100 yards per minute). Explain how you obtained your estimate.
e. Examine the graph and visually determine the approximate value of the standard deviation (to the nearest 50 yards per minute). Explain how you obtained your estimate.

17. A study of the endangered saiga antelope (pictured at the beginning of the chapter) recorded the fraction of females in the population that were fecund in each year between 1993 and 2001 (Milner-Gulland et al. 2003). A graph of the data is as follows:
a. Assume that you want to describe the "typical" fraction of females that are fecund in a typical year, based on these data. What would be the better choice to describe this typical fraction, the mean or the median of the measurements? Why?
b. With the same goal in mind, what would be the better choice to describe the spread of measurements around their center, the standard deviation or the interquartile range? Why?

18. Accurate prediction of the timing of death in patients with a terminal illness is important for their care. The following graph compares the survival times of terminally ill cancer patients with the clinical prediction of their survival times (modified from Glare et al. 2003).
a. Describe in words what features most of the frequency distributions of actual survival times have in common, based on the box plots for each group.
b. Describe the differences in shape of actual survival time distributions between those for one to five months predicted survival times and those for six to 24 months.
c. Describe the trend in median actual survival time with increasing predicted number.
d. The predicted survival times of terminally ill cancer patients tend to overestimate the medians of actual survival times. Are the means of actual survival times likely to be closer to, further from, or no different from the predicted times than the medians? Explain.

19. Measurements of lifetime reproductive success (LRS) of individual wild animals reveal the disparate contributions they make to the next generation. Jensen et al. (2004) estimated LRS of male and female house sparrows in an island population in Norway. They measured LRS of an individual as the total number of "recruits" produced in its lifetime, where a recruit is an offspring that survives to breed one year after birth.
Parentage of recruits was determined from blood samples using DNA techniques. Their results are tabulated as follows:

Lifetime reproductive success    Frequency
                                 Females    Males
0                                30         38
1                                25         17
2                                 3          7
3                                 6          6
4                                 8          4
5                                 4         10
6                                 0          2
7                                 4          0
8                                 1          0
>8                                0          0
Total                            81         84

a. Which sex has the higher mean lifetime reproductive success?
b. Every recruit must have both a father and a mother, so it is not easy to see why male and female LRS should differ. Can you think of a biological explanation?
c. Which sex has the higher variance in reproductive success?

20. If all the measurements in a sample of data are equal, what is the variance of the measurements in the sample?

21. Researchers have created every possible "knockout" line in yeast. Each line has exactly one gene deleted and all the other genes present (Steinmetz et al. 2002). The growth rate—how fast the number of cells increases per hour—of each of these yeast lines has also been measured, expressed as a multiple of the growth rate of the wild type that has all the genes present. In other words, a growth rate greater than 1 means that a given knockout line grows faster than the wild type, whereas a growth rate less than 1 means it grows more slowly. Below is the growth rate of a random sample of knockout lines:
0.86, 1.02, 1.02, 1.01, 1.02, 1, 0.99, 1.01, 0.91, 0.83, 1.01
a. What is the mean growth rate of this sample of yeast lines?
b. What is the median growth rate of this sample?
c. What is the variance of growth rate of the sample?
d. What is the standard deviation of growth rate of the sample?

22. As in other vertebrates, individual zebrafish differ from one another along the shy–bold behavioral spectrum. In addition to other differences, bolder individuals tend to be more aggressive, whereas shy individuals tend to be less aggressive. Norton et al. (2011) compared several behaviors associated with this syndrome between zebrafish that had the spiegeldanio (spd) mutant at the Fgfr1a gene (reduced fibroblast growth factor receptor 1a) and the "wild type" lacking the mutation. The data below are measurements of the amount of time, in seconds, that individual zebrafish with and without this mutation spent in aggressive activity over 5 minutes when presented with a mirror image.
Wild type: 0, 21, 22, 28, 60, 80, 99, 101, 106, 129, 168
Spd mutant: 96, 97, 100, 127, 128, 156, 162, 170, 190, 195
a. Draw a box plot to compare the frequency distributions of aggression score in the two groups of zebrafish. According to the box plot, which genotype has the higher aggression scores?
b. According to the box plot, which sample spans the higher range of values for aggression scores?
c. Which sample has the larger interquartile range?
d. What are the vertical lines projecting outward above and below each box?

23. A eunuch is a castrated human male. Eunuchs were often used as servants and guards in harems in Asia and the Middle East. In males of some mammal species, castration increases life span. Do male eunuchs also have long lives compared to other men? The accompanying graph shows data on life spans of 81 male eunuchs from the Korean Chosun Dynasty between about 1400 and 1900, according to historical records. These data are compared with life spans of non-eunuch males who lived at the same time, and who belonged to families of similar social status (n = 1126, 1414, and 49 for the three families shown). Modified from Min et al. (2012), with permission.
a. What type of graph is this?
b. What do the upper and lower margins of the boxes indicate?
c. Which male group had the highest median longevity?
d. Although the mean is not indicated on the graph, which sample of men probably had the highest mean longevity? Explain your reasoning.

24. As the Arctic warms and winters become shorter, hibernation patterns of arctic mammals are expected to change. Sheriff et al. (2011) investigated emergence dates from hibernation of arctic ground squirrels at sites in the Brooks Range of northern Alaska. The measurements shown in the following figure are emergence dates in a sample of male and female ground squirrels at one of their study sites.
a. What type of graph is this?
b. Which sex, males or females, has the earliest median emergence date? Explain how you obtained your answer.
c. Which sex, male or female, has the greater interquartile range in emergence date? Explain how you obtained your answer.

25. Convert the following statistics, calculated on samples in English units, to the metric equivalents. (Conversion factors are given as well.)
a. Mean: 100 miles (1 km = 0.62 miles)
b. Standard deviation: 17 miles (1 km = 0.62 miles)
c. Variance: 289 miles² (1 km = 0.62 miles)
d. Coefficient of variation: 17% (1 km = 0.62 miles)
e. Mean: 23 pounds (1 kg = 2.2 pounds)
f. Standard deviation: 1.2 ounces (1 g = 0.032 ounces)
g. Variance: 550 gallons² (1 liter = 0.227 gallons)

26. The snake undulation data of Example 3.1 were measured in Hz, which has units of 1/s (cycles per second). Often frequency measurements are expressed instead as angular velocity, which is measured in radians per second. To convert measurements from Hz to angular velocity (rad/s), multiply by 2π, where π = 3.14159.
a. The sample mean undulation rate in the snake sample was 1.375 Hz. Calculate the sample mean in units of angular velocity.
b. The sample variance of undulation rate in the snake sample was 0.105 Hz². Calculate the sample variance if the data were in units of angular velocity.
c. The sample standard deviation of undulation rate in the snake sample was 0.324 Hz. Calculate the sample standard deviation in units of angular velocity. Provide the appropriate units with your answer.

27. Spot the flaw. Crohn's disease is an autoimmune inflammatory disorder. The accompanying table shows medians and interquartile ranges for three response variables in 62 Crohn's disease patients randomly assigned either the immunosuppressant drug azathioprine (n = 32) or a placebo (n = 30) in a clinical trial. Response variables are measured as change from baseline. IQR is interquartile range. The data are from Candy et al. 1995.
a. Identify the main flaw in the construction of this table.
b. Redraw the table following the principles recommended in this book.

TABLE FOR PROBLEM 27

                                           Azathioprine       Placebo
Response variable                          Median    IQR      Median    IQR
Crohn's Disease Activity Index             191.5     211      50.0      230
Erythrocyte sedimentation rate (mm/hr)      15.5      30      −6.5       26
Serum C reactive protein (%)                30.0      53       0.0       27

28. Reproduction in sea urchins involves the release of sperm and eggs in the open ocean. Fertilization begins when a sperm bumps into an egg and the sperm protein bindin attaches to recognition sites on the egg surface. Gene sequences of bindin and egg-surface proteins vary greatly between closely related urchin species, and eggs can identify and discriminate between different sperm. In the burrowing sea urchin, Echinometra mathaei, the protein sequence for bindin varies even between populations within the same species. Do these differences affect fertilization?
To test this, Palumbi (1999) carried out trials in which a mixture of sperm from AA and BB males, referring to two populations differing in bindin gene sequence, was added to dishes containing eggs from a female from either the AA or the BB population. The results below indicate the fraction of fertilizations of eggs of each of the two types by AA sperm (remaining eggs were fertilized by BB sperm).
AA females: 0.58, 0.59, 0.69, 0.72, 0.78, 0.78, 0.81, 0.85, 0.85, 0.92, 0.93, 0.95
BB females: 0.15, 0.22, 0.30, 0.37, 0.38, 0.50, 0.95
a. Plot the data using a method other than the box plot. Is there an association in these data between female type and fertilizations by AA sperm?
b. Inspect the plot. On this basis, which method from this chapter (mean or median) would be best to compare the locations of the frequency distributions for the two groups? Explain your reasoning. Calculate and compare locations using this method.
c. Which method would be best to compare the spread of the frequency distributions for the two groups? Explain your reasoning. Calculate and compare spread using this method.

29. The following graph illustrates an association between two variables. The graph shows density of fine roots in Monterey pines (Pinus radiata) planted in three different years of study (redrawn from Moir and Bachelard 1969, with permission). Identify (a) the type of graph, (b) the explanatory and response variables, and (c) the type of data (whether numerical or categorical) for each variable.

4. Estimating with uncertainty

4.1 The sampling distribution of an estimate

Estimation is the process of inferring a population parameter from sample data. The value of an estimate calculated from data is almost never exactly the same as the value of the population parameter being estimated, because sampling is influenced by chance. The crucial question is, "In the face of chance, how much can we trust an estimate?" In other words, what is its precision? To answer this question, we need to know something about how the sampling process might affect the estimates we get. We use the sampling distribution of the estimate, which is the probability distribution of all the values for an estimate that we might have obtained when we sampled the population. We illustrate the concept of a sampling distribution using samples from a known population, the genes of the human genome.

EXAMPLE 4.1 The length of human genes

The international Human Genome Project was the largest coordinated research effort in the history of biology. It yielded the DNA sequence of all 23 pairs of human chromosomes, each consisting of millions of nucleotides chained end to end. These encode the genes whose products—RNA and proteins—shape the growth and development of each individual. We obtained the lengths of all 20,290 known and predicted genes of the published genome sequence (Hubbard et al. 2005). The length of a gene refers to the total number of nucleotides comprising the coding regions. The frequency distribution of gene lengths in the population of genes is shown in Figure 4.1-1. The figure includes only genes up to 15,000 nucleotides long; in addition, there are 26 longer genes.

FIGURE 4.1-1 Distribution of gene lengths in the known human genome. The graph is truncated at 15,000 nucleotides; 26 larger genes are too rare to be visible in this histogram.

The histogram in Figure 4.1-1 is like those we have seen before, except that it shows the distribution of lengths in the population of genes, not simply those in a sample of genes.
Because it is the population distribution, the relative frequency of genes of a given length interval in Figure 4.1-1 represents the probability of obtaining a gene of that length when sampling a single gene at random. The probability distribution of gene lengths is positively skewed, having a long tail extending to the right. The population mean and standard deviation of gene length in the human genome are listed in Table 4.1-1. These quantities are referred to as parameters because they are quantities that describe the population.

TABLE 4.1-1 Population mean and standard deviation of gene length in the known human genome.

Name                  Parameter    Value (nucleotides)
Mean                  μ            2622.0
Standard deviation    σ            2036.9

In real life, we would not usually know the parameter values of the study population, but in this case we do. We'll take advantage of this to illustrate the process of sampling.

Estimating mean gene length with a random sample

To begin, we collected a single random sample of n = 100 genes from the known human genome. A histogram of the lengths of the resulting sample of genes is shown in Figure 4.1-2.

FIGURE 4.1-2 Frequency distribution of gene lengths in a unique random sample of n = 100 genes from the human genome.

The frequency distribution of the random sample (Figure 4.1-2) is not an exact replica of the population distribution (Figure 4.1-1), because of chance. The two distributions nevertheless share important features, including approximate location, spread, and shape. For example, the sample frequency distribution is skewed to the right like the true population distribution. The sample mean and standard deviation of gene length from the sample of 100 genes are listed in Table 4.1-2. How close are these estimates to the population mean and standard deviation listed in Table 4.1-1? The sample mean is Y¯ = 2411.8, which is about 200 nucleotides shorter than the true value, the population mean of μ = 2622.0. The sample standard deviation (s = 1463.5) is also different from the standard deviation of gene length in the population (σ = 2036.9). We shouldn't be surprised that the sample estimates differ from the parameter (population) values. Such differences are virtually inevitable because of chance in the random sampling process.

TABLE 4.1-2 Mean and standard deviation of gene length Y in our unique random sample of n = 100 genes from the human genome.

Name                  Statistic    Sample value (number of nucleotides)
Mean                  Y¯           2411.8
Standard deviation    s            1463.5

The sampling distribution of Y¯

We obtained Y¯ = 2411.8 nucleotides in our single sample, but by chance we might have obtained a different value. When we took a second random sample of 100 genes, we found Y¯ = 2643.5. Each new sample will usually generate a different estimate of the same parameter. If we were able to repeat this sampling an infinite number of times, we could create the probability distribution of our estimate. The probability distribution of values we might obtain for an estimate makes up the estimate's sampling distribution.

The sampling distribution is the probability distribution of all values for an estimate that we might obtain when we sample a population.

The sampling distribution represents the "population" of values for an estimate. It is not a real population, like the squirrels in Muir Woods or all the retirees basking in the Florida sunshine. Rather, the sampling distribution is an imaginary population of values for an estimate.
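This repeated-sampling idea is easy to explore on a computer. The short Python sketch below mimics the process; because the actual gene-length data are not reproduced here, it substitutes a right-skewed lognormal population whose mean and standard deviation roughly match Table 4.1-1 (the lognormal parameters are our own illustrative assumption, not part of the original analysis).

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in "population" of 20,290 gene lengths: a right-skewed lognormal
# whose mean (~2622) and SD (~2037) roughly match Table 4.1-1. These
# parameters are assumptions for illustration, not the real data.
population = rng.lognormal(mean=7.65, sigma=0.68, size=20_290)

# One random sample of n = 100 "genes" yields one value of the estimate.
sample = rng.choice(population, size=100, replace=False)
print(sample.mean())          # a single Y-bar; it varies from sample to sample

# Repeating the sampling many times traces out the sampling distribution.
ybars = [rng.choice(population, size=100, replace=False).mean()
         for _ in range(10_000)]
print(np.mean(ybars))         # centered near the population mean (unbiased)
print(np.std(ybars))          # the spread of the sampling distribution
```

A histogram of the 10,000 simulated Y¯ values approximates a sampling distribution like the one shown next in Figure 4.1-3.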
Taking a random sample of n observations from a population and calculating Y¯ is equivalent to randomly sampling a single value of Y¯ from its sampling distribution. To visualize the sampling distribution for mean gene length, we used the computer to take a vast number of random samples of n = 100 genes from the human genome. We calculated the sample mean Y¯ each time. The resulting histogram in Figure 4.1-3 shows the values of Y¯ that might be obtained when randomly sampling 100 genes, together with their probabilities.

FIGURE 4.1-3 The sampling distribution of mean gene length, Y¯, when n = 100. Note the change in scale from Figure 4.1-2.

Figure 4.1-3 makes plain that, although the population mean μ is a constant (2622.0), its estimate Y¯ is a variable. Each new sample yields a different Y¯ value from the one before. We don't ever see the sampling distribution of Y¯ because ordinarily we have only one sample, and therefore only one Y¯. Notice that the sampling distribution for Y¯ is centered exactly on the true mean, μ. This means that Y¯ is an unbiased estimate of μ.

The spread of the sampling distribution of an estimate depends on the sample size. The sampling distribution of Y¯ based on n = 100 is narrower than that based on n = 20, and that based on n = 500 is narrower still (Figure 4.1-4). The larger the sample size, the narrower the sampling distribution. And the narrower the sampling distribution, the more precise the estimate. Thus, larger samples are desirable whenever possible because they yield more precise estimates. The same is true for the sampling distributions of estimates of other population quantities, not just Y¯.

FIGURE 4.1-4 Comparison of the sampling distributions of mean gene length, Y¯, when n = 20, 100, and 500. Increasing sample size reduces the spread of the sampling distribution of an estimate, increasing precision.

4.2 Measuring the uncertainty of an estimate

In this section we show how the sampling distribution is used to measure the uncertainty of an estimate.

Standard error

The standard deviation of the sampling distribution of an estimate is called the standard error. Because it reflects the differences between an estimate and the target parameter, the standard error reflects the precision of an estimate. Estimates with smaller standard errors are more precise than those with larger standard errors. The smaller the standard error, the less uncertainty there is about the target parameter in the population.

The standard error of an estimate is the standard deviation of the estimate's sampling distribution.

The standard error of Y¯

The standard error of the sample mean is particularly simple to calculate, so we show it here. We can represent the standard error of the mean with the symbol σY¯. It has a remarkably straightforward relationship with σ, the population standard deviation of the variable Y:

σY¯ = σ/√n.
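A quick way to convince yourself of the σ/√n relationship is to compare it with the empirical spread of simulated sample means. Here is a minimal sketch, again using the illustrative lognormal stand-in population from the previous snippet rather than the real gene-length data:

```python
import numpy as np

rng = np.random.default_rng(2)
# The same illustrative lognormal stand-in population as before.
population = rng.lognormal(mean=7.65, sigma=0.68, size=20_290)
sigma = population.std()      # here the population SD is known

for n in (20, 100, 500):
    # Empirical spread of Y-bar over many repeated random samples...
    ybars = [rng.choice(population, size=n, replace=False).mean()
             for _ in range(5_000)]
    # ...should agree with the theoretical standard error, sigma / sqrt(n).
    print(n, round(np.std(ybars), 1), round(sigma / np.sqrt(n), 1))
```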
The standard error decreases with increasing sample size. Table 4.2-1 lists the standard error of the sample mean based on random samples of n = 20, 100, and 500 from the known human genome.

TABLE 4.2-1 Standard error of the sampling distributions of mean gene length Y¯ according to sample size. These measure the spread of the three sampling distributions in Figure 4.1-4.

Sample size, n    Standard error, σY¯ (nucleotides)
20                455.5
100               203.7
500               91.1

The standard error of Y¯ from data

The trouble with the formula for the standard error of the mean (σY¯) is that we almost never know the value of the population standard deviation (σ), and so we cannot calculate σY¯. The next best thing is to approximate the standard error of the mean by using the sample standard deviation (s) as an estimate of σ. To show that it is approximate, we will use the symbol SEY¯. The approximate standard error of the mean is

SEY¯ = s/√n.

According to this simple relationship, all we need is one random sample to approximate the spread of the entire sampling distribution for Y¯. The quantity SEY¯ is usually called the "standard error of the mean."

The standard error of the mean is estimated from data as the sample standard deviation (s) divided by the square root of the sample size (n).

Calculating SEY¯ is so routine in biology that a sample mean should never be reported without it. For example, if we were submitting the results of our unique random sample of 100 genes in Figure 4.1-2 for publication, we would calculate SEY¯ from the results in Table 4.1-2 as follows:

SEY¯ = s/√n = 1463.5/√100 = 146.3.

We would then report the sample mean in the text of the paper as 2411.8 ± 146.3 (SE). Every estimate, not just the mean, has a sampling distribution with a standard error, including the proportion, median, correlation, difference between means, and so on. In the rest of this book, we will give formulas to calculate standard errors of many kinds of estimates.

The standard error is the usual way to indicate uncertainty of an estimate.

4.3 Confidence intervals

The confidence interval is another common way to quantify uncertainty about the value of a parameter. It is a range of numbers, calculated from the data, that is likely to contain within its span the unknown value of the target parameter. In this section, we introduce the concept without showing exact calculations. Confidence intervals can be calculated for means, proportions, correlations, differences between means, and other population parameters, as later chapters will demonstrate.

A confidence interval is a range of values surrounding the sample estimate that is likely to contain the population parameter.

An example is the 95% confidence interval for the mean. This confidence interval is a range likely to contain the value of the true population mean μ. It is calculated from the data and extends above and below the sample estimate Y¯. You will encounter confidence intervals frequently in the biological literature. We'll show you in Chapter 11 how to calculate an exact confidence interval for the mean, but for now we give you the result and its interpretation. The 95% confidence interval for the mean calculated from the unique sample of 100 genes (Table 4.1-2) is

2121.4 < μ < 2702.2.

For this example, 2121.4 is the lower limit of the confidence interval, whereas 2702.2 is the upper limit.

The 95% confidence interval provides a most-plausible range for a parameter. Values lying within the interval are most plausible, whereas those outside are less plausible, based on the data.
This calculation allows us to say, "We are 95% confident that the true mean lies between 2121.4 and 2702.2 nucleotides." We do not say that "there is a 95% probability that the population mean falls between 2121.4 and 2702.2 nucleotides," which is a common misinterpretation of the confidence interval (2121.4 and 2702.2 are both constants, and the true mean either is or is not between them, so there's no probability involved). To better understand the correct interpretation of "95% confidence," imagine that 20 researchers independently take unique random samples of n = 100 genes from the human genome. Each researcher calculates an estimate Y¯ and then a 95% confidence interval for the parameter (the population mean, μ). Each researcher ends up with a different estimate and a different 95% confidence interval, because by chance their samples are not the same (Figure 4.3-1). On average, however, 19 out of 20 (95%) of the researchers' intervals will contain the value of the population parameter. On average, therefore, one out of 20 intervals (5%) will not contain the parameter value. None of the researchers will know for sure whether his or her own confidence interval contains the value of the unknown parameter, but each can be "95% confident" that it does.

FIGURE 4.3-1 The 95% confidence intervals for the mean calculated from 20 separate random samples of n = 100 genes from the known human genome. Dots indicate the sample means Y¯. The vertical line (gold) represents the population mean, μ = 2622.0. In this example, 19 of 20 intervals included the population mean, whereas one interval did not.

All numbers falling between the lower and upper bounds of a confidence interval can be regarded as the most plausible values for the parameter, given the data sampled. Values falling outside the confidence interval are less plausible. For example, on the basis of our random sample of 100 genes we can say that a mean gene length of 2000 nucleotides in the whole genome is implausible, because it falls outside the 95% confidence interval, 2121.4 < μ < 2702.2. However, a mean gene length of 2500 nucleotides falls within the 95% confidence interval, and so remains plausible. In general, the width of the 95% confidence interval is a good measure of our uncertainty about the true value of the parameter. If the confidence interval is broad, then uncertainty is high and the data are not very informative about the location of the population parameter. If the confidence interval is narrow, on the other hand, then we can be confident that the parameter is close to the estimated value.

The 2SE rule of thumb

A good "quick-and-dirty" approximation to the 95% confidence interval for the population mean is obtained by adding and subtracting two standard errors from the sample mean (the so-called 2SE rule of thumb). This calculation assumes that the sample is a random sample.

A rough approximation to the 95% confidence interval for a mean can be calculated as the sample mean plus and minus two standard errors.

For our unique random sample of 100 genes (Figure 4.1-2), for example, the sample mean of gene length was Y¯ = 2411.8 nucleotides, and its standard error was SEY¯ = 146.3 nucleotides. Two standard errors below the mean provides the lower limit of the approximate confidence interval:

Y¯ − 2SEY¯ = 2411.8 − (2 × 146.3) = 2119.2

and two standard errors above the mean provides the upper limit:

Y¯ + 2SEY¯ = 2411.8 + (2 × 146.3) = 2704.4.

According to the 2SE rule of thumb, then, the 95% confidence interval for the mean gene length in the population can be approximated as

2119.2 < μ < 2704.4.

This is not too far off from the more exact confidence interval (i.e., between 2121.4 and 2702.2 nucleotides) that we calculated previously. Although approximate, the 2SE rule is simple and works reasonably well.
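The "19 intervals out of 20" interpretation can be checked directly by simulation. The sketch below repeats the 20-researchers thought experiment using the 2SE rule of thumb, once more with our made-up lognormal stand-in for the gene-length population (the real data are not included here):

```python
import numpy as np

rng = np.random.default_rng(3)
population = rng.lognormal(mean=7.65, sigma=0.68, size=20_290)  # stand-in
mu = population.mean()

captured = 0
for _ in range(20):                          # 20 independent "researchers"
    sample = rng.choice(population, size=100, replace=False)
    se = sample.std(ddof=1) / np.sqrt(100)   # SE estimated from the data
    lo, hi = sample.mean() - 2 * se, sample.mean() + 2 * se
    captured += (lo < mu < hi)
print(captured, "of 20 intervals captured the population mean")  # usually ~19
```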
4.4 Error bars

Standard errors or confidence intervals for the mean (and other parameters) are often illustrated graphically with "error bars." Error bars are lines on a graph that extend outward from the sample estimate to illustrate the precision of estimates, reflecting uncertainty about the value of the parameter being estimated. For example, Figure 4.4-1 reproduces the strip chart of locust serotonin data shown previously in Chapter 2 (Figure 2.1-2) but adds error bars to illustrate the standard error of the sample mean serotonin level in each of the three experimental treatments. The lines projecting outward from the sample mean indicate one standard error above the mean and one standard error below the mean. Remember that standard error bars, unlike the whiskers on a box plot, are not intended to span a specified fraction of the data. Error bars indicate uncertainty about the population parameter, not variability in the data (even though variability in the data contributes to uncertainty).

FIGURE 4.4-1 Strip chart of locust serotonin data (from Figure 2.1-2) with error bars added to illustrate the standard error (SE) of the mean for each treatment. Each filled black dot indicates the sample mean. Lines projecting outward indicate one SE above the mean and one SE below the mean.

Error bars are lines on a graph extending outward from the sample estimate to illustrate uncertainty about the value of the parameter being estimated.

Error bars are used for multiple purposes and so don't always show the same measure of precision. Often they are used to illustrate standard errors, but sometimes error bars show confidence intervals instead. They may even indicate two standard errors rather than one. For example, Figure 4.4-2 draws an error bar for each of these three measures of precision for the mean of the same random sample. Notice that because they are measuring different quantities, the three error bars do not have the same span. Therefore it is crucial to read carefully the caption of any figure that has error bars to determine which measure of uncertainty is being shown. (And when you draw graphs yourself, it is important to give this information clearly in the figure legend.) Most commonly, error bars are used either for 95% confidence intervals or for standard errors. Because these two quantities differ approximately by a factor of two, you can see how knowing the meaning of the error bars is important.

FIGURE 4.4-2 Comparison of alternative error bars calculated from gene lengths in the random sample of n = 100 genes (Example 4.1). The data are plotted as a strip chart on the left. The filled black circles indicate the sample mean of gene length, 2411.8 nucleotides. The leftmost error bar visualizes one standard error of the mean (SE) above and below the sample mean. The line extending above the black dot indicates one SE above the mean; the line extending below indicates one SE below the mean. The adjacent error bar indicates two standard errors above and below the mean. The third error bar indicates the 95% confidence interval for the mean. The rightmost error bar indicates one standard deviation above and below the sample mean.

Finally, error bars are sometimes used to indicate the standard deviation of the data, but we recommend against this practice to minimize confusion. Error bars are a poor method for illustrating variability in the data, and they are redundant if you show the data. We added an error bar for the standard deviation to Figure 4.4-2 only to show how different—and potentially misleading—it can be.

Use error bars only to illustrate the precision of estimates, not variability in the data.
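If you draw graphs in Python, mean ± SE error bars of the kind shown in Figure 4.4-1 take only a few lines with matplotlib. A minimal sketch with invented data for three hypothetical treatment groups (all names and numbers below are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
# Three hypothetical treatment groups; all numbers are invented.
groups = {"control": rng.normal(10, 3, 25),
          "low dose": rng.normal(12, 3, 25),
          "high dose": rng.normal(15, 3, 25)}

means = [g.mean() for g in groups.values()]
ses = [g.std(ddof=1) / np.sqrt(len(g)) for g in groups.values()]

# Bars span one standard error above and below each sample mean.
plt.errorbar(range(len(groups)), means, yerr=ses, fmt="o", capsize=4)
plt.xticks(range(len(groups)), list(groups.keys()))
plt.ylabel("Mean response (arbitrary units) ± SE")
plt.show()
```

As the text advises, whatever the bars show (SE, 2SE, or a confidence interval) should be stated in the figure legend.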
4.5 Summary

■ Estimation is the process of inferring a population parameter from sample data.
■ All estimates have a sampling distribution, which is the probability distribution of all the possible values of the estimate that might be obtained under random sampling with a given sample size.
■ The standard error of an estimate is the standard deviation of its sampling distribution. The standard error measures precision. The smaller the standard error, the more precise is the estimate.
■ The usual formulas for standard errors and confidence intervals assume that sampling is random.
■ The standard error of an estimate declines with increasing sample size.
■ The confidence interval is a range of values calculated from sample data that is likely to contain within its span the value of the target parameter. On average, 95% confidence intervals calculated from independent random samples will include the value of the parameter 19 times out of 20.
■ The 2SE rule of thumb (i.e., the sample mean plus or minus two standard errors) provides a rough approximation to the 95% confidence interval for a mean.
■ Add error bars to graphs to illustrate standard errors or confidence intervals. Make sure to clarify which is being illustrated in the figure legend.

4.6 Quick Formula Summary

Standard error of the mean
What is it for? Measuring the precision of the sample estimate Y¯ of the population mean μ.
What does it assume? The sample is a random sample.
Estimate: SEY¯
Parameter: σY¯
Formula: SEY¯ = s/√n, where s is the sample standard deviation and n is the sample size. SEY¯ estimates the quantity σY¯ = σ/√n, where σY¯ is the standard error of the sample mean, and σ is the standard deviation of Y in the population.

PRACTICE PROBLEMS

1. Calculation practice: Standard error of the mean and approximate confidence intervals for the mean. We will use the same data for systolic blood pressure collected for Calculation Practice Problem 1 in Chapter 3. Here again are the data points:
112, 128, 108, 129, 125, 153, 155, 132, 137
The mean is 131.0 mm Hg and the variance is 254.5.
a. What is s, the standard deviation of these data?
b. What is n, the sample size?
c. Calculate the standard error of the mean.
d. Using the 2SE rule of thumb, calculate an approximate 95% confidence interval for the mean. Provide the lower and upper limits.

2. Examine the times to rigor mortis of the 114 human corpses tabulated in Practice Problem 9 of Chapter 3.
a. What is the standard error of the mean time to rigor mortis?
b. The standard error calculated in part (a) measures the spread of what frequency distribution?
c. What assumption does your calculation in part (a) require?

3. Examine the frequency distribution of gene lengths in the human genome displayed in Figure 4.1-1. Is the population median gene length in the human genome likely to be larger, smaller, or equal to the population mean? Explain.
4. As a general rule, is the spread of the sampling distribution for the sample mean mainly determined by the magnitude of the mean or by the sample size?

5. Seven of the 100 human genes we sampled randomly from the human genome (in Example 4.1) were found to occur on the X chromosome. The sample fraction of genes on the X was thus pˆ = 7/100 = 0.07. For each of the following statements, specify whether it is true or false:
a. pˆ = 0.07 is the fraction of all human genes on the X chromosome.
b. pˆ = 0.07 estimates p, the fraction of all human genes on the X chromosome.
c. pˆ has a sampling distribution representing the frequency distribution of values of pˆ that we might obtain when we randomly sample 100 genes from the human genome.
d. The fraction of all human genes on the X chromosome has a sampling distribution.
e. The standard deviation of the sampling distribution of pˆ is the standard error of pˆ.

6. In a poll of 1641 people carried out in Canada in November 2005, 73% of people surveyed agreed with the statement that "you don't really expect that politicians will keep their election promises once they are in power" (CBC News 2005).
a. What is the parameter being estimated?
b. What is the value of the sample estimate?
c. What is the sample size?
d. The poll also reported that "the results are considered accurate within 2.5 percentage points, 19 times out of 20." Explain what this statement likely refers to.

7. The following data are flash durations, in milliseconds, of a sample of 35 male fireflies of the species Photinus ignitus (Cratsley and Lewis 2003; see Assignment Problem 19 in Chapter 2):
79, 80, 82, 83, 86, 85, 86, 86, 88, 87, 89, 89, 90, 92, 94, 92, 94, 96, 95, 95, 95, 96, 98, 98, 98, 101, 103, 106, 108, 109, 112, 113, 118, 116, 119
a. Estimate the sample mean flash duration. What does this quantity estimate?
b. Is the estimate in part (a) likely to equal the population parameter? Why or why not?
c. Calculate a standard error for your sample estimate.
d. What does the quantity in part (c) measure?
e. Using an approximate method, calculate a rough 95% confidence interval for the population mean.
f. Provide an interpretation for the interval you calculated in part (e).

8. Imagine that the results of a study calculated a sample mean of zero, with a narrow 95% confidence interval for the population mean (modified from Borenstein 1997). The most appropriate conclusion is that (choose one):
a. The population mean is likely to be zero or close to zero.
b. The population mean is probably zero, but there is some chance that it is either slightly less than zero or slightly greater than zero.
c. We can be reasonably certain the mean differs from zero.

9. One of the great discoveries of biology is that organisms have a class of genes called "regulatory genes," whose only job is to regulate the activity of other genes. How many genes does the typical regulatory gene regulate? A study of interaction networks in yeast (S. cerevisiae) came up with the following data for 109 regulatory genes (Guelzim et al. 2002).

Number of genes regulated    Frequency
1                            20
2                            10
3                            7
4                            7
5                            8
6                            8
7                            5
8                            2
9                            4
10                           4
11                           3
12                           4
13                           5
14                           1
15                           2
16                           1
17                           3
18                           2
19                           2
20                           3
22                           3
25                           1
26                           1
28                           1
29                           1
37                           1
Total                        109

a. What type of graph should be used to display these data?
b. What is the estimated mean number of genes regulated by a regulatory gene in the yeast genome?
c. What is the standard error of the mean?
d. Explain what this standard error measures.
e. What assumption are you making in part (c)?
10. Refer to the previous problem (Practice Problem 9).
a. Using an approximate method, provide a rough 95% confidence interval for the population mean.
b. Provide an interpretation of the interval you calculated in part (a).

11. Goldman et al. (1988) analyzed data on 405 patients with white blood cell cancer (chronic myelogenous leukemia) who received bone marrow transplants. They estimated the probability of relapse within 4 years of treatment to be 0.19, with a 95% confidence interval of 0.12 to 0.28. Which of the following statements are true?
a. The population proportion is 0.19.
b. The population proportion is likely to be between 0.12 and 0.28.
c. There is a 95% chance that the population proportion is between 0.12 and 0.28.
d. A population proportion of 0.30 is plausible.

12. An absentminded (and not too clever) scientist friend of yours has just analyzed his data, and he has two numbers—25.4 and 2.54—written on a scrap of paper. He says: "I remember that one of these is the standard deviation of my data and the other is the standard error of the mean, but I can't remember which is which. Can you help?"
a. Which number is the standard deviation and which is the standard error of the mean?
b. What was your friend's sample size?

13. The following is a list of sample means for human adult height, in cm. Each was calculated in the same way from samples taken from the same hypothetical population.
160.5, 162.5, 161.7, 160.2, 163.7, 159.8, 160.6, 161.1
The true mean of the population is 158.7 cm. Use the jargon of estimation to describe the likely type of problem in the sampling process.

14. When planning to obtain a sample from a population of interest, what can you do to make the standard error of the mean smaller?

ASSIGNMENT PROBLEMS

15. A massive survey of sexual attitudes and behavior in Britain between 1999 and 2001 contacted 16,998 households and interviewed 11,161 respondents aged 16–44 years (one per responding household). The frequency distributions of ages of men and women respondents were the same. The following results were reported on the number of heterosexual partners individuals had had over the previous five-year period (Johnson et al. 2001).

                      Men     Women
Sample size, n        4620    6228
Mean                  3.8     2.4
Standard deviation    6.7     4.6

a. What is the standard error of the mean in men? What is it in women? Assume that the sampling was random.
b. Which is a better descriptor of the variation among men in the number of sexual partners, the standard deviation or the standard error? Why?
c. Which is a better descriptor of uncertainty in the estimated mean number of partners in women, the standard deviation or the standard error? Why?
d. A mysterious result of the study is the discrepancy between the mean number of partners of heterosexual men and women. If each sex obtains its partners from the other sex, then the true mean number of heterosexual partners should be identical. Considering aspects of the study design, suggest an explanation for the discrepancy.

16. Our unique random sample of 100 human genes from the human genome (Example 4.1) was found to have a median length of 2150 nucleotides. Specify whether each of the following statements is true or false.
a. The median gene length of all human genes is 2150 nucleotides.
b. The median gene length of all human genes is estimated to be 2150 nucleotides.
c. The sample median has a sampling distribution with a standard error.
d. A random sample of 1000 genes would likely yield an estimate of the median closer to the population median than a random sample of 100 genes.

17. The following figure is from the website of a U.S. national environmental laboratory. It displays sample mean concentrations, with 95% confidence intervals, of three radioactive substances. The text accompanying the figure explained that "the first plotted mean is 2.0 ± 1.1, so there is a 95% chance that the actual result is between 0.9 and 3.1, a 2.5% chance it is less than 0.9, and a 2.5% chance it is greater than 3.1." Is this a correct interpretation of a confidence interval? Explain.

18. Amorphophallus johnsonii is a plant growing in West Africa, and it is better known as a "corpse-flower." Its common name comes from the fact that when it flowers, it gives off a "powerful aroma of rotting fish and faeces" (Beath 1996). The flowers smell this way because their principal pollinators are carrion beetles, who are attracted to such a smell. Beath (1996) observed the number of carrion beetles (Phaeochrous amplus) that arrive per night to flowers of this species. The data are as follows:
51, 45, 61, 76, 11, 117, 7, 132, 52, 149
a. What is the mean and standard deviation of beetles per flower?
b. What is the standard error of this estimate of the mean?
c. Give an approximate 95% confidence interval of the mean. Provide lower and upper limits.
d. If you had been given 25 data points instead of 10, would you expect the mean to be greater, less than, or about the same as the mean of this sample?
e. If you had been given 25 data points instead of 10, would you have expected the standard deviation to be greater, less than, or about the same as this sample?
f. If you had been given 25 data points instead of 10, would you have expected the standard error of the mean to be greater, less than, or about the same as this sample?

19. The following three histograms (A, B, and C) plot information about the number of hours of sleep adult Europeans get per night (Roenneberg 2012). One of them shows the frequency distribution of individual values in a random sample. Another shows the distribution of sample means for samples of size 10 taken from the same population. Another shows the distribution of sample means for samples of size 100.
a. Identify which graph goes with which distribution.
b. What features of these distributions allowed you to distinguish which was which?
c. Estimate by eye the approximate population mean of the number of hours of sleep using the distribution for the data.
d. Estimate by eye the approximate mean of the distributions of sample means.

20. The following figure shows two alternative ways of presenting means with standard error bars in a graph. The data are from Daborn et al. (2002), who showed that elevated expression of the gene Cyp6g1 in Drosophila causes resistance to DDT. Expression levels of the gene (relative to a standard) were measured in 12 resistant strains of Drosophila and 6 susceptible strains. Which graphical method is superior? Explain.

21. Is sleep necessary? To investigate, Lesku et al. (2012) measured the activity patterns of breeding pectoral sandpipers (Calidris melanotos) in the high Arctic in summer, when the sun never sets. The accompanying figure shows the observed percent time that individual males were awake and active in a 2008 field study. The data are on the left.
To the right of the data are the sample mean (filled circle) and error bars for the standard deviation, the standard error of the mean, and a 95% confidence interval for the mean (in no particular order).
a. Which of the error bars indicates the standard deviation?
b. Which error bar indicates the standard error of the mean?
c. Which error bar indicates a 95% confidence interval for the mean?
d. Estimate by eye the smallest plausible value for the mean activity (% time) of male pectoral sandpipers. Using this smallest plausible value, calculate approximately the maximum plausible number of hours (out of 24 hours) that males spend inactive or asleep.

22. How long do you hug somebody? Nagy (2011) measured the duration of spontaneous embraces at the 2008 Summer Olympic Games in Beijing, China. The data are the durations of hugs, in seconds, of athletes immediately after competing in the finals of an event. Hugs were either with their coach, a supporter (e.g., a team member), or a competitor. Descriptive statistics calculated from the data are in the following table. n refers to the sample size.

Relationship    Mean    Standard deviation    n
Coach           3.77    3.96                  77
Supporter       3.16    2.76                  75
Competitor      1.81    1.13                  33

a. According to the values in the table, which relationship group gets the longest hugs, on average, and which gets the briefest hugs? Do the values shown represent parameters or sample estimates? Explain.
b. Using the numbers in the table, calculate the standard error of the mean hug duration for each relationship group. What do these values measure?
c. What assumption(s) about the samples are you making in (b)?
d. Using the numbers in the table, calculate an approximate 95% confidence interval for the mean hug duration when athletes embrace competitors. Provide the lower and upper limits of the confidence interval.
e. In light of your results in (d), consider the most-plausible values for the mean duration of hugs with competitors in the population of athletes. Is 2 seconds among the most plausible values for the population mean hug duration?
f. For which of the relationship groups is the possibility of a 3-second mean hug duration in the population plausible?

23. Pitcher plants of the genus Nepenthes are typically carnivorous, obtaining a great deal of their nutrition from insects that become trapped in the pitcher, die, and decay. N. lowii, a pitcher plant from Borneo, produces a second type of pitcher that attracts tree shrews (Tupaia montana), which provide nutrients by defecating into the pitcher while they feed on a substance secreted by the plant. Based on measurements of 20 plants, Clarke et al. (2009) calculated a 95% confidence interval for the mean fraction of total leaf nitrogen in the plant species derived from tree shrews: 0.57 < μ < 1.0.
a. Does this result imply that individual plants receive between 57% and 100% of their leaf nitrogen from tree shrews? Explain.
b. Is the confidence interval meant to bracket the sample mean or the population mean?
c. Identify two values for the mean shrew fraction of total leaf nitrogen that the analysis suggests are among the most plausible.
d. Identify two values for the mean shrew fraction of total leaf nitrogen that the analysis suggests are less plausible.

24. Hagen et al. (2011) estimated the home range sizes of 4 bumblebees (Bombus) by fitting them with tiny radio transmitters and tracking their positions by plane and ground surveys.
They estimated the mean home range size to be 20.7 ± 11.6 ha, where the number after the ± sign refers to the standard error of the mean.
a. Provide a description for the standard error. What does it measure?
b. What assumption are we making when calculating the standard error of the mean?
c. What would you recommend the researchers do next to reduce the standard error of their estimate of the mean home range size?

25. The following definition of a confidence interval was found on a web page at the National Institute of Standards and Technology. "Confidence intervals are constructed at a confidence level, such as 95%, selected by the user. … It means that if the same population is sampled on numerous occasions and interval estimates are made on each occasion, the resulting intervals would bracket the true population parameter in approximately 95% of the cases." Is this a correct interpretation of the confidence interval? Explain.

26. When a female jewel wasp (Ampulex compressa) encounters a cockroach (Periplaneta americana), she stings and injects neurotoxins into its head that render the insect unable to initiate self-movement but not paralyzed. The wasp then holds the compliant (zombie) cockroach by the antenna and leads it to her nest, where it will become live food for her larval offspring. The following graph (modified from Gal and Libersat 2010) compares the mean self-initiated walking duration of stung and control cockroaches during the first 30 minutes after treatment. The error bars indicate approximate 95% confidence intervals.
a. Estimate the lower and upper limits of the confidence intervals for the control group.
b. Approximate the value of the standard error for the control group.
c. Identify two values of the population mean duration for the control group that are among the most plausible.
d. Identify two values of the population mean duration for the stung group that are less plausible.
e. Identify the main weakness in the construction of the graph. How would you improve it?

INTERLEAF 2 Pseudoreplication

Most statistical techniques assume that the data are a random sample from the population, in which individuals are sampled independently and with equal probability. Unfortunately, many experiments are conducted and analyzed in a way that violates the assumption of independence. For example, consider the problem of how to measure differences between male birdsongs in their attractiveness to females. In some species male birds vary in their song complexity. Some males have songs with multiple notes and phrases, and other males have simpler songs with fewer notes. A classic way to measure song attractiveness is to play a tape recording of a song to a female and watch what she does. (Females sometimes have recognizable courtship behaviors that can be used to indicate their interest.) In one study, researchers recorded the complex song of one male and the relatively simple song of another male, and they played these same two songs to each of 40 different females. The goal of the study was to determine by how much on average females preferred complex songs over simple songs. A confidence interval for the mean attractiveness of the two male songs was calculated based on the responses of the 40 females. The result was a very narrow range of plausible values. But something has gone wrong. The 40 measurements of attractiveness of simple and complex songs were not independent. All females listened to the songs of the same two males.
When females listened to the "complex" song, they were listening to one male, and when they listened to the "simple" song, females were listening to another single male. In reality, the sample size of complex songs was just n = 1, and the sample size for simple songs was n = 1 also. As a result, the study only showed that females liked the song of one of the two males better than that of the other, not that complex songs are more attractive in general. To ensure independence, the study should have recorded the songs of 40 males with complex songs and 40 males with simple songs. Each female should have listened to a unique pair of songs, one simple and one complex.

The birdsong study is an example of pseudoreplication. Pseudoreplication occurs whenever individual measurements that are not independent are analyzed as if they are independent of one another. In the birdsong study, the 40 measurements of females were treated as 40 independent data points, when they were not independent. The problem with pseudoreplication is that measurements obtained from individuals not sampled independently might be more similar to one another than measurements made on individuals sampled independently from the population. "Replication" by itself is good—it refers to the sampling of multiple independent units from a population, which makes it possible to estimate population characteristics and the precision of those estimates. In general, the greater the level of replication, the greater our confidence in our results. But if we analyze non-independent data points as if they were independent, then we are making a false claim about the amount of replication, hence the "pseudo-" in pseudoreplication.

Pseudoreplication is probably the single most common fault in the design and analysis of ecological field experiments. It is at least equally common in many other areas of research. —Stuart Hurlbert (1984)

Most statistical techniques, including almost everything in this book, assume that each data point is independent of the others. Independence is, after all, built into the definition of a random sample. When we assume that data points are independent of each other, we give each data point equal credence and weigh its information as heavily as every other point. If two data points are not independent, though, then treating them as independent makes it seem as if we have more information than we really do. We would be treating the data set as if it were larger than it really is, and as a result we would calculate confidence intervals that were too narrow and P-values (see Section 6.2) that were too small.

Imagine that we wanted to know the blood sugar levels of diabetes patients. An overzealous phlebotomist takes 15 samples from each of 10 patients, yielding a total of 150 measurements. How can we treat these 150 data points? If we threw them all together and analyzed them as if we had 150 independent data points, we would commit the sin of pseudoreplication. If the patients were randomly chosen, then we would have only 10 independent data points. We needn't throw out any of the 15 samples per patient. Rather, for each patient, we could take the average of the 15 measurements and use this as the independent observation. Doing so gives us a more reliable estimate of the blood sugar level of each patient, which can reduce the amount of sampling variation in our estimate compared with an estimate based on only a single measurement per patient.
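A minimal numerical sketch of the phlebotomist example, with invented blood-sugar numbers, shows how pooling the 150 measurements understates the true uncertainty, while averaging within patients gives an honest n = 10 analysis:

```python
import numpy as np

rng = np.random.default_rng(5)

# Invented blood-sugar data: 15 repeat measurements on each of 10 patients.
true_patient_means = rng.normal(100, 15, size=10)   # between-patient variation
measurements = [rng.normal(m, 5, size=15) for m in true_patient_means]

# Wrong: pool all 150 values as if they were independent (pseudoreplication).
pooled = np.concatenate(measurements)
se_wrong = pooled.std(ddof=1) / np.sqrt(len(pooled))

# Right: one summary (the average) per independently sampled patient, n = 10.
patient_means = np.array([m.mean() for m in measurements])
se_right = patient_means.std(ddof=1) / np.sqrt(len(patient_means))

print(se_wrong, se_right)   # the pooled SE greatly overstates our precision
```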
But no matter how many measurements are taken from a patient, the total sample size of this study is still only n = 10. By recognizing that the patients, not the measurements, represent our unit of replication, we can analyze the data appropriately. Pseudoreplication can often be avoided by summarizing the information on each independently sampled individual and using those summaries as the data for the analysis.

Pseudoreplication is often subtle, and it remains a major source of mistakes in the analysis of experiments. The rate of pseudoreplication has been estimated to be one in every eight field studies in ecology (Hurlbert and White 1993, Heffner et al. 1996). When reading the scientific literature, keep in mind the possibility of pseudoreplication. Be on the lookout for features that group individual data points during the sampling process. Watch out for multiple measurements taken on the same individuals or the same experimental unit. If the number of measurements, and not the number of independent individuals, is counted as the sample size in a statistical analysis, there could be a problem.

Suggested reading
Hurlbert, S. H. 1984. Pseudoreplication and the design of ecological field experiments. Ecological Monographs 54: 187–211.

5. Probability

[Chapter-opening photograph: white-breasted nuthatch]

The concept of probability is important in almost every field of science, including biology. Probability is the backbone of data analysis. In Chapter 4, we made statements about probability to quantify the uncertainty of parameter estimates, and we will see even more uses of probability throughout the book. Probability is essential to biology because we almost always look at the natural world by way of a sample, and, as we have seen, chance can play a major role in the properties of samples. In this chapter, we will discuss the basic principles of probability and basic probability calculations. In Chapter 6, we will begin to apply these concepts to data analysis.

5.1 The probability of an event

Imagine that you have 1000 songs on your phone, each of them recorded exactly once. When you push the "shuffle" button, the phone plays a song at random from the list of 1000. The probability that the first song played is your single favorite song is 1/1000, or 0.001. The probability that the first song played is not your favorite song is 999/1000, or 0.999. What exactly do these numbers mean?

The concept of probability rests on the notion of a random trial. A random trial is a process or experiment that has two or more possible outcomes whose occurrence cannot be predicted with certainty. Only one outcome is observed from each repetition of a random trial. In the phone songs example, a random trial consists of pushing the shuffle button once. The specified outcome is "your favorite song is played," which is one of 1000 possible outcomes. Other examples of random trials include
■ Flipping a coin to see if heads or tails comes up,
■ Rolling a pair of dice to see what the sum of their numbers is,
■ Rolling a die 10 times to measure the proportion of times a "6" comes up.

What is probability? To answer this, we need to define the event of interest and the list of all possible outcomes of a random trial. An event is any potential subset of all the possible outcomes. For example, there are six possible outcomes if we roll a six-sided die—the numbers one through six.
The event of interest could be "the result is an even number," "the result is a number greater than three," or even the simple event "the result is four." As the last example shows, an outcome is also an event. However, events can be more complicated subsets of outcomes, so we will define principles of probability mainly in terms of events. The probability of an event is the proportion of all random trials in which the specified event occurs when the same random process is repeated over and over again independently and under the same conditions. In an infinite number of random trials carried out in exactly the same way, the probability of an event is the fraction of the trials in which the event occurs.

The probability of an event is the proportion of times the event would occur if we repeated a random trial over and over again under the same conditions.

Probability ranges between zero and one. A useful shorthand is the following: Pr[A] means "the probability of event A." Thus, if we want to state the probability of "rolling a four" with a six-sided die, then we can write Pr[rolling a four] = 1/6. Because probabilities are proportions, they must always fall between zero and one, inclusive. An event has probability zero if it never happens, and an event has probability one if it always happens.

Flipping coins and rolling dice are not biological processes, but they are simple and familiar and the probabilities are known. Their relevance to biology is nevertheless high because they mimic the process of sampling. Randomly sampling 10 new babies and counting the number that are female is mimicked by flipping a coin 10 times and counting the number of heads (assuming both have probability 1/2). Randomly sampling 100 individuals from a human population and counting the proportion that are left-handed is mimicked by rolling a six-sided die 100 times and counting the proportion of sixes (assuming that a "six" and a "left-handed person" both have probability 1/6). In other words, randomly sampling a population represents a random trial just like rolling a die or flipping a coin. The value of a variable measured on a randomly sampled individual is an outcome of a random trial. The following are therefore also random trials:
■ Randomly sampling an individual from a population of sockeye salmon to see what its weight is, and
■ Randomly sampling 1000 fetuses from clinics in a large North American city to measure the proportion that have Down syndrome.

5.2 Venn diagrams

One useful way to think about the probabilities of events is with a graphical tool called a Venn diagram. The area of the diagram represents all possible outcomes of a random trial, and we can represent various events as areas within the diagram. The probability of an event is proportional to the area it occupies in the diagram. Figure 5.2-1 shows a Venn diagram for one roll of a fair six-sided die. The six possible outcomes fill the diagram, indicating that these are all possible results. The area of the box for each outcome is the same, showing that these outcomes are equally probable in this particular example. They each contain 1/6 of the area of the Venn diagram.

FIGURE 5.2-1 A Venn diagram for the possible outcomes of a roll of a six-sided die. The area corresponding to the event "the result is a four" is highlighted in red.

We can use Venn diagrams to show more complicated events as well. In Figure 5.2-2, for example, the event "the result is greater than two" is shown.

FIGURE 5.2-2 A Venn diagram showing the event "the result is greater than two" highlighted in red. The probability of this event is 4/6 = 2/3, equal to the area of the red region.
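Both views of probability (counting equally likely outcomes, and the long-run proportion of repeated trials) can be demonstrated in a few lines of Python. A minimal sketch for the die-rolling example; the 100,000-trial count is an arbitrary choice:

```python
import random
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]                  # all outcomes of one fair roll

# An event is a subset of outcomes; with equally likely outcomes its
# probability is the fraction of outcomes it contains.
event = [y for y in outcomes if y > 2]
print(Fraction(len(event), len(outcomes)))     # 2/3

# Long-run interpretation: the proportion of repeated random trials in
# which the event occurs approaches its probability.
rolls = [random.randint(1, 6) for _ in range(100_000)]
print(sum(r > 2 for r in rolls) / len(rolls))  # close to 0.667
```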
5.3 Mutually exclusive events

When two events cannot both occur at the same time, we say that they are mutually exclusive. For example, a single die rolled once cannot yield both a one and a six. The events "one" and "six" are mutually exclusive events for this random trial.

Two events are mutually exclusive if they cannot both occur at the same time.

Sometimes physical constraints explain why certain events are mutually exclusive. It is impossible, for example, for more than one number to result from a single roll of a die. Sometimes events are mutually exclusive because they never occur simultaneously in nature. For example, "has teeth" and "has feathers" are mutually exclusive events when we randomly sample a single living animal species from a list of all existing animal species, because no living animals have both teeth and feathers. If we sample a living animal species at random, the probability that it has both teeth and feathers is zero, although plenty of animals have teeth or feathers. In mathematical terms, two events A and B are mutually exclusive if Pr[A and B] = 0. Here, Pr[A and B] means the probability that both A and B occur.

5.4 Probability distributions

A probability distribution describes the probabilities of each of the possible outcomes of a random trial. Some probability distributions can be described mathematically, while others are just a list of the possible outcomes and their probabilities. The precise meaning of a probability distribution depends on whether the variable is discrete or continuous.

A probability distribution is a list of the probabilities of all mutually exclusive outcomes of a random trial.

Discrete probability distributions

A discrete numerical variable is measured in indivisible units. Categorical variables are discrete, as are many numerical variables. A discrete probability distribution gives the probability of each possible value of a discrete variable. Categorical and discrete numerical variables have discrete probability distributions. For example, the probability distribution of outcomes for the single roll of a fair die is given in Figure 5.4-1. In this case, all integers between one and six are equally probable outcomes (probability = 1/6 = 0.167). The histogram in Figure 5.4-2 shows the probability distribution for the sum of the two numbers resulting from a roll of two dice. Here the different outcomes are not equally probable.

FIGURE 5.4-1 The probability distribution of outcomes resulting from the roll of a single six-sided fair die. The probability of each possible outcome is 1/6 = 0.167.

FIGURE 5.4-2 The probability distribution for the sum of the numbers resulting from rolling two six-sided fair dice.

Because all possible outcomes are taken into account, the sum of all probabilities in a probability distribution must add to one. This is because a probability distribution has to describe all possible outcomes, and the probability that some outcome occurs from a random trial is one.
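The two-dice distribution in Figure 5.4-2 can be reproduced by brute-force enumeration of the 36 equally likely outcomes, which also confirms that the probabilities sum to one. A minimal sketch:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely outcomes of rolling two fair dice and
# tally the probability of each possible sum (compare Figure 5.4-2).
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
dist = {total: Fraction(count, 36) for total, count in sorted(counts.items())}
print(dist)                # e.g., Pr[sum = 7] = 6/36 = 1/6
print(sum(dist.values()))  # the probabilities of all outcomes add to 1
```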
Continuous probability distributions

Unlike discrete variables, continuous numerical variables can take on any real number value within some range. Between any two values of a continuous variable (call the variable Y), an infinite number of other values are possible. We describe a continuous probability distribution with a curve whose height is probability density. A probability density allows us to describe the probability of any range of values for Y. The normal distribution, first introduced in Section 1.4, is a continuous probability distribution. It is bell-shaped like the curve shown in Figure 5.4-3. We'll see much more of this distribution in Chapter 10 and later.

FIGURE 5.4-3 A normal distribution.

Imagine that we sample a random number from this distribution—let's call the number Y. Unlike discrete probability distributions, the height of a continuous probability curve at the value Y = 2.4 does not give the probability of obtaining Y = 2.4. Because a continuous probability distribution covers an infinite number of possible outcomes, the probability of obtaining any specific outcome is infinitesimally small and therefore zero. With continuous probability distributions, such as the normal curve, it makes more sense to talk about the probability of obtaining a value of Y within some range. The probability of obtaining a value of Y within some range is indicated by the area under the curve. For example, the probability that a single randomly chosen individual has a measurement lying between the two numbers a and b equals the area under the curve between a and b (Figure 5.4-4).

FIGURE 5.4-4 The probability that a randomly chosen Y-measurement lies between a and b is the area under the probability density curve between a and b (left panel). In the right panel we approximate the same area using discrete bars.

The area under the curve between a and b is calculated by integrating the probability density function between the values a and b. Integration is the continuous analog of summation, so integrating the probability density function from a to b is analogous to adding together the areas of very many narrow rectangles under the curve between a and b (see the right panel in Figure 5.4-4). For any probability distribution, the area under the entire curve of a continuous probability density function is always equal to one. Finally, because the probability of any individual Y-value is infinitesimally small under a continuous probability distribution, Pr[a ≤ Y ≤ b] is the same as Pr[a < Y < b].
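In software, areas under a continuous probability density are usually obtained from the cumulative distribution function rather than by hand integration. A sketch using the standard normal curve; this assumes scipy is available, and the interval from 1 to 2 is an arbitrary example:

```python
from scipy.stats import norm

# For a continuous distribution, probability is area under the density
# curve. Example: a standard normal variable Y and the range 1 < Y < 2.
a, b = 1.0, 2.0
print(norm.cdf(b) - norm.cdf(a))   # Pr[1 < Y < 2], about 0.136

# The area under the entire curve equals one.
print(norm.cdf(float("inf")) - norm.cdf(float("-inf")))   # 1.0
```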
FIGURE 5.5-1 The probability of O– or O+ is equal to the probability of O– plus the probability of O+, because the two events are mutually exclusive.

This additive property of the probabilities of mutually exclusive events is called the addition rule.

The addition rule: If two events A and B are mutually exclusive, then Pr[A or B] = Pr[A] + Pr[B].

The addition rule extends to more than two events as long as they are all mutually exclusive. Let’s say that your blood type is B–, and we want to know the probability that you could safely donate blood to a randomly sampled American in an emergency. B– blood can be safely donated to anyone who is B+, B–, AB+, or AB–. These four possibilities are mutually exclusive, because a randomly sampled American cannot have more than one of these blood types at the same time. Thus, the probability that an American is able to receive your B– blood safely can be calculated as follows using the addition rule: Pr[B+ or B– or AB+ or AB–] = Pr[B+] + Pr[B–] + Pr[AB+] + Pr[AB–] = 0.085 + 0.015 + 0.034 + 0.006 = 0.140.

The addition rule is about “or” statements. If two events are mutually exclusive and we want to know the probability of being either one or the other, we can use the addition rule. This property is vital to analyzing data because it allows us to calculate the probabilities of different outcomes of random sampling when they are mutually exclusive.

The probabilities of all possible mutually exclusive outcomes add to one

The probabilities of all possible mutually exclusive outcomes of a random trial must add to one. With blood type, for example, there are eight possible outcomes. Therefore, the sum of the probabilities of all outcomes is Pr[O+ or O− or A+ or A− or B+ or B− or AB+ or AB−] = Pr[O+] + Pr[O−] + Pr[A+] + Pr[A−] + Pr[B+] + Pr[B−] + Pr[AB+] + Pr[AB−] = 1.

This means that the probability of an outcome or event not occurring is simply one minus the probability that it occurs. For example, the probability that you do not get O+ when you type the blood of a randomly sampled American is Pr[not O+] = 1 − Pr[O+] = 1 − 0.374 = 0.626. This calculation is much easier than summing the probabilities of all outcomes other than O+.

The probability of an event not occurring is one minus the probability that it occurs. Pr[not A] = 1 − Pr[A].

The general addition rule

Not all events, though, are mutually exclusive. It is possible, for example, for the ABO blood type of a randomly sampled American to be O and his or her Rh factor to be positive (+). If the two events are not mutually exclusive, how do we calculate the probability of either one or the other event occurring? In mathematical notation, a general addition rule can be written as Pr[A or B] = Pr[A] + Pr[B] − Pr[A and B]. This calculates the probability that either A or B (or both) occur. When events A and B are mutually exclusive, Pr[A and B] = 0, so the general addition rule reduces to the addition rule for mutually exclusive events introduced previously. The reason we have to subtract the probability of both A and B occurring is illustrated in Figure 5.5-2. If we do not subtract the probability of both A and B occurring, then we will double-count those outcomes where both A and B occur.

FIGURE 5.5-2 The general addition rule. Pr[A and B] is subtracted from Pr[A] + Pr[B] so that the outcomes where both A and B occur (the tan shaded areas) are not counted twice.

So, for example, the probability that a randomly chosen American has either the most common ABO type (O) or the most common Rh factor (+) is Pr[O or +] = Pr[O] + Pr[+] − Pr[O and +] = 0.440 + 0.850 − 0.374 = 0.916.
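These addition-rule calculations are simple enough to check in a few lines of code. The sketch below is our own illustration (not part of the original text), written in plain Python with the probabilities of Table 5.5-1 stored in a dictionary; it reproduces the three computations above.

```python
# Probabilities of the eight mutually exclusive blood types (Table 5.5-1).
blood_type_prob = {
    "O+": 0.374, "O-": 0.066, "A+": 0.357, "A-": 0.063,
    "B+": 0.085, "B-": 0.015, "AB+": 0.034, "AB-": 0.006,
}

# Addition rule: for mutually exclusive events, "or" means a simple sum.
pr_type_O = blood_type_prob["O+"] + blood_type_prob["O-"]
print(round(pr_type_O, 3))                      # 0.440

# The rule extends to any number of mutually exclusive events,
# such as the blood types that can safely receive B- blood.
pr_can_receive = sum(blood_type_prob[t] for t in ("B+", "B-", "AB+", "AB-"))
print(round(pr_can_receive, 3))                 # 0.140

# All mutually exclusive outcomes sum to one, so "not" is one minus "is".
print(round(1 - blood_type_prob["O+"], 3))      # 0.626
```

The general addition rule could be checked the same way; the only extra step is subtracting Pr[O and +] = 0.374 so that the O+ individuals are not counted twice.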
5.6 Independence and the multiplication rule

Science is the study of patterns, and patterns are generated by relationships between events. Men are more likely to be tall, more likely to have a beard, more likely to die young, and more likely to go to prison than women. In other words, height, beardedness, age at death, and criminal conduct are not independent of sex in the human population. Sometimes, though, the chance of one event occurring does not depend on another event. If we roll two dice, for example, the number on one die does not affect the number on the other die. If knowing one event gives us no information about another event, then these two events are called independent. Two events are independent if the occurrence of one does not in any way inform us about the probability that the other will also occur. When rolling the same fair die twice in a row, for example, the probability that the first roll gives a three is 1/6, as we saw previously: Pr[first roll is three] = 1/6.

Two events are independent if the occurrence of one does not inform us about the probability that the second will occur.

What is the probability that the next roll will also be a three? The probability of rolling a three on the second roll is still 1/6, regardless of whether the first roll was a three or not. Because the outcome of the first roll does not give any information about the probability of rolling a three on the second roll, we can say that the two events are independent (Figure 5.6-1).

FIGURE 5.6-1 A Venn diagram for all the possible outcomes of rolling two 6-sided dice. The first digit of each pair shows the result of the roll of the first die, and the second number shows the result of the roll of the second die. Rolling a three on the first roll is shown in the blue row. The probability of rolling a three on the second roll is shown in the green column and is the same (1/6) regardless of the result of the first roll.

When the occurrence of one event provides at least some information about the results of another event, then the two events are dependent.

Multiplication rule

When two events are independent, then the probability that they both occur is the probability of the first event multiplied by the probability of the second event. This is called the multiplication rule. When we analyze data, we use this multiplication rule to determine what to expect when two variables are independent.

The multiplication rule: If two events A and B are independent, then Pr[A and B] = Pr[A] × Pr[B].

We can see the basis of the multiplication rule in Figure 5.6-1. The area of the Venn diagram that corresponds to “rolling a three on the first die” and “rolling a three on the second die” is the region of overlap between the blue and green areas. Because the two events are independent, the area of this overlap zone is just the probability of being in the blue times the probability of being in the green: Pr[(first roll is a three) and (second roll is a three)] = Pr[first roll is a three] × Pr[second roll is a three] = 1/6 × 1/6 = 1/36.

The multiplication rule pertains to combinations with “and”—that is, that both events occur. If we want to know the probability of this and that occurring, and if the two events are independent, we can multiply the probabilities of each to get the probability of both occurring. Example 5.6A applies the multiplication rule to a study about smoking and high blood pressure.
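First, though, the dice calculation can be verified by brute force. The short sketch below (our illustration, not from the text) enumerates all 36 equally likely outcomes of two rolls and confirms that the fraction in which both dice show a three equals 1/6 × 1/6.

```python
from itertools import product

# All 36 equally likely outcomes of rolling two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

# Outcomes in which the first and the second roll are both a three.
both_threes = [pair for pair in outcomes if pair == (3, 3)]

print(len(both_threes) / len(outcomes))   # 1/36 = 0.02777...
print((1 / 6) * (1 / 6))                  # same value, by the multiplication rule
```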
EXAMPLE 5.6A Smoking and high blood pressure

Both smoking and high blood pressure are risk factors for strokes and other vascular diseases. In the United States, approximately 17% of adults smoke and about 22% have high blood pressure. Research has shown that high blood pressure is not associated with smoking; that is, they seem to be independent of each other (Liang et al. 2001). What is the probability that a randomly chosen American adult has both of these risk factors? Because these two events are independent, the probability of a randomly sampled individual both “smoking” and “having high blood pressure” is the probability of smoking times the probability of high blood pressure: Pr[smoking and high blood pressure] = Pr[smoking] × Pr[high blood pressure] = 0.17 × 0.22 = 0.037. Therefore, 3.7% of adult Americans will have both of these risk factors for strokes. This calculation is shown geometrically in the Venn diagram in Figure 5.6-2.

FIGURE 5.6-2 Venn diagram for the two independent factors smoking and high blood pressure. The probability of having both risk factors is proportional to the area of the rectangle in the bottom left corner.

“And” versus “or”

Probability statements involving “and” or “or” are common enough, and confusing enough, that it is worth summarizing them together:
■ The probability of A or B involves addition. That is, Pr[A or B] = Pr[A] + Pr[B] if the two events A and B are mutually exclusive.
■ The probability of A and B involves multiplication. That is, Pr[A and B] = Pr[A] × Pr[B] if A and B are independent.
What may be confusing is that the statement involving “and” requires multiplication, not addition.

Independence of more than two events

The multiplication rule also applies to more than two events, as Example 5.6B demonstrates. If several events are all independent, then the probability of all events occurring is the product of the probabilities that each one occurs.

EXAMPLE 5.6B Mendel’s peas

Like blue eyes in humans, yellow pod color in peas is a recessive trait. That is, pea pods are yellow only if both copies of the gene code for yellow. A plant having only one yellow copy and one green copy (a “heterozygote”) has green pods just like the pods of plants having two green copies of the gene (a green “homozygote”). Gregor Mendel devised a method to determine whether a green plant was a heterozygote or a homozygote. He crossed the test plant to itself and assessed the pod color of 10 randomly chosen offspring. If all 10 were green, he inferred the plant was a homozygote, but if even one offspring was yellow, the test plant was classified as a heterozygote. However, he might have missed some heterozygotes, if by chance not a single yellow offspring was chosen. What is the chance of missing a heterozygote by Mendel’s method? If the test plant is a homozygote, every offspring is green. If the test plant is a heterozygote, on the other hand, the chance of an offspring being green is 3/4 and the chance of it being yellow is only 1/4. What is the chance that all 10 offspring from a heterozygote test plant are green? Mendel didn’t carry out these calculations, but we can use our rules of probability to figure out the reliability of his approach. The chance that any one of a heterozygote’s offspring is green is 3/4. Because the genotype of each offspring is independent of the genotypes of other offspring, the probability that all 10 are green can be calculated using the multiplication rule.
Pr[all 10 green] = Pr[first is green] × Pr[second is green] × Pr[third is green] × … = 3/4 × 3/4 × 3/4 × … = (3/4)^10 = 0.056. Thus, Mendel likely misidentified about 5.6% of heterozygous individuals. On the other hand, his method correctly identified heterozygotes with probability (1 – 0.056) = 0.944.

Probability trees

A probability tree is a diagram that can be used to calculate the probabilities of combinations of events resulting from multiple random trials. We show how to use probability trees with Example 5.7.

EXAMPLE 5.7 Sex and birth order

Some couples planning a new family would prefer to have at least one child of each sex. The probability that a couple’s first child is a boy is 0.512. In the absence of technological intervention, the probability that their second child is a boy is independent of the sex of their first child, and so remains 0.512. Imagine that you are helping a new couple with their planning. If the couple plans to have only two children, what is the probability of getting one child of each sex?

This question requires that we know the probabilities of all mutually exclusive values of two separate variables. The first variable is “the sex of the first child.” The second variable is “the sex of the second child.” We can start building a probability tree by considering the two variables in sequence. Let’s start with the sex of the first child. Two mutually exclusive outcomes are possible—namely, “boy” and “girl”—which we list vertically, one below the other (see figure at right). We then draw arrows from a single point on the left to both possible outcomes. Along each arrow we write the probability of occurrence of each outcome (0.512 for “boy” and 0.488 for “girl”). Now, we list all possible values for the second variable, but we do so separately for each possible value of the first variable. For example, for the value “boy” for the first child, we list both possible values (i.e., “boy” and “girl”) for the sex of the second child. Next, we draw arrows originating from the value “boy” for the first variable to both possible values for the second variable. Then we write the probability of each value for the second variable along each arrow. We repeat this process for the case when “girl” is the value of the first variable. The resulting probability tree is shown in Figure 5.7-1.

FIGURE 5.7-1 A probability tree for all possible values of a two-child family.

At this point, we should check that our probabilities are written down correctly. For instance, the probabilities along all arrows originating from a single point must sum to one (within rounding error) because they represent all the mutually exclusive possibilities. If they don’t sum to one, we’ve forgotten to include some possibilities or we’ve written down the probabilities incorrectly. With a probability tree, we can calculate the probability of every possible sequence of values of the two variables. A sequence of values is represented by a path along the arrows of the tree that begins at the root at the far left and ends at one of the branch tips on the right. The probability of a given sequence is calculated by multiplying all of the probabilities along the path taken from the root to the tip. For example, the sequence “boy then girl” in Figure 5.7-1 has a probability of 0.512 × 0.488 = 0.250. On our probability tree, we usually list the probabilities of each sequence of values in a column to the right of the tree tips, as shown in Figure 5.7-1. Each tree tip defines a unique and mutually exclusive sequence of events.
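The bookkeeping that a probability tree does on paper can also be done by enumerating root-to-tip paths in code. The following sketch is our own illustration of the tree in Figure 5.7-1: it multiplies the probabilities along each of the four paths, checks that the tips sum to one, and adds the two mutually exclusive paths that give one child of each sex.

```python
from itertools import product

# Probability of each sex for any one child; births are independent.
sex_prob = {"boy": 0.512, "girl": 0.488}

# Each root-to-tip path is one sequence of two births; its probability
# is the product of the probabilities along the path (multiplication rule).
path_prob = {
    (first, second): sex_prob[first] * sex_prob[second]
    for first, second in product(sex_prob, repeat=2)
}

# The tips are mutually exclusive and exhaustive, so they must sum to one.
assert abs(sum(path_prob.values()) - 1.0) < 1e-9

# Addition rule: sum the mutually exclusive paths with one child of each sex.
pr_one_of_each = path_prob[("boy", "girl")] + path_prob[("girl", "boy")]
print(round(pr_one_of_each, 3))   # 0.500
```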
Check Figure 5.7-1 (or any probability tree) to make sure that the probabilities of all possible sequences add to one. If they don’t add to one (within rounding error), then something has gone wrong in the construction of the tree. What is the probability of having one child of each sex in a family of two children? According to the probability tree, two of the four possible sequences result in the birth of one boy and one girl. In the first sequence, the boy is born first, followed by the girl, whereas in the second sequence, the girl is born first and the boy is born second. These two different sequences are mutually exclusive, and we are looking for the probability of either the first or the second sequence. By the addition rule, therefore, the probability of getting exactly one boy and one girl when having two children is the sum of the probabilities of the two alternative sequences leading to this event: 0.250 + 0.250 = 0.500.

We could also use the probability tree in Figure 5.7-1 to calculate probabilities of the following events:
■ The probability that at least one girl is born,
■ The probability that at least one boy is born, and
■ The probability that both children are the same sex.
Calculate these probabilities yourself to test your understanding. It is not essential to use probability trees when calculating the probabilities of sequences of events, but they are a helpful tool for making sure that you have accounted for all of the possibilities.

Dependent events

Independent events are mathematically convenient, but when the probability of one event depends on another, things get interesting. Much of science involves identifying variables that are associated. Sex determination is more exotic in many insects than in humans. In many species, the mother can alter the relative numbers of male and female offspring depending on the local environment. In this case, “sex of offspring” and “environment” are dependent events, as Example 5.8 demonstrates.

EXAMPLE 5.8 Is this meat taken?

The jewel wasp, Nasonia vitripennis, is a parasite, laying its eggs on its host, the pupae of flies. The larval Nasonia hatch inside the pupal case, feed on the live host, and grow until they emerge as adults from the now dead, emaciated host. Emerging males and females, possibly brother and sister, mate on the spot. Nasonia females have a remarkable ability to manipulate the sex of the eggs that they lay. When a female finds a fresh host that has not been previously parasitized, she lays mainly female eggs, producing only the few sons needed to fertilize all her daughters. But if the host has already been parasitized by a previous female, the next female responds by producing a higher proportion of sons. Thus, the state of the host encountered by a female and the sex of an egg laid are dependent variables (Werren 1980).

Suppose that, when a given Nasonia female finds a host, there is a probability of 0.20 that the host already has eggs, laid by a previous female wasp. Presume that the female can detect previous infections without error. If the host is unparasitized, the female lays a male egg with probability 0.05 and a female egg with probability 0.95. If the host already has eggs, then the female lays a male egg with probability 0.90 and a female egg with probability 0.10. Figure 5.8-1 shows a mosaic plot of these probabilities.

FIGURE 5.8-1 A mosaic plot showing that the sex of eggs laid by Nasonia females depends on the state of the host.
Based on Figure 5.8-1, the events “host is previously parasitized” and “producing a male egg” are dependent. The probability of laying a male egg changes depending on whether the host has been previously parasitized. Suppose we want to know the probability that a new, randomly chosen egg is male. We can approach this question using a probability tree like the one shown in Figure 5.8-2. FIGURE 5.8-2 A probability tree for the sex of offspring laid by Nasonia according to whether the host has been previously parasitized. According to the probability tree, there are exactly two paths that yield a male egg. In the first, the host is already parasitized and the mother lays a male egg. This path has probability Pr[host already parasitized and sex of new egg is male]=0.20×0.90=0.18. In the second path, the host is not previously parasitized and the female lays a male egg. This second path has probability Pr[host not already parasitized and sex of new egg is male]=0.80×0.05=0.04. The probability of a new egg being male is the sum of the probabilities of these two mutually exclusive paths: Pr[male]=0.18+0.04=0.22. The probability of an egg being male in this population is 0.22. See Figure 5.8-3. FIGURE 5.8-3 A probability tree for the sex of an egg laid by Nasonia. The probability tree shows that the event “sex of new egg is male” depends on whether the host encountered by a mother has been previously parasitized. Does this mean that the events “host already parasitized” and “sex of new egg is male” are not independent? One way to confirm this is via the multiplication rule, which applies only to independent events. The probability that “the host already had been parasitized and the sex of the new egg is male” is 0.18. This is not what we would have expected assuming independence, though. If we multiply the probability that the new egg is a male (0.22, as we just calculated) and the probability that a host is already parasitized (0.20), we get 0.22 × 0.20 = 0.044, which is different from the actual probability of these two events (0.18). Based on the definition of “independence,” then, these two events are not independent. Conditional probability and Bayes’ theorem If we want to know the chance of an event, we need to take account of all existing information that might affect its outcome. If we want to know the probability that we will see an elephant on our afternoon stroll, for example, we would get a different answer depending on whether our walk was in the Serengeti or downtown Manhattan. The algebra of conditional probability lets us hone our statements about the chances of random events in the context of extra information. Conditional probability Conditional probability is the probability of an event given that another event occurs. The conditional probability of an event is the probability of that event occurring given that a condition is met. In Example 5.8, the conditional probability that a jewel wasp will lay a male egg is 0.90 given that the host that she is laying on already has wasp eggs (i.e., has already been parasitized). Confirm this for yourself by looking at Figure 5.8-2 again. We write conditional probability in the following way: Pr[new egg is male|host is previously parasitized]=0.90. More generally, Pr[event | condition] represents the probability that the event will happen given that the condition is met. 
The vertical bar in the middle of this expression is a symbol that means “given that” or “when the following condition is met.” (Be careful not to confuse it with a division sign.) The mosaic plot in Figure 5.8-1 illustrates the meaning of this conditional probability. Ninety percent of the area corresponding to “host parasitized” represents the cases when the offspring is male, with the remaining 10% being females. The probability of a male is different under the condition “host parasitized” than under the condition “host not parasitized.”

Conditional probability has many important applications. If we want to know the overall probability of a particular event, we sum its probability across every possible condition, weighted by the probability of that condition. This is known as the law of total probability. According to the law of total probability, the probability of an event, A, is Pr[A] = Σ Pr[B] Pr[A | B], where the sum is taken over all possible mutually exclusive values B of the condition. One way of thinking about this formula is that it gives the weighted average probability of A over all possible mutually exclusive conditions. The mosaic plot in Figure 5.8-1 makes it possible to visualize this, too. The probability of being male is obtained by adding the two blue areas, one for the condition when the host is already parasitized and the other for when the host is not already parasitized. The width of these boxes is proportional to the probability of the condition; the height is proportional to Pr[male | host condition]. By multiplying the width by the height of each box we find its area (its probability), and by adding all such boxes together we find the total probability of males.

To calculate the probability that a new egg is a male, we must consider two possible conditions: (1) the host is already parasitized and (2) the host is not parasitized. Thus, we’ll have two terms on the right side of our equation: Pr[egg is male] = Pr[host already parasitized] Pr[egg is male | host already parasitized] + Pr[host not parasitized] Pr[egg is male | host not parasitized] = (0.20 × 0.90) + (0.80 × 0.05) = 0.22. This is the same answer that we got from the probability tree, but now we can see how it can be derived from statements of conditional probability.

The general multiplication rule

With conditional probability statements, we can find the probability of a combination of two events even if they are not independent. When two events are not independent, the probability that both occur can be found by multiplying the probability of one event by the conditional probability of the second event, given that the first has occurred. This is the general multiplication rule.

The general multiplication rule finds the probability that both of two events occur, even if the two are dependent: Pr[A and B] = Pr[A] Pr[B | A].

This rule makes sense if we think it through. For two events (A and B) to occur, event A must occur. By definition, this happens with probability Pr[A]. Now that we know A has occurred, the probability that B also occurs is Pr[B | A]. Multiplying these together gives us the probability of both A and B occurring. It doesn’t matter which event we label A and which we label B. The reverse is also true; that is, Pr[A and B] = Pr[B] Pr[A | B].
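The law of total probability and the general multiplication rule are both just weighted products and sums, so they translate directly into code. The sketch below (our own, using the probabilities assumed in Example 5.8) computes Pr[host already parasitized and new egg is male] and Pr[male]; the next paragraph works through the same multiplication in prose.

```python
# Probabilities assumed in Example 5.8.
pr_condition = {"parasitized": 0.20, "not parasitized": 0.80}
pr_male_given = {"parasitized": 0.90, "not parasitized": 0.05}

# General multiplication rule: Pr[B and A] = Pr[B] * Pr[A | B].
pr_parasitized_and_male = pr_condition["parasitized"] * pr_male_given["parasitized"]
print(round(pr_parasitized_and_male, 2))   # 0.18

# Law of total probability: Pr[A] = sum over conditions B of Pr[B] * Pr[A | B].
pr_male = sum(pr_condition[b] * pr_male_given[b] for b in pr_condition)
print(round(pr_male, 2))                   # 0.22
```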
With the jewel wasps, for example, if we wanted to know the probability that a host had already been parasitized and that the mother wasp laid a male egg, we would multiply the probability that it had been parasitized (0.2) times the probability of a male egg given that the host was already parasitized (0.9), to get 0.18. We can see the same probability by following the appropriate path (the top one) through the probability tree in Figure 5.8-2. If A and B are independent, then having information about A gives no information about B, and therefore Pr[B | A] = Pr[B]. That is, the general multiplication rule reduces to the multiplication rule when the events are independent.

Sampling without replacement

One common use of conditional probability is sampling without replacement. This process occurs when the specific outcome of one random trial eliminates or depletes that outcome from the possibilities and so changes the probability distribution of values for subsequent random trials. As a simple example, consider drawing cards randomly from a fair card deck in which the 52 ordinary cards have been shuffled and so are in random order. What is the probability of drawing three cards in the precise sequence “ace-2-3,” ignoring card suit, without returning the cards to the deck? The probability that the first card drawn is an ace is 4/52, because there are four aces out of the 52 cards. The key novelty is that the outcome of the first draw changes the probabilities of outcomes for later draws if the card is not returned to the deck. For example, if the first card is an ace, then the probability of a 2 in the next draw is changed because there are now only 51 cards in the deck. The chance that the second card is a 2 is now 4/51. And if we have already taken an ace and a 2 from the deck, the probability that the third card is a 3 is 4/50, because there are 50 cards left and four of them are 3’s. So the probability that the first three draws are in the sequence ace-2-3 is (4/52) × (4/51) × (4/50).

In contrast, when sampling with replacement, the sampled individual is not removed from the population after sampling. In this case, the frequencies of possible outcomes in the population are not changed by successive samples. When sampling populations for biological study, we usually choose populations that are large enough that the sampling of each individual doesn’t change the probability distribution of possible values in the individuals that remain. We assume that the effects of depletion are so slight that they don’t matter. This will not always be the case, however.

Bayes’ theorem

One powerful mathematical relationship about conditional probability is Bayes’ theorem. According to Bayes’ theorem, for two events A and B, Pr[A | B] = Pr[B | A] Pr[A] / Pr[B].

Bayes’ theorem may seem rather complicated, but it can be derived from the general multiplication rule. Because Pr[A and B] = Pr[B] Pr[A | B] and Pr[A and B] = Pr[A] Pr[B | A], it is also true that Pr[B] Pr[A | B] = Pr[A] Pr[B | A]. Dividing both sides by Pr[B] gives Bayes’ theorem. Example 5.9 applies Bayes’ theorem to the detection of Down syndrome.

EXAMPLE 5.9 Detection of Down syndrome

Down syndrome (DS) is a chromosomal condition that occurs in about one in 1000 pregnancies. The most accurate test for DS in wide use requires amniocentesis, which unfortunately carries a risk of miscarriage (about one in 200). It would be better to have an accurate test of DS without the risks.
One such test in common use is called the triple test, which screens for levels of three hormones in maternal blood at around 16 weeks of pregnancy. The triple test is not perfect, however. It does not always correctly identify a fetus with DS (an error called a false negative), and sometimes it incorrectly identifies a fetus with a normal set of chromosomes as DS (an error called a false positive). Under normal conditions, the detection rate of the triple test (i.e., the probability that a fetus with DS will be correctly scored as having DS) is 0.60. The false-positive rate (i.e., the probability that a test would say incorrectly that a normal fetus had DS) is 0.05 (Newberger 2000). Most people’s intuition is that these numbers are acceptable. Based on the probabilities given, the triple test would seem to be right most of the time. But, if the test on a randomly chosen fetus gives a positive result (i.e., it indicates that the fetus has DS), what is the probability that this fetus actually has DS? Make a guess at the answer before we work it through.

To address this question, we need Bayes’ theorem. We want to know a conditional probability—the probability that a fetus has DS given that its triple test showed a positive result. In other words, we want to know Pr[DS | positive result]. Using Bayes’ theorem, Pr[DS | positive result] = Pr[positive result | DS] Pr[DS] / Pr[positive result].

We’ve been given Pr[positive result | DS] and Pr[DS], the two factors in the numerator, but we haven’t been given Pr[positive result], the term in the denominator. We can figure out the probability of a positive result, though, by using the law of total probability introduced earlier in this section. That is, we can sum over all the possibilities to find the probability of a positive result: Pr[positive result] = (Pr[positive result | DS] Pr[DS]) + (Pr[positive result | no DS] Pr[no DS]) = (0.60 × 0.001) + [0.05 × (1 − 0.001)] = 0.05055.

The probability of something not occurring is equal to one minus the probability of it occurring, so the probability that a randomly chosen fetus does not have DS is one minus the probability that it has DS. As given at the start of this example, Pr[DS] = 0.001, so Pr[no DS] = 1 − 0.001 = 0.999 in the preceding equation. Now, returning to Bayes’ theorem, we can find the answer to our question: Pr[DS | positive result] = (0.60 × 0.001) / 0.05055 = 0.012. There is a very low probability (i.e., 1.2%) that a fetus with a positive score on the triple test actually has DS!

Many people find it more intuitive to think in terms of numbers rather than probabilities for these kinds of calculations. For every million fetuses tested, 1000 will have DS, and 999,000 will not. Of those 1000, 60% or 600 will test positive. Of the 999,000, 5% or 49,950 will test false-positive. Out of a million tests, therefore, there are 600 + 49,950 = 50,550 positive results, only 600 of which are true positives. The 600 true positives divided by the 50,550 total positives is 1.2%, the same answer as we got before. DS babies have a high probability of being detected, but they are a very small fraction of all babies. Thus, the true positive results get swamped by the false positives. This high false-positive ratio is not unusual. Many diagnostic tools have high proportions of false positives among the positive cases. In this case, erring on the side of caution is appropriate because, when the triple test returns a positive result, it can be checked by amniocentesis.
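This calculation is easy to wrap in a small function, which also makes it painless to explore how the answer changes under different assumptions. The sketch below is our own; the second call uses hypothetical rates (a detection rate of 0.99 and a false-positive rate of 0.01, not figures from the text) to show that even a much better test mostly returns false positives when the condition is rare.

```python
def pr_condition_given_positive(prevalence, detection_rate, false_positive_rate):
    """Bayes' theorem: Pr[condition | positive test result]."""
    # Denominator from the law of total probability: Pr[positive result].
    pr_positive = (detection_rate * prevalence
                   + false_positive_rate * (1 - prevalence))
    return detection_rate * prevalence / pr_positive

# Triple-test rates used in Example 5.9.
print(round(pr_condition_given_positive(0.001, 0.60, 0.05), 3))   # 0.012

# Hypothetical, much more accurate test; the answer is still only about 0.09.
print(round(pr_condition_given_positive(0.001, 0.99, 0.01), 3))
```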
Did you think that the probability of DS with a positive result would be higher? If so, you’re not alone. A survey of practicing physicians found that their grasp of conditional probability with false positives was extremely poor (Elstein 1988). In a question about false-positive rates, where the correct answer was that 7.5% of patients with a positive test result had breast cancer, 95% of the doctors guessed that the answer was 75%! If these doctors had a better understanding of probability theory, they could avoid overstating the risks of serious disease to their patients, thus reducing unnecessary stress.

5.10 Summary

■ Probability is an important concept in biology. One reason is that randomly sampling a population represents a random trial whose outcomes are governed by the rules of probability.
■ A random trial is a process or experiment that has two or more possible outcomes whose occurrence cannot be predicted with certainty.
■ The probability of an event is the proportion of times the event occurs if we repeat a random trial over and over again under the same conditions.
■ A probability distribution describes the probabilities of all possible outcomes of a random trial.
■ Two events (A and B) are mutually exclusive if they cannot both occur (i.e., Pr[A and B] = 0). If A and B are mutually exclusive, then the probability of A or B occurring is the sum of the probability of A occurring and the probability of B occurring (i.e., Pr[A or B] = Pr[A] + Pr[B]). This is the addition rule.
■ The general addition rule gives the probability of either of two events occurring when the events are not mutually exclusive: Pr[A or B] = Pr[A] + Pr[B] − Pr[A and B]. The general addition rule reduces to the addition rule when A and B are mutually exclusive, because then Pr[A and B] = 0.
■ Two events are independent if knowing one outcome gives no information about the other outcome. More formally, A and B are independent if Pr[A and B] = Pr[A] Pr[B]. This is the multiplication rule.
■ Probability trees are useful devices for calculating the probabilities of complicated series of events.
■ If events are not independent, then they are said to be dependent. The probability of two dependent events both occurring is given by the general multiplication rule: Pr[A and B] = Pr[A] Pr[B | A].
■ The conditional probability of an event is the probability of that event occurring given some condition.
■ Probability trees and Bayes’ theorem are important tools for calculations involving conditional probabilities.
■ The law of total probability, Pr[A] = Σ Pr[B] Pr[A | B] (summing over all possible mutually exclusive values B of the condition), makes it possible to calculate the probability of an event (A) from all of the conditional probabilities of that event. For each possible condition (B), the law multiplies the probability of that condition (Pr[B]) by the conditional probability of the event given that condition (Pr[A | B]), and then sums these products over all conditions.

PRACTICE PROBLEMS

1. Calculation practice: Addition rule. When women are asked how much they like Brussels sprouts, 30% say sprouts are “very repulsive,” 20% say that they are “somewhat repulsive,” 43% are “indifferent,” 6% say sprouts are “somewhat delicious,” and 1% claim they are “especially delicious.” Only one answer per woman was allowed. The data are from Trinkaus and Dennis (1991). a. Are these five possible answers mutually exclusive? Explain. b. What is the probability that a woman would say that Brussels sprouts are either very repulsive or somewhat repulsive? c. What is the probability that a woman would say that Brussels sprouts are anything other than especially delicious? 2.
Calculation practice: Law of total probability. The survey in the previous problem was conducted on men as well: 34% say Brussels sprouts are “very repulsive,” 19% say that they are “somewhat repulsive,” 38% are “indifferent,” 8% say they are “somewhat delicious,” and 1% claim they are “especially delicious.” Assume that in a given population, 52% of the adults are women. Use the following steps to build a probability tree and calculate the probability that a random adult says that Brussels sprouts are somewhat delicious or especially delicious using the law of total probability. a. What is the probability that a randomly chosen adult is a man? What is the probability that a randomly chosen adult is a woman? Draw the first part of the probability tree for these two events. b. What is the probability that a man says that Brussels sprouts are somewhat delicious or especially delicious? c. Write (b) as a shorthand probability statement. Hint: (b) can also be stated as: What is the probability that a randomly chosen adult says that Brussels sprouts are somewhat delicious or especially delicious, given that he is a man? d. What is the probability that a woman says that Brussels sprouts are somewhat delicious or especially delicious? e. Complete the probability tree for the preceding events. f. Apply the law of total probability to determine the probability that a random adult says that Brussels sprouts are somewhat delicious or especially delicious. 3. Calculation practice: General addition rule. Among women voluntarily tested for sexually transmitted diseases in one university, 24% tested positive for human papilloma virus (HPV) only, 2% tested positive for Chlamydia only, and 4% tested positive for both HPV and Chlamydia (Tábora et al. 2005). Use the following steps to calculate the probability that a woman from this population who gets tested would test positive for either HPV or Chlamydia. a. Write the goal of the question as a probability statement. b. Write the general addition rule with words specific to this example. c. Calculate the probability that a randomly sampled woman would test positive for HPV or Chlamydia. 4. Calculation practice: General multiplication rule. In the 1980s in Canada, 52% of adult men smoked. It was estimated that male smokers had a lifetime probability of 17.2% of developing lung cancer, whereas a nonsmoker had a 1.3% chance of getting lung cancer during his life (Villeneuve and Mao 1994). a. What is the conditional probability of a Canadian man getting lung cancer, given that he smoked in the 1980s? b. Draw a probability tree to show the probability of getting lung cancer conditional on smoking. c. Using the tree, calculate the probability that a Canadian man in the 1980s both smoked and eventually contracted lung cancer. d. Using the general multiplication rule, calculate the probability that a Canadian man in the 1980s both smoked and eventually contracted lung cancer. Did you get the same answer as in (c)? e. Using the general multiplication rule, calculate the probability that a Canadian man in the 1980s both did not smoke and never contracted lung cancer. 5. Calculation practice: Bayes’ theorem. Refer to Practice Problem 4. Use the following steps to calculate the probability that a Canadian man smoked, given that he had been diagnosed with lung cancer. a. Write Bayes’ theorem for the specific case described in this question. b. Calculate the probability that a Canadian man in the late eighties would eventually develop lung cancer.
(Use the law of total probability.) c. Use Bayes’ theorem to calculate the probability that a man from this population smoked, given that he eventually developed lung cancer. 6. The pizza below, ordered from the Venn Pizzeria on Bayes Street, is divided into eight slices: The slices might have pepperoni, mushrooms, olives, and/or anchovies. Imagine that, late at night, you grab a slice of pizza totally at random (i.e., there is a 1/8 chance that you grabbed any one of the eight slices). Base your answers to the following questions on the drawing of the pizza. a. What is the chance that your slice had pepperoni on it? b. What is the chance that your slice had both pepperoni and anchovies on it? c. What is the probability that your slice had either pepperoni or anchovies on it? d. Are pepperoni and anchovies mutually exclusive on the slices from this pizza? e. Are olives and mushrooms mutually exclusive on the slices from this pizza? f. Are getting mushrooms and getting anchovies independent when choosing slices from this pizza? g. If I pick a slice from this pizza and tell you that it has olives on it, what is the chance that it also has anchovies? h. If I pick a slice from this pizza and tell you that it has anchovies on it, what is the chance that it also has olives? i. Seven of your friends each choose a slice at random and eat them without telling you what toppings they had. What is the chance that the last slice left has olives on it? j. You choose two slices at random from this pizza. What’s the chance that they both have olives on them? (Be careful—after removing the first slice, the probability of choosing one of the remaining slices changes.) k. What’s the probability that a randomly chosen slice does not have pepperoni on it? l. Draw a pizza for which mushrooms, olives, anchovies, and pepperoni are all mutually exclusive. 7. In the first hour of a hunting trip, the probability that a pride of Serengeti lions will encounter a Cape buffalo is 0.035. If it encounters a buffalo, the probability that the pride successfully captures it is 0.40 (numbers are from Scheel 1993). What is the probability that the next one-hour hunt for Cape buffalo by a pride of lions will end in a successful capture? 8. Cavities in trees are important nesting sites for a wide variety of wildlife, including the white-breasted nuthatch shown on the first page of this chapter. Cavities in trees are much more common in old-growth forests than in recently logged forests. A recent survey in Missouri found that 45 out of 273 trees in an old-growth area had cavities, while the rest did not (Fan et al. 2005). What is the probability that a randomly chosen tree in this area has a cavity? 9. The accompanying bar graph gives the relative frequency of letters in texts from the English language. Such charts are useful for deciphering simple codes. FIGURE FOR PROBLEM 9 a. If a letter were chosen at random from a book written in normal English, estimate by eye (and a bit of calculation) the probability that it is a vowel (i.e., A, E, I, O, or U). b. Estimate by eye the probability that five letters chosen independently and at random from an English text would spell out (in order) “S-T-A-T-S.” c. Estimate by eye the probability that two letters chosen at random from an English text are both E’s. 10. The gene Prdm9 is thought to regulate hotspots of recombination (crossing over) in mammals, including humans. 
Among people of Han Chinese descent living in the Los Angeles area, there are five alleles at the Prdm9 gene, labeled A1, A2, A3, A4, and A5. The relative frequencies with which these alleles occur in that population are 0.06, 0.03, 0.84, 0.03, and 0.04, respectively (Parvanov et al. 2010). Assume that in this population, the two alleles present in any individual are independently sampled from the population as a whole (this can happen if people in the community marry and produce children randomly with respect to Prdm9 genotype). a. What is the probability that a single allele chosen at random from this population is either A1 or A4? b. What is the probability that an individual has two A1 alleles (i.e., what is the probability that its first allele is A1 and its second allele is A1)? c. What is the probability that an individual has one A1 allele and one A3 allele? (Note that this can happen if the first allele drawn is A1 and the second is A3, or if the first allele is A3 and the second is A1. A probability tree will help to keep track of all the possibilities.) d. What is the probability that an individual is not A1A1 (i.e., does not have two A1 alleles)? e. What is the probability, if you drew two individuals at random from this population, that neither of them would have an A1A1 genotype? f. What is the probability, if you drew two individuals at random from this population, that at least one of them would have an A1A1 genotype? g. What is the probability that three randomly chosen individuals would have no A2 or A3 alleles? (Remember that each individual has two alleles.) 11. After graduating from your university with a biology degree, you are interviewed for a lucrative job as a snake handler in a circus sideshow. As part of your audition, you must pick up two rattlesnakes from a pit. The pit contains eight snakes, three of which have been defanged and are assumed to be harmless, but the other five are definitely still dangerous. Unfortunately, budget cuts have eliminated the herpetology course from the curriculum, so you have no way of telling in advance which snakes are dangerous and which are not. You pick up one snake with your left hand and another snake with your right. a. What is the probability that you picked up no dangerous snakes? b. Assume that any dangerous snake you pick up has a probability of 0.8 of biting you; this probability is the same for each snake. The defanged snakes do not bite. What is the chance that, in picking up your two snakes, you are bitten at least once? c. Still assume that the defanged snakes do not bite and the dangerous snakes have a probability of 0.8 of biting. If you picked up only one snake and it did not bite you, what is the probability that this snake is defanged? 12. Five different researchers independently take a random sample from the same population and calculate a 95% confidence interval for the same parameter. a. What is the probability that all five researchers have calculated an interval that includes the true value of the parameter? b. What is the probability that at least one does not include the true parameter value? 13. Schrödinger’s cat lives under constant threat of death from the random release of a deadly poison. The probability of release of the poison is 0.01 per day, and releases on successive days are independent. a. What is the probability that the cat will survive one day? b. What is the probability that the cat will survive seven days? c. What is the probability that the cat will survive a year (365 days)? d.
What is the probability that the cat will die by the end of a year? 14. Rapid HIV tests allow for quick diagnosis without expensive laboratory equipment. However, their efficacy has been called into question. In a population of 1517 tested individuals in Uganda, 4 had HIV but tested negative (false negatives), 166 had HIV and tested positive, 129 did not have HIV but tested positive (false positives), and 1218 did not have HIV and tested negative (Gray et al. 2007). a. What was the probability of a false-positive (also called the false-positive rate)? b. What was the false-negative rate? c. If a randomly sampled individual from this population tests positive on a rapid test, what is the probability that he or she has HIV? 15. Kalani et al. (2008) discovered cells responsive to Wnt proteins in the subventricular zone of developing brains of mouse embryos. These cells included a high fraction of self-renewing stem cells, which suggested that Wnt signaling occurs during brain cell self-renewal. In a particular cell preparation in vitro, 9% of subventricular brain cells were Wnt-responsive. If six cells are sampled randomly from the cell preparations, what is the probability of sampling Wnt-responsive (W) and nonresponsive (L) cells in the following orders, from a large population of cells? a. WWLWWW b. WWWWWL c. LWWWWW d. WLWLWL e. WWWLLL f. WWWWWW g. What is the probability of at least one non-responsive brain cell when six cells are randomly sampled? 16. Studies have shown that the probability that a man washes his hands after using the restroom at an airport is 0.74, and the probability that a woman washes hers is 0.83 (American Society for Microbiology 2005). A waiting room in an airport contains 40 men and 60 women. Assume that individual men and women are equally likely to use the restroom. What is the probability that the next individual who goes to the restroom will wash his or her hands? 17. If you have ever tried to take a family photo, you know that it is very difficult to get a picture in which no one is blinking. It turns out that the probability of an individual blinking during a photo is about 0.04 (Svenson 2006). a. If you take a picture of one person, what is the probability that she will not be blinking? b. If you take a picture of 10 people, what is the probability that at least one person is blinking during the photo? ASSIGNMENT PROBLEMS 18. Imagine that a collection of 1600 pea plants from one of Mendel’s experiments had 900 that were tall plants with green pods, 300 that were tall with yellow pods, 300 that were short with green pods, and 100 that were short with yellow pods. a. Are “tall” and “green pods” mutually exclusive traits for this collection of plants? b. Are “tall” and “green pods” independent traits for this collection of plants? 19. A normal deck of cards has 52 cards, consisting of 13 each of four suits: spades, hearts, diamonds, and clubs. Hearts and diamonds are red, while spades and clubs are black. Each suit has an ace, nine cards numbered 2 through 10, and three face cards. The face cards are a jack, a queen, and a king. Answer the following questions for a single card drawn at random from a well-shuffled deck of cards. a. What is the probability of drawing a king of any suit? b. What is the probability of drawing a face card that is also a spade? c. What is the probability of drawing a card without a number on it? d. What is the probability of drawing a red card? What is the probability of drawing an ace? What is the probability of drawing a red ace? 
Are these events (“ace” and “red”) mutually exclusive? Are they independent? e. List two events that are mutually exclusive for a single draw from a deck of cards. f. What is the probability of drawing a red king? What is the probability of drawing a face card in hearts? Are these two events mutually exclusive? Are they independent? 20. The human genome is composed of the four DNA nucleotides: A, T, G, and C. Some regions of the human genome are extremely G–C rich (i.e., a high proportion of the DNA nucleotides there are guanine and cytosine). Other regions are relatively A–T rich (i.e., a high proportion of the DNA nucleotides there are adenine and thymine). Imagine that you want to compare nucleotide sequences from two regions of the genome. Sixty percent of the nucleotides in the first region are G–C (30% each of guanine and cytosine) and 40% are A–T (20% each of adenine and thymine). The second region has 25% of each of the four nucleotides. a. If you choose a single nucleotide at random from each of the two regions, what is the probability that they are the same nucleotide? b. Assume that nucleotides over a single strand of DNA occur independently within regions and that you randomly sample a three-nucleotide sequence from each of the two regions. What is the chance that these two triplets are the same? 21. In Vancouver, British Columbia, the probability of rain during a winter day is 0.58, for a spring day is 0.38, for a summer day is 0.25, and for a fall day is 0.53. Each of these seasons lasts one quarter of the year. a. What is the probability of rain on a randomly chosen day in Vancouver? b. If you were told that on a particular day it was raining in Vancouver, what would be the probability that this day would be a winter day? 22. When asked an embarrassing question in a survey—such as whether the respondent has ever shoplifted—individuals may be reluctant to answer truthfully. However, answers might be more truthful if the survey incorporates a random component, such as a coin toss, that prevents the questioner from determining whether any given individual is guilty (Warner 1965). For example, consider a survey of a population in which 20% of individuals really have shoplifted at least once. The survey asks every participating individual to begin by flipping a fair coin twice. If the result of the first toss is heads, then the individual is instructed to answer honestly the question “did the second toss also yield heads?” If the first coin toss yields tails, however, the respondent is instructed to answer honestly the question “have you ever shoplifted?” a. Draw a probability tree that describes all possible outcomes of such a survey and their probabilities. b. What is the overall probability that a randomly sampled respondent answers yes? 23. Imagine that a long stretch of single-stranded DNA has 30% adenine, 25% thymine, 15% cytosine, and 30% guanine. (These make up the nucleotides of the DNA.) What is the probability of randomly drawing 10 adenines in a row in a sample of 10 randomly chosen nucleotides? 24. The Hox genes are responsible for determining the anterior–posterior identity of body regions (segments) in the developing insect embryo. Different Hox genes are turned on (expressed) in different segments of the body, and in this way they determine which segments become head and which thorax, which develop legs and which antennae.
One surprising thing about the Hox genes is that they usually occur in a row on the same chromosome and in the same order as the body regions that they control. For example, the fruit fly Drosophila melanogaster has eight Hox genes located on a chromosome in exactly the same order as the body regions in which they are expressed, from head to tail (see figure below; Lewis et al. 2003; Negre et al. 2005). If the eight genes were thrown randomly onto the same chromosome, what is the probability that they would line up in the same order as the segments in which they are expressed? FIGURE FOR PROBLEM 24 25. The flour beetle, Tribolium castaneum, has 10 chromosomes, roughly equal in size, and it also has eight Hox genes (Brown et al. 2002; see Assignment Problem 24). If the eight Hox genes were randomly distributed throughout the genome of the beetle, what is the probability that all eight would land on the same chromosome? 26. A seed randomly blows around a complex habitat. It may land on any of three different soil types: a high-quality soil that gives a 0.8 chance of seed survival, a medium-quality soil that gives a 0.3 chance of survival, and a low-quality soil that gives only a 0.1 chance of survival. These three soil types (high, medium, and low) are present in the habitat in proportions of 30:20:50, respectively. The probability that a seed lands on a particular soil type is proportional to the frequency of that type in the habitat. a. Draw a probability tree to determine the probabilities of survival under all possible circumstances. b. What is the probability of survival of the seed, assuming that it lands? c. Assume that the seed has a 0.2 chance of dying before it lands in a habitat. What is its overall probability of survival? 27. A flycatcher is trying to catch passing bugs. The probability that it catches a bug on any given try is 20%. a. What is the probability that it catches its first bug on its fourth try? b. What is the probability that at least four failures occur before the flycatcher has its first success? 28. Blackjack is a game played with an ordinary deck of cards. (See Assignment Problem 19 for a description of such a deck.) “Blackjack” itself means that, of two cards dealt to a player, one is an ace and the other is either a 10, jack, queen, or king. If you are dealt two cards randomly from the same deck, what is the probability that you get blackjack? (Remember that, when a card is dealt, it is removed from the deck.) 29. Ignoring leap years, there are 365 days in a year. a. If people are born with equal probability on each of the 365 days, what is the probability that three randomly chosen people have different birthdates? b. If people are born with equal probability on each of the 365 days, what is the probability that 10 randomly chosen people all have different birthdates? c. If, as in fact turns out to be the case, birth rates are higher during some parts of the year than other times, would this increase or decrease the probability that 10 randomly chosen people have different birthdates, compared with your answer in part (b)? 30. During the Manhattan Project, the physicist Enrico Fermi asked Leslie R. Groves, the general in charge, “How do you define a ‘great general’?” General Groves replied, “Any general who wins five battles in a row is great.” He went on to say that only about 3% of generals are great. 
If battles are won entirely at random with a probability of 0.50 per side, what fraction of generals engaging in exactly five battles would be great by this definition? How does this compare to the percentage given by the general? 31. The figure at the bottom of the page shows the probability density of colony diameters (in mm) in a hypothetical population of Paenibacillus bacteria. The distribution is continuous, so the probability of sampling a colony within some range of diameter values is given by area under the curve. Numbers next to the curve indicate the area of the region indicated in red. Consider the case in which a single colony is randomly sampled from the population. a. Are the events “diameter is between 4 and 6” and “diameter is between 8 and 12” mutually exclusive? Explain. b. What is the probability that a randomly chosen colony diameter is between 4 and 6 or between 8 and 12? FIGURE FOR PROBLEM 31 c. What is the probability that a randomly chosen colony diameter is greater than or equal to 10? d. What is the probability that a randomly chosen colony diameter is between 8 and 10? e. What is the probability that a randomly chosen colony diameter is between 8 and 12 or greater than or equal to 10? 32. “After taking 10 mammograms, a patient has a 50% chance of having had at least one false alarm.” (A false alarm is a false-positive result.) Given this information (from Elmore et al. 2005), and assuming that false alarms are independent of each other, what is the probability of a false alarm on a single mammogram? 33. A boy mentions that none of the 21 kids in his third-grade class has had a birthday since school started 56 days previously. Assume that kids in the class are drawn from a population whose birthdays have the same probability on all days of the year. What is the probability that 21 kids in such a class would not yet have a birthday in 56 days? 34. Refer to the figure accompanying Assignment Problem 31. Consider the case in which two colonies are randomly sampled from the probability distribution shown. a. Are the events “the first diameter is between 4 and 6” and “the second diameter is between 8 and 12” mutually exclusive? Explain. b. Are the events “the first diameter is between 4 and 6” and “the second diameter is between 8 and 12” independent? Explain. c. What is the probability that the first diameter is between 4 and 6 and the second diameter is between 8 and 12? d. What is the probability that the first diameter is between 8 and 12 or the second diameter is between 10 and 12? 35. Three variants of the gene encoding the β-globin component of hemoglobin occur in the human population of the Kassena-Nankana district of Ghana, West Africa. The most frequent allele, A, occurs at frequency 0.83. The two other variants, S (“sickle cell”) and C, occur at frequency 0.04 and 0.13, respectively (Ghansah et al. 2012). Each individual has two alleles, determining his or her genotype at the β-globin gene. Assume that knowing the identity of one of the alleles of any individual provides no information about the identity of the second allele (i.e., alleles occur independently in individuals). a. CC individuals, having two copies of allele C, are slightly anemic. What is the probability that a randomly sampled individual from the population has two copies of the C allele (in other words, what is the probability that the individual’s first allele is C and his or her second allele is also C)? b.
What is the probability that a randomly sampled individual is a homozygote (has two copies of the same allele)?
c. Compared with AA individuals, AS and AC individuals are largely resistant to malaria, which is endemic to the region. They also experience fewer deleterious side effects than SS and CC individuals. What is the probability that a randomly sampled individual is AS? (Remember that an individual can be AS by getting A from mom and S from dad or by getting S from mom and A from dad.)
d. What is the probability that a randomly sampled individual is AS or AC?

36. Some people are hypersensitive to the smell of asparagus and can even detect a strong odor in the urine of a person who has recently eaten asparagus. This trait turns out to have a simple genetic basis. An individual with one or two copies of the A allele at the gene (AA or Aa genotypes) can smell asparagus in urine, whereas a person with two copies of the alternative “a” allele (aa genotype) cannot (Online Mendelian Inheritance in Man, 2012). Assume that men and women in the population have the same allele frequencies at the asparagus-smelling gene and that marriage and child production are independent of genotype at the gene. In the human population, 5% of alleles are A and 95% are a.
a. What is the probability that a randomly sampled individual from the population has two copies of the a allele (that is, that it has an aa genotype)?
b. What is the probability that both members of a randomly sampled married couple (man and woman) are aa at the asparagus-smelling gene?
c. What is the probability that both members of a randomly sampled married couple (man and woman) are heterozygotes at this locus (meaning that each person has one allele A and one allele a)?
d. Consider the type of couple described in (c). What is the probability that the first child of such a couple also has one A allele and one a allele (is a heterozygote)? Remember that the child must receive exactly one allele from each parent.

37. Refer to Assignment Problem 36. If a randomly sampled child has the aa genotype, what is the probability that both its parents were also aa?

38. Refer to Table 5.5-1. It turns out that blood type is controlled by two unlinked genes. One gene determines ABO blood type, and the other determines Rh factor (+ or −). Using the probabilities presented in Table 5.5-1, determine whether the events “individual is blood type O” and “Rh factor is −” are independent.

6. Hypothesis testing

Cyanella alba

Hypothesis testing, like estimation, uses sample data to make inferences about the population from which the sample was taken. Unlike estimation, however, which puts bounds on the value of a population parameter, hypothesis testing asks only whether the parameter differs from a specific “null” expectation. Estimation asks, “How large is the effect?” Hypothesis testing asks, “Is there any effect at all?”

To better understand hypothesis testing, consider the polio vaccine developed by Jonas Salk. In 1954, Salk’s vaccine was tested on elementary-school students across the United States and Canada. In the study, 401,974 students were divided randomly into two groups: kids in one group received the vaccine, whereas those in the other group (the control group) were injected with saline solution instead. The students were unaware of which group they were in. Of those who received the vaccine, 0.016% developed paralytic polio during the study, whereas 0.057% of the control group developed the disease (Brownlee 1955).
The vaccine seemed to reduce the rate of disease by two-thirds, but the difference between groups was quite small, only about four cases per 10,000. Did the vaccine work, or did such a small difference arise purely by chance? Hypothesis testing uses probability to answer this question. The null hypothesis is that the vaccine didn’t work and that any observed difference between groups happened only by chance. Evaluating the null hypothesis involved calculating the probability, under the assumption that the vaccine has no effect, of getting a difference between groups as big as or bigger than that observed. This probability turned out to be very small. Even though the rate of disease was not hugely different between the vaccine and control groups, the Salk vaccine trial was so large (over 400,000 participants) that it was able to demonstrate a real difference. Thus, the null hypothesis was rejected. The vaccine had an effect, sparing many kids from disease, a conclusion borne out by the success of the vaccine in the ensuing decades.

Hypothesis testing quantifies how unusual the data are, assuming that the null hypothesis is true. If the data are too different from what is expected under the null hypothesis, then we reject the null hypothesis.

Hypothesis testing compares data to what we would expect to see if a specific null hypothesis were true. If the data are too unusual, compared to what we would expect to see if the null hypothesis were true, then the null hypothesis is rejected.

In this chapter, we illustrate the basics of hypothesis testing in the simplest possible setting: a test about a proportion in a single population. Our goal is to present the main concepts with a minimum of calculation. The rest of this book will present many specific methods of hypothesis testing.

6.1 Making and using statistical hypotheses

Formal hypothesis testing begins with clear statements of two hypotheses about a population: the null hypothesis and the alternative hypothesis. The null hypothesis is the default, whereas the alternative hypothesis usually includes every other possibility except the one stated in the null hypothesis. One of the two hypotheses is true, and the other must be false. We analyze the data to help determine which is which.

Both statistical hypotheses, the null and the alternative, are simple statements about a population. They are not to be confused with scientific hypotheses, which are statements about the existence and possible causes of natural phenomena. Scientists design experiments and observational studies to test predictions of scientific hypotheses. When applied to the resulting data, statistical hypotheses help to decide which predictions of these scientific hypotheses are met and which are not met.

Null hypothesis

The null hypothesis is a specific claim about the value of a population parameter. It is made for the purposes of argument and often embodies the skeptical point of view. Often, the null hypothesis is that the population parameter of interest is zero (i.e., no effect, no preference, no correlation, or no difference). In general, the null hypothesis is a statement that would be interesting to reject. For example, if we can reject the statement, “Medication X does not affect the average life span of patients suffering from illness Y,” then we have learned something useful: that such patients do in fact live longer, or shorter, lives on average when taking medication X.
Rejecting the null hypothesis would provide support for the scientific hypothesis that predicted a beneficial effect of medication X, whereas failing to reject the null hypothesis would not provide support.

The null hypothesis is a specific statement about a population parameter made for the purposes of argument. A good null hypothesis is a statement that would be interesting to reject.

The null hypothesis, which we can abbreviate as H0 (pronounced “H-naught” or “H-zero”), is always specific; it identifies one particular value for the parameter being studied. In a study to investigate the impact of drift-net fishing on the density of dolphins, for example, a valid null hypothesis could be the following:

H0: The density of dolphins is the same in areas with and without drift-net fishing.

A clinical trial designed to compare the effects of the antidepressant medication sertraline (Zoloft) and the older, tricyclic medication amitriptyline would state the null hypothesis as

H0: The antidepressant effects of sertraline do not differ from those of amitriptyline.

In other cases, the null hypothesis might represent an expectation from theory or from prior knowledge. For example, the following are valid null hypotheses:

H0: Brown-eyed parents, each of whom had one parent with blue eyes, have brown- and blue-eyed children in a 3:1 ratio.

H0: The mean body temperature of healthy humans is 98.6°F.

Alternative hypothesis

Every null hypothesis is paired with an alternative hypothesis (abbreviated HA) that usually represents all other feasible parameter values except that stated in the null hypothesis. The alternative hypothesis typically includes possibilities that are biologically more interesting than that stated in the null hypothesis. The alternative hypothesis often includes parameter values predicted by a scientific hypothesis being evaluated. For this reason, the alternative hypothesis is often, but not always, the statement that the researcher hopes is true.

The alternative hypothesis includes all other feasible values for the population parameter besides the value stated in the null hypothesis.

The following are some alternative hypotheses that go with the null hypotheses stated previously:

HA: The density of dolphins differs between areas with and without drift-net fishing.

HA: The antidepressant effects of sertraline differ from those of amitriptyline.

HA: Brown-eyed parents, each of whom had one parent with blue eyes, have brown- and blue-eyed children at something other than a 3:1 ratio.

HA: The mean body temperature of healthy humans is not 98.6°F.

In contrast to the null hypothesis, the alternative hypothesis is nonspecific. Every possible value for a population characteristic or contrast is included, except that specified by the null hypothesis.

To reject or not to reject

Crucially, null and alternative hypotheses do not have equal standing. The null hypothesis is the only statement being tested with the data. If the data are consistent with the null hypothesis, then we say we have failed to reject it (we never “accept” the null hypothesis). If the data are inconsistent with the null hypothesis, we reject it and say the data support the alternative hypothesis. Rejecting H0 means that we have ruled out the null hypothesized value. It also tells us in which direction the true value likely lies, compared to the null hypothesized value. But rejecting a hypothesis by itself reveals nothing about the magnitude of the population parameter. We use estimation to provide magnitudes.
6.2 Hypothesis testing: an example

To show you the basic concepts and terminology of hypothesis testing, we’ll take you through all the steps by using an example. Our goal is to illuminate the basic process without distraction from the details of the probability calculations. We’ll get to plenty of the details in later chapters. Four basic steps are involved in hypothesis testing:

1. State the hypotheses.
2. Compute the test statistic.
3. Determine the P-value.
4. Draw the appropriate conclusions.

We’ll define the new terms we just used in this section. Example 6.2 tests a hypothesis about a proportion, but hypothesis testing can address a wide variety of quantities, such as means, variances, differences in means, correlations, and so on. We’ll try to emphasize the general over the specific here. Further details of how to test hypotheses about proportions are discussed in Chapter 7.

EXAMPLE 6.2 The right hand of toad

Humans are predominantly right-handed. Do other animals exhibit handedness as well? Bisazza et al. (1996) tested the possibility of handedness in European toads, Bufo bufo, by sampling and measuring 18 toads from the wild. We will assume that this was a random sample. The toads were brought to the lab and subjected one at a time to the same indignity: a balloon was wrapped around each individual’s head. The researchers then recorded which forelimb each toad used to remove the balloon. It was found that individual toads tended to use one forelimb more than the other. At this point the question became: do right-handed and left-handed toads occur with equal frequency in the toad population, or is one type more frequent than the other, as in the human population? Of the 18 toads tested, 14 were right-handed and four were left-handed. Are these results evidence of a predominance of one type of handedness in toads?

Stating the hypotheses

The number of interest is the proportion of right-handed toads in the population. Let’s call this proportion p. The default statement, the null hypothesis, is that the two types of handedness are equally frequent in the population, in which case p = 0.5.

H0: Left- and right-handed toads are equally frequent in the population (i.e., p = 0.5).

This is a specific statement about the state of the toad population, one that would be interesting to prove wrong. If this null hypothesis is wrong, then toads, like humans, on average favor one hand over the other. This statement establishes the alternative hypothesis:

HA: Left- and right-handed toads are not equally frequent in the population (i.e., p ≠ 0.5).

The alternative hypothesis is two-sided. This just means that the alternative hypothesis allows for two possibilities: that p is greater than 0.5 (in which case right-handed toads outnumber left-handed toads in the population), or that p is less than 0.5 (i.e., left-handed toads predominate). Neither possibility can be ruled out before gathering the data, so both should be included in the alternative hypothesis.

In a two-sided (or two-tailed) test, the alternative hypothesis includes parameter values on both sides of the parameter value specified by the null hypothesis.

“Two-tailed” has the same meaning as “two-sided.” It refers to the tails of the sampling distribution, where a “tail” is the region at the upper or lower extreme of the distribution.

The test statistic

The test statistic is a number calculated from the data that is used to evaluate how compatible the results are with those expected under the null hypothesis.
The test statistic is a number calculated from the data that is used to evaluate how compatible the data are with the result expected under the null hypothesis.

For the toad study, we use the observed number of right-handed toads as our test statistic. On average, if the null hypothesis were correct, we would expect to observe nine right-handed toads out of the 18 sampled (and nine left-handed toads, too). Instead, we observed 14 right-handed toads out of the 18 sampled. Fourteen, then, is the value of our test statistic.

The null distribution

Unfortunately, data do not always perfectly reflect the truth. Because of the effects of chance during sampling, we don’t really expect to see exactly nine right-handed toads when we sample 18 from the population, even if the null hypothesis is true. There is usually a discrepancy, due to chance, between the observed result and that expected under H0. The mismatch between the data and the expectation under H0 can be quite large, even when H0 is true, particularly if there are not many data. To decide whether the data are compatible with the null hypothesis, we must calculate the probability of a mismatch as extreme as or more extreme than that observed, assuming that the null hypothesis is true. To obtain this probability, we need to determine the sampling distribution of the test statistic assuming that the null hypothesis is true. That is, we need to determine what values of the test statistic are possible under H0 and their associated probabilities. The probability distribution of values for the test statistic, assuming the null hypothesis is true, is called the “sampling distribution under H0” or, more simply, the null distribution.

The null distribution is the sampling distribution of outcomes for a test statistic under the assumption that the null hypothesis is true.

The tricky part is to figure out what the null distribution is for the test statistic. For the moment, let’s use the power of a computer to do the calculations. (We’ll learn a simpler and more elegant way to calculate this null distribution in Chapter 7.) Sampling 18 toads under H0 is like tossing 18 coins into the air and counting the number of heads that turn up when they land (letting heads represent right-handed toads). Tossing coins mimics the sampling process under this H0 well, because the probability of obtaining heads in any one toss is 0.5, which matches the null hypothesis. When we tossed 18 coins a vast number of times with the aid of a computer and counted the number of heads (right-handed toads) each time, we obtained the sampling distribution of outcomes illustrated in Figure 6.2-1. The probabilities themselves are listed in Table 6.2-1.

FIGURE 6.2-1 The null distribution for the test statistic, the number of right-handed toads out of 18 sampled.

TABLE 6.2-1 All possible outcomes for the number of right-handed toads when 18 toads are sampled, and their probabilities under the null hypothesis.

Number of right-handed toads    Probability
 0                              0.000004
 1                              0.00007
 2                              0.0006
 3                              0.0031
 4                              0.0117
 5                              0.0327
 6                              0.0708
 7                              0.1214
 8                              0.1669
 9                              0.1855
10                              0.1669
11                              0.1214
12                              0.0708
13                              0.0327
14                              0.0117
15                              0.0031
16                              0.0006
17                              0.00007
18                              0.000004
Total                           1.0

Based on this null distribution, any number of right-handed toads between 0 and 18 is possible in a random sample of 18 individuals, but some numbers have a much higher probability of occurring than others.
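For readers who want to try the coin-tossing approach themselves, here is one way such a simulation might look in Python (a minimal sketch; the random seed and the number of trials are arbitrary choices):

    # Simulate the null distribution for the toad study: toss 18 fair coins
    # (one per toad) many times and count heads (right-handed toads).
    import numpy as np

    rng = np.random.default_rng(seed=1)   # arbitrary seed, for repeatability
    n_toads = 18
    n_trials = 1_000_000                  # "a vast number" of repetitions
    counts = rng.binomial(n=n_toads, p=0.5, size=n_trials)

    # The proportion of trials yielding each count approximates Table 6.2-1.
    for k in range(n_toads + 1):
        print(k, np.mean(counts == k))

With a million trials, the printed proportions should agree with the probabilities in Table 6.2-1 to about three decimal places.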
Quantifying uncertainty: the P-value

Fourteen right-handed toads out of 18 total is not a perfect match to the expectation of the null hypothesis, but is the mismatch large enough to reject the possibility that chance alone is responsible? The usual way of describing the mismatch between data and a null hypothesis is to calculate the probability of getting those data, or data that are even more different from that expected, while assuming the null hypothesis. In other words, we want to know the probability of all results as unusual as or more unusual than that exhibited by the data. If this probability is small, then the null hypothesis is inconsistent with the data and we would reject the null hypothesis in favor of the alternative hypothesis. If the probability is not small, then we do not have enough evidence to doubt the null hypothesis, and we would not reject it.

The probability of obtaining the data (or data that are an even worse match to the null hypothesis), assuming the null hypothesis, is called the P-value. If the P-value is small, then the null hypothesis is inconsistent with the data and we reject it.1 Otherwise, we do not reject the null hypothesis. In general, the smaller the P-value, the stronger is the evidence against the null hypothesis.

The P-value is the probability of obtaining the data (or data showing as great or greater difference from the null hypothesis) if the null hypothesis were true.

The P-value is not the probability that the null hypothesis is true. (Hypotheses are not outcomes of random trials and so do not have probabilities.) The P-value refers to the probability of a specific event when sampling data under the null hypothesis: it is the probability of obtaining a result as extreme as or more extreme than that observed. In practice, we calculate the P-value from the null distribution for the test statistic, shown for the toad data in Figure 6.2-2.

FIGURE 6.2-2 The null distribution for the number of right-handed toads out of the 18 sampled. Outcomes in red are values as different as, or more different from, the expectation under H0 than 14, the number observed in the data.

According to Figure 6.2-2, a total of 14 or more right-handed toads out of 18 is fairly unusual, assuming the null hypothesis. These values lie at the right tail of the null distribution and have a low probability of occurring if H0 is true. Equally unusual are 0, 1, 2, 3, or 4 right-handed toads, which are outcomes at the other tail of the null distribution. Remember that our alternative hypothesis HA is two-sided: it allows for the possibility that right-handed toads outnumber left-handed toads in the population, and also the possibility that left-handed toads outnumber right-handed toads. Therefore, outcomes from both tails of the null distribution that are as unusual as the observed data, or even more unusual, must be accounted for in the calculation of the P-value. Based on the data in Figure 6.2-2, the probability of 14 or more right-handed toads, assuming the null hypothesis is true, is

Pr[14 or more right-handed toads] = Pr[14] + Pr[15] + Pr[16] + Pr[17] + Pr[18] = 0.0155,

where Pr[14] is the probability of exactly 14 right-handed toads. We can add the probabilities of 14, 15, 16, 17, and 18 because these outcomes are mutually exclusive. This sum is not the P-value, though, because it does not yet include the equally extreme results at the left tail of the null distribution; that is, those outcomes involving a predominance of left-handed toads.
The quickest way to include the probabilities of the equally extreme results at the other tail is to take the above sum and multiply it by two:

P = 2 × (Pr[14] + Pr[15] + Pr[16] + Pr[17] + Pr[18]) = 2 × 0.0155 = 0.031.

This number is our P-value. In other words, the probability of an outcome as extreme as or more extreme than 14 right-handed toads out of 18 toads sampled is P = 0.031, assuming that the null hypothesis is true.

Draw the appropriate conclusion

Having calculated the P-value, what conclusion can we draw from it? On page 157, we said that if P is “small,” we reject the null hypothesis; otherwise, we do not reject H0. But what value of P is small enough? By convention in most areas of biological research, the boundary between small and not-small P-values is 0.05. That is, if P is less than or equal to 0.05, then we reject the null hypothesis; if P > 0.05, we do not reject it. The P-value for the toad data, P = 0.031, is indeed less than 0.05, so we reject the null hypothesis that left-handed and right-handed toads are equally frequent in the toad population. We conclude from these data that most of the toads in the population are right-handed.

This decision threshold for P (i.e., P = 0.05) is called the significance level, which is signified by α (the lowercase Greek letter alpha). In biology, the most widely used significance level is α = 0.05, but you will encounter some studies that use a different value for α. After α = 0.05, the next most commonly used significance level is α = 0.01. In Section 6.3, we explain the consequences of choosing a significance level and consider why α = 0.05 is the most common choice.2

The significance level, α, is a probability used as a criterion for rejecting the null hypothesis. If the P-value is less than or equal to α, then the null hypothesis is rejected. If the P-value is greater than α, then the null hypothesis is not rejected.

Reporting the results

When writing up your results in a research paper or laboratory report, always include the following information in the summary of the results of a statistical test:
■ the value of the test statistic
■ the sample size
■ the P-value

Leaving out any of these three values (e.g., presenting only the bare P-value) makes it difficult for the reader to determine how you obtained it. When writing up the results of the toad study, we would need to indicate that 14 out of 18 toads were right-handed (which in this case gives both the test statistic and the sample size) and that P = 0.031.

In addition, the best practice is to provide confidence intervals, or at least the standard errors, for the parameters of interest. This is because, although the P-value indicates the weight of evidence against the null hypothesis (smaller P means stronger evidence), P does not measure the size of the effect. A very small P-value may result even when the size of the effect being measured is small. The confidence interval puts bounds on the estimated magnitude of effect. Using the methods we explain in Chapter 7, we calculated the following 95% confidence interval for the proportion p of right-handed toads in the study population:

0.54 < p < 0.91.

This calculation reveals that the range of most-plausible values for the true proportion of right-handed toads in the population is very broad. We would need a larger sample size to obtain a more precise estimate.
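The toad calculation is easy to verify with software. The following sketch uses the exact binomial probabilities from the scipy library rather than simulation (one of several ways to do it):

    # Two-sided P-value for the toad data from the binomial null distribution.
    from scipy.stats import binom

    n = 18          # toads sampled
    observed = 14   # right-handed toads observed
    p_null = 0.5    # proportion of right-handed toads under H0

    # Pr[14 or more right-handed toads] under H0 (the upper tail) ...
    upper_tail = binom.sf(observed - 1, n, p_null)   # about 0.0155

    # ... doubled to include the equally extreme outcomes at the other tail.
    p_value = 2 * upper_tail
    print(round(p_value, 3))   # 0.031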
6.3 Errors in hypothesis testing

The most unsettling aspect of hypothesis testing is the possibility of errors. Rejecting H0 does not necessarily mean that the null hypothesis is false, and failing to reject H0 does not necessarily mean that the null hypothesis is true. This is because chance affects samples, sometimes with large impact. If the data are a random sample, though, some of this uncertainty can be quantified, so making rational decisions is possible.

Type I and Type II errors

There are two kinds of errors in hypothesis testing, prosaically named Type I and Type II. Rejecting a true null hypothesis is a Type I error. Failing to reject a false null hypothesis is a Type II error. Both types of error are summarized in Table 6.3-1.

Type I error is rejecting a true null hypothesis. The significance level α sets the probability of committing a Type I error.

Type II error is failing to reject a false null hypothesis.

TABLE 6.3-1 Types of error in hypothesis testing.

                          Reality
Conclusion                H0 true          H0 false
Reject H0                 Type I error     Correct
Do not reject H0          Correct          Type II error

The significance level, α, gives us the probability of committing a Type I error. If we go along with convention and use a significance level of α = 0.05, then we reject H0 whenever P is less than or equal to 0.05. This means that, if the null hypothesis were true, we would reject it mistakenly one time in 20. Biologists typically regard this as an acceptable error rate.

We could reduce our Type I error rate if we wanted to, simply by using a smaller significance level than 0.05. For example, a more cautious approach would be to use α = 0.01 instead of 0.05, rejecting the null hypothesis only if P is less than or equal to 0.01. This would have the beneficial effect of reducing the probability of committing a Type I error to 0.01 (i.e., one time in 100). Unfortunately, it has the side effect of increasing the chance of committing a Type II error. Reducing α makes the null hypothesis more difficult to reject when it is true, but it also makes the null hypothesis more difficult to reject when it is false. For this reason, the convention is to use a higher value such as α = 0.05.

If a null hypothesis is false, we need to reject it to get the right answer. Failure to reject a false null hypothesis is a Type II error. Because the Salk vaccine really did reduce the probability of catching polio, another study that by chance found the vaccine had no effect would have committed a Type II error. A study that has a low probability of Type II error is said to have high power. Power is the probability that a random sample taken from a population will, when analyzed, lead to rejection of a false null hypothesis. All else being equal, a study is better if it has more power.

The power of a test is the probability that a random sample will lead to rejection of a false null hypothesis.

Power is difficult to quantify, because the probability of rejecting a null hypothesis depends on how different the truth is from the null hypothesis. Detecting a small effect is more difficult than detecting a large effect. Because we never know how large the true value is, we usually cannot predict with any confidence how much power a study really has. If we can guess the magnitude of the deviation from the null hypothesis, however, we can usually estimate the power of a study. A study has more power if the sample size is large, if the true discrepancy from the null hypothesis is large, or if the variability in the population is low. We discuss how to calculate power and how to design a study to optimize power when we study experimental design in Chapter 14.
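To make this concrete, suppose (purely for illustration; the value is invented) that the true proportion of right-handed toads were p = 0.8. The power of the two-sided toad test could then be estimated along these lines:

    # Estimated power of the two-sided toad test (n = 18, alpha = 0.05)
    # if the true proportion of right-handed toads were 0.8.
    from scipy.stats import binom

    n, alpha = 18, 0.05

    # Rejection region under H0 (p = 0.5): outcomes whose two-sided P-value
    # is at most alpha.  For n = 18 this works out to X <= 4 or X >= 14.
    reject = [k for k in range(n + 1)
              if 2 * min(binom.cdf(k, n, 0.5), binom.sf(k - 1, n, 0.5)) <= alpha]

    # Power: the probability of landing in that region when p is really 0.8.
    p_true = 0.8
    power = sum(binom.pmf(k, n, p_true) for k in reject)
    print(reject, round(power, 2))

Under these invented numbers, the power comes out to roughly 0.72: even a fairly strong preference for one forelimb would be missed by a sample of 18 toads more than a quarter of the time.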
6.4 When the null hypothesis is not rejected

Example 6.4 describes a study in which the null hypothesis is not rejected. We discuss how to interpret such a nonsignificant result.

EXAMPLE 6.4 The genetics of mirror-image flowers

Individuals of most plant species are hermaphrodites (with both male and female sexual organs) and are therefore prone to the worst kind of inbreeding: having sex with themselves. The mud plantain, Heteranthera multiflora, has a simple mechanism to avoid “selfing.” The female sexual organ (the style) deflects to the left in some individuals and to the right in others (see the pair of flower images above). The male sexual organ (the anther) is on the opposite side. Bees visiting a left-handed plant are dusted with pollen on their right side, which then is deposited on the styles of only right-handed plants visited later. To investigate the genetics of this variation, Jesson and Barrett (2002) crossed pure strains of left- and right-handed flowers, yielding only right-handed plants in the next generation. These right-handed plants were then crossed to each other. The expectation under a simple model of inheritance would be that their offspring should consist of left- and right-handed individuals in a 1:3 ratio. Of 27 offspring measured from one such cross, six were left-handed and 21 were right-handed. Do these data support the simple genetic model?

Let’s go through the four steps of hypothesis testing with this new example: state the hypotheses; compute the test statistic; determine the P-value; draw the appropriate conclusions.

The test

The null hypothesis states the expectation of the simple genetic model:

H0: Left- and right-handed offspring occur at a 1:3 ratio (i.e., the proportion of left-handed individuals in the offspring population is p = 1/4).

The alternative hypothesis covers every other possibility:

HA: Left- and right-handed offspring do not occur at a 1:3 ratio (i.e., p ≠ 1/4).

As with the study of handedness in toads (Example 6.2), the test in this flower study is two-sided: under the alternative hypothesis, the proportion of left-handed offspring may be less than 1/4 or it may be greater than 1/4. Neither possibility can be ruled out before gathering the data, so both possibilities must be included.

The number of left-handed offspring (out of 27) is the test statistic. We also could have used the number of right-handed individuals as the test statistic; the choice between the two doesn’t much matter as long as the null distribution reflects our choice. Under the null hypothesis, the expected number of left-handed offspring is 27 × 1/4 = 6.75. This expected frequency is a long-run average, because we don’t really expect to find 0.75 of a left-handed flower.

To get the null distribution for the number of left-handed offspring, we again used the computer to take a vast number of random samples of 27 individuals from an imaginary population in which 1/4 of the individuals were left-handed. The resulting null distribution is illustrated in Figure 6.4-1.

FIGURE 6.4-1 The null distribution for the number of left-handed individuals out of 27 sampled. Red bars on the left indicate outcomes less than or equal to six, the observed number of left-handed individuals.

We’ve used red bars to indicate just one tail of the null distribution in this drawing. However, the test is two-sided, and an equivalent portion of the right tail of the distribution must be incorporated when calculating the P-value for a two-sided test.
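The simulation sketch shown for the toad example carries over with new numbers (again with arbitrary seed and trial count):

    # Null distribution for the flower cross: 27 offspring, Pr[left-handed] = 1/4.
    import numpy as np

    rng = np.random.default_rng(seed=1)
    counts = rng.binomial(n=27, p=0.25, size=1_000_000)

    # Proportion of samples with six or fewer left-handed offspring:
    print(np.mean(counts <= 6))   # close to 0.471, as calculated below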
The observed number of left-handed flowers (6) is less than the expected number under the null hypothesis (6.75), so we begin the calculation of the P-value by determining the probability of obtaining six or fewer left-handed offspring, assuming that the null hypothesis is true:

Pr[number of left-handed flowers ≤ 6] = Pr[6] + Pr[5] + … + Pr[0].

Summing these probabilities (given here by the height of the bars in Figure 6.4-1) yields

Pr[number of left-handed flowers ≤ 6] = 0.471.

This sum gives only the probability under the left tail of the null distribution (Figure 6.4-1). Because the test is two-sided, we also need to account for outcomes at the other tail of the distribution that are as unusual as or more unusual than the outcome observed. The most straightforward method to obtain P is to multiply the above sum by two (Yates 1984), yielding

P = 2 Pr[number of left-handed flowers ≤ 6] = 2 × 0.471 = 0.942.

The P-value is quite high: there is a high probability of getting data like these when the null hypothesis is true. The P-value is not less than or equal to the conventional significance level α = 0.05, so our conclusion is that the null hypothesis is not rejected.

Interpreting a nonsignificant result

What does failure to reject H0 mean? Can we conclude that the null hypothesis is true? Sadly, we can’t, because it is always possible that the true value of the proportion p differs from 1/4 by a small or even moderate amount. You can tell this from the 95% confidence interval for p, the proportion of left-handed flowers in the population of possible offspring from the cross. Using methods we will detail in Chapter 7, we calculated this interval to be

0.10 < p < 0.41.

This result indicates that 1/4 falls between the lower and upper limits for p, and so it is certainly among the most-plausible values for this proportion. However, many other possible values for the proportion also fall within this most-plausible range. In other words, it’s possible that p is 1/4, but it is also possible that p differs from 1/4 even though H0 was not rejected, because the power of the test was limited by the relatively small sample size of 27.

How, then, do we interpret the result? A valid interpretation is that the data are compatible or consistent with the null hypothesis; in other words, the data are compatible with the simple genetic model of inheritance of handedness in the mud plantain. There is no need or justification to build more complex genetic models of inheritance for flower handedness. A time may come when a new study, one with a larger sample size and more power, convincingly rejects the null hypothesis. If so, then researchers will need to develop a new genetic model. In the meantime, no evidence warrants a more complex scenario. This attitude treats the null hypothesis for what it is: the default until data show otherwise.
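The doubled-tail calculation for the flower cross can be checked the same way as the toad example, here again sketched with scipy’s exact binomial probabilities:

    # Two-sided P-value for the flower cross, by doubling the observed tail.
    from scipy.stats import binom

    n = 27           # offspring measured in the cross
    observed = 6     # left-handed offspring observed
    p_null = 0.25    # proportion left-handed expected under H0

    lower_tail = binom.cdf(observed, n, p_null)   # Pr[X <= 6], about 0.471
    p_value = 2 * lower_tail                      # about 0.942
    print(round(p_value, 3))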
Keep in mind that an analysis should never be terminated when having only the results of a hypothesis test to show for it. Drawing conclusions about populations from data also requires that we estimate useful parameters and put bounds on these estimates (Chapter 4). Calculating 95% confidence intervals will help to identify the most plausible set of parameter values given the data. If a test fails to reject a null hypothesis, but the confidence interval is wide, then we know that we do not yet have enough information to draw a strong conclusion. But if the confidence interval is narrow and tightly bounded around the parameter value stated in the null hypothesis, then any real deviation from H0 is likely to be either small or nonexistent.

6.5 One-sided tests

The studies of handedness in toads (Example 6.2) and flowers (Example 6.4) required two-sided tests, but one-sided tests are justified in some circumstances. In a one-sided test, the alternative hypothesis includes values for the population parameter exclusively to one side of the value stated in the null hypothesis.

In a one-sided (or one-tailed) test, the alternative hypothesis includes parameter values on only one side of the value specified by the null hypothesis. H0 is rejected only if the data depart from it in the direction stated by HA.

For example, imagine a study designed to test whether daughters resemble their fathers. In each trial of the study, a participant examines a photo of one girl and photos of two adult men, one of whom is the girl’s father. The participant must guess which man is the father. If there is no daughter–father resemblance, then the probability that the participant guesses correctly is only 1/2:

H0: Participants pick the father correctly half the time (p = 1/2).

The only reasonable alternative hypothesis is that daughters indeed resemble their fathers, in which case the probability that the participant guesses correctly should exceed 1/2:

HA: Participants pick the father correctly more frequently than half the time (p > 1/2).

The test is “one-sided” because the alternative hypothesis includes values for the parameter on only one side of the value stated in the null hypothesis. This is justified if the values on the other side of the null value are inconceivable for any reason other than chance. It is not really conceivable that daughters would on average resemble their fathers less than they resemble randomly chosen men. The alternative hypothesis for a one-tailed test must be chosen before looking at the data. Data will always fall on one side or the other of the null hypothesis, so the data themselves should not be used to predict in which direction the deviation might be.

Now let’s imagine that the study was carried out using 18 independent trials (which would require 18 different sets of photographs and participants) and that 13 out of 18 participants successfully guessed the father of the daughter. Under the null hypothesis, we would expect only 18 × 1/2 = 9 correct guesses on average. The null distribution is the same as that shown in Figure 6.2-1 and was obtained by taking a vast number of random samples of 18 participants from an imaginary population in which 1/2 guess correctly. The actual probabilities are the same as those presented previously in Table 6.2-1.

Calculating a P-value for a one-sided test is different from the procedure used in a two-sided test. This is because the only outcomes that would cause us to reject H0 and prefer HA are those with an unusually large number of correct guesses (the direction stated in the alternative hypothesis). Therefore, we examine only the right tail of the null distribution (Figure 6.5-1).

FIGURE 6.5-1 The null distribution for the number of participants who correctly guessed the father of the daughter from photographs. Bars filled in red on the right tail correspond to values greater than or equal to 13, the observed number of correct guesses.
If 13 participants guessed the father correctly, then the P-value is

P = Pr[number of correct guesses ≥ 13] = Pr[13] + Pr[14] + Pr[15] + … + Pr[18] = 0.048,

assuming that H0 is true. There is no need to multiply this number by two, as in the case of the two-sided tests presented earlier, because we are accounting for the probability in only one tail of the distribution.

To reiterate: the appropriate tail of the null distribution to use when calculating P is given by the alternative hypothesis. Had we observed only four correct guesses out of 18, the P-value would still be calculated using the probabilities in the right tail of the null distribution, even though four correct guesses is smaller than the null expectation:

P = Pr[number of correct guesses ≥ 4] = 0.996.

One-sided tests should be used sparingly, because the decision whether to use a one-sided or a two-sided alternative hypothesis is usually less clear-cut than in the daughter–father resemblance study, and it is therefore subjective. For example, what if we carried out a subsequent study to test whether daughters, when they marry, choose husbands who resemble their fathers? The null hypothesis is that there is no resemblance, but what is the alternative hypothesis? Should it be one-sided (husbands resemble fathers) or two-sided (husbands resemble fathers or husbands are very unlike fathers)? One researcher may have a clear theoretical basis for predicting a deviation from the null hypothesis in one direction, but a second researcher may have a different prediction. More importantly, even when we predict that a result may go in a particular direction, we may still be tempted to interpret a result in the tail of the opposite direction as a significant deviation from the null hypothesis. Two-tailed tests keep us honest. For these reasons, we recommend using two-sided tests except in very special circumstances, and we adopt that policy in this book.
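Both one-sided calculations can be reproduced in the same way as before; note that only the right tail is summed, whichever way the data fall (again a sketch using scipy):

    # One-sided P-values for the father-guessing study (n = 18, H0: p = 1/2).
    from scipy.stats import binom

    n, p_null = 18, 0.5

    # HA is p > 1/2, so only the right tail counts, whatever the data show.
    print(round(binom.sf(13 - 1, n, p_null), 3))   # Pr[X >= 13], about 0.048
    print(round(binom.sf(4 - 1, n, p_null), 3))    # Pr[X >= 4], about 0.996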
6.6 Hypothesis testing versus confidence intervals

The confidence interval puts bounds on the most-plausible values for a population parameter based on the data in a random sample (Chapter 4). Would confidence intervals and hypothesis tests on the same data, then, give the same answer? In other words, if the parameter value stated in the null hypothesis fell outside the 95% confidence interval estimated for the parameter, would H0 be rejected by a hypothesis test at α = 0.05? And, if the parameter value stated in the null hypothesis fell inside the 95% confidence interval, would a test at α = 0.05 fail to reject H0? The answer is almost always “yes.”3 A 95% confidence interval usually contains all values of the parameter that would not be rejected if stated as a null hypothesis and tested with the same data at α = 0.05.

Why, then, don’t we just skip hypothesis testing altogether? The confidence interval does virtually everything that hypothesis testing can do, and it has the added benefit of being much more revealing about the actual magnitude of the parameter. Yet hypothesis testing is used all the time in biology. And while it is fair to say that confidence intervals are not used enough in biological research, hypothesis testing is informative, too. Its main use is to decide whether sufficient evidence has been presented to support a scientific claim (Frick 1996). The kinds of claims addressed by hypothesis testing are largely qualitative, such as “this new drug is effective” or “this pollutant harms fish.” Such claims are embodied by the alternative hypothesis, where “sufficient evidence” is defined as P ≤ 0.05 (or some other significance level). For this reason, both hypothesis testing and confidence intervals are a big part of statistical analyses and of this book.

6.7 Summary

■ The four steps of hypothesis testing are (1) state the hypotheses; (2) compute the test statistic; (3) determine the P-value; and (4) draw the appropriate conclusions.
■ Hypothesis testing uses data to decide whether a parameter equals the value stated in a null hypothesis. If the data are too unusual, assuming the null hypothesis is true, then we reject the null hypothesis.
■ The null hypothesis (H0) is a specific claim about a parameter. The null hypothesis is the default hypothesis, the one assumed to be true unless the data lead us to reject it. A good null hypothesis would be interesting if rejected.
■ The alternative hypothesis (HA) usually includes all values for the parameter other than that stated in the null hypothesis.
■ The test statistic is a quantity calculated from data, used to evaluate how compatible the data are with the null hypothesis.
■ The null distribution is the sampling distribution of the test statistic under the assumption that the null hypothesis is true.
■ The P-value is the probability of obtaining a difference from the null expectation as great as or greater than that observed in the data if the null hypothesis were true. If P is less than or equal to α, then H0 is rejected.
■ The threshold α is called the significance level of a test. Typically, α is set to 0.05.
■ The P-value is not the probability that the null hypothesis is true or false.
■ The P-value reflects the weight of evidence against the null hypothesis, but P does not measure the size of the effect. Use confidence intervals to put bounds on the magnitude of effect.
■ A Type I error is rejecting a true null hypothesis. A Type II error is failing to reject a false null hypothesis:

                          Reality
Decision                  H0 true          H0 false
Reject H0                 Type I error     (no error)
Do not reject H0          (no error)       Type II error

■ The probability of making a Type I error is set by the significance level, α. If α = 0.05, then the probability of making a Type I error is 0.05.
■ The power of a test is the probability that a random sample, when analyzed, leads to rejection of a false null hypothesis.
■ Increasing sample size increases the power of a test.
■ In a two-sided test, the alternative hypothesis includes parameter values on both sides of the parameter value stated by the null hypothesis. In a one-sided test, the alternative hypothesis includes parameter values on only one side of the parameter value stated by the null hypothesis.
■ Most hypothesis tests are two-sided. One-sided tests should be restricted to rare instances in which a parameter value on one side of the null value is inconceivable.

PRACTICE PROBLEMS

1. A scientist tests the null hypothesis that the mean height of plants in his population is 0.75 meters, as it is in a nearby population observed with a complete census of all plants. Assume that this null hypothesis is true. However, the plants in his sample were chosen nonrandomly, and the tallest plants were more likely to be chosen than expected by chance. Which of the following statements is true?
a. A hypothesis test based on this biased sample would have an increased probability of making a Type I error.
b. This sampling bias will not affect the probability of a Type I error.

2. Imagine that you are using a random sample of data to test a null hypothesis. Answer whether the following statement is true or false: A parameter estimate with high sampling error will result in a test with a higher Type I error rate compared to an estimate with low sampling error.

3. Answer the following questions:
a. Define Type II error.
b. Define significance level.
c. If the significance level of a test is increased, will the probability of making a Type II error increase, decrease, or stay the same? Explain.

4. Assume that a null hypothesis for a statistical test is true. Say whether each of the following statements is true or false:
a. Calculating the P-value assumes that the sample is a random sample.
b. If you reject H0 with a test, you will be making a Type II error.
c. If you fail to reject H0 with a test, you will be making a Type I error.

5. Do people have powers of extrasensory perception (ESP)? Some people claim to have such abilities or to know someone who has them. Some people claim that individuals who consistently get the wrong answer must have ESP too. Other people disbelieve such claims. Imagine that you are to set up an experiment to test the existence of ESP. Each trial in your experiment involves one person privately rolling a fair six-sided die and holding the image of the face of the die that turned up firmly in her mind. In another room, a second person attempts to identify the result correctly without being shown the outcome.
a. What would the null hypothesis be for your test?
b. What would the alternative hypothesis be?

6. Identify whether each of the following statements is more appropriate as the null hypothesis or as the alternative hypothesis in a test:
a. Hypothesis: The number of hours preschool children spend watching television affects how they behave with other children when at day care.
b. Hypothesis: Most genetic mutations are deleterious to health.
c. Hypothesis: A diet of fast foods has no effect on liver function.
d. Hypothesis: Cigarette smoking influences risk of suicide.
e. Hypothesis: Growth rates of forest trees are unaffected by increases in carbon dioxide levels in the atmosphere.

7. What effect does reducing the value of the significance level from 0.05 to 0.01 have on
a. the probability of committing a Type I error?
b. the probability of committing a Type II error?
c. the power of a test?
d. the sample size?

8. Assume a random sample. What effect does increasing the sample size have on
a. the probability of committing a Type I error?
b. the probability of committing a Type II error?
c. the power of a test?
d. the significance level?

9. In the toad experiment (Example 6.2), what would the P-value have been if
a. 15 toads were right-handed and the rest were left-handed?
b. 13 toads were right-handed and the rest were left-handed?
c. 10 toads were right-handed and the rest were left-handed?
d. 7 toads were right-handed and the rest were left-handed?

10. Why do we “fail to reject H0” rather than “accept H0” after a test in which the P-value is calculated to be greater than α?

11. An imaginary researcher examined the 18 largest mammal species in the Americas that occur on both the mainland and on islands. Sixteen of the mammal species were smaller on islands than on the mainland, such as the Channel Island pygmy mammoth shown next to a continental mammoth in the drawing. The remaining two species were larger on the islands than on the mainland.
With these data, test whether large mammals are likely to differ in size between islands and the mainland in a particular direction. Proceed through all four steps of the hypothesis-testing procedure. Use the null sampling distribution in Table 6.2-1 to calculate your P-value. Apply the conventional significance level, α = 0.05.

12. A clinical trial was carried out to test whether a new treatment affects the recovery rate of patients suffering from a debilitating disease. The null hypothesis “H0: The treatment has no effect” was rejected with a P-value of 0.04. The researchers used a significance level of α = 0.05. State whether each of the following conclusions is correct. If not, explain why.
a. The treatment has only a small effect.
b. The treatment has some effect.
c. The probability of committing a Type I error is 0.04.
d. The probability of committing a Type II error is 0.04.
e. The null hypothesis would not have been rejected had the significance level been set to α = 0.01 instead of 0.05.

13. As neuronal activity increases in the brain, blood flow to the brain also rises to meet the increasing demands for oxygen. Sheth et al. (2004) measured blood dynamics in the somatosensory cortex of rat brains to determine whether volume increased linearly with greater neuronal activity, or whether the relationship might be nonlinear, with blood volume to the brain increasing more steeply the greater the amount of neuronal activity. They estimated that the rate at which total hemoglobin to the tissue increased with increasing neuronal activity was 1.17. A rate of 1 is expected if the relationship between the two variables is linear, whereas a value different from one indicates a nonlinear relationship. The 95% confidence interval for the rate in the population was 0.74 ≤ rate ≤ 1.59. The researchers also tested the null hypothesis that the relationship between the variables is linear (H0: rate = 1). Did their test reject the null hypothesis or not? Explain.

14. Imagine that you have carried out a study to determine whether sons resemble their mothers. In each of 18 independent trials, you showed a participant a photo of one boy and photos of two adult women, one of whom is the boy’s mother and the other of whom is randomly chosen. You ask the participant to guess which woman is the mother. If there is no son–mother resemblance, the probability that the participant guesses correctly is 1/2. If sons really do resemble their mothers, the probability that the participant guesses correctly is greater than 1/2. By answering the following questions, carry out the four steps of hypothesis testing.
a. State the appropriate null and alternative hypotheses.
b. Is this a one-sided or two-sided test? How can you tell?
c. What would you use as the test statistic for this study?
d. Suppose you found that 7 of the 18 participants guessed correctly and 11 guessed incorrectly. Calculate the P-value for this result. The null distribution is the same as that shown in Figure 6.2-1, and the probabilities of each outcome are given in Table 6.2-1.
e. What is the conclusion from your test?
f. What should be your next step, once you have completed the hypothesis test, to help interpret your findings and to determine the most-plausible range of values for the population parameter, p?

ASSIGNMENT PROBLEMS

15. For the following alternative hypotheses, give the appropriate null hypothesis.
a. Pygmy mammoths and continental mammoths differ in their mean femur lengths.
b.
Patients who take phentermine and topiramate lose weight at a different rate than control patients without these drugs.
c. Patients who take phentermine and topiramate have different proportions of their babies born with cleft palates than do patients not taking these drugs.
d. Shoppers on average buy different amounts of candy when Christmas music is playing in the shop compared to when the usual type of music is playing.
e. Male white-collared manakins (a tropical bird) dance more often when females are present than when they are absent.

16. Identify whether each of the following statements is more appropriate as a null hypothesis or an alternative hypothesis.
a. Hypothesis: The number of hours that grade-school children spend doing homework predicts their future success on standardized tests.
b. Hypothesis: King cheetahs on average run the same speed as standard spotted cheetahs.
c. Hypothesis: The mean length of African elephant tusks has changed over the last 100 years.
d. Hypothesis: The risk of facial clefts is equal for babies born to mothers who take folic acid supplements compared with those born to mothers who do not.
e. Hypothesis: Caffeine intake during pregnancy affects mean birth weight.

17. State the most appropriate null and alternative hypotheses for each of the following experiments or observational studies:
a. A test of whether cigarette smoking causes lung cancer
b. An experiment to test whether mean herbivore damage to a genetically modified crop plant differs from that in the related unmodified crop
c. A test of whether industrial effluents from a factory into the Mississippi River are affecting fish densities downstream
d. A test of whether providing municipal safe-injection sites for drug addicts influences the rate of HIV transmission

18. Assume that a null hypothesis is true. Which one of the following statements is true?
a. A study with a larger sample is more likely than a smaller study to get the result that P < 0.05.
b. A study with a larger sample is less likely than a smaller study to get the result that P < 0.05.
c. A study with a larger sample is equally likely, compared to a smaller study, to get the result that P < 0.05.

19. Assume that a null hypothesis is false. Which one of the following statements is true?
a. A study with a larger sample is more likely than a smaller study to get the result that P < 0.05.
b. A study with a larger sample is less likely than a smaller study to get the result that P < 0.05.
c. A study with a larger sample is equally likely, compared to a smaller study, to get the result that P < 0.05.

20. Tikal National Park in Guatemala is heavily visited by tourists. Does the disturbance affect animal densities? To investigate, Hidinger (1996) compared the densities of various bird and mammal species in places immediately next to heavily visited ruins to places in the park that were rarely visited by tourists. The mean densities (in animals/km2) are found in the accompanying table. The table also lists the P-value associated with a test of the null hypothesis that the two types of plots do not differ in mean density.

TABLE FOR PROBLEM 20
Species             Mean density near ruins    Mean density far from ruins    P-value
Agouti              160.2                      14.5                           0.03
Coatimundi          99.4                       1.0                            0.01
Collared peccary    4.6                        1.8                            0.79
Deppe's squirrel    32.3                       2.2                            0.54
Howler monkey       7.3                        1.9                            0.03
Spider monkey       170.8                      15.0                           0.88
Crested guan        0                          49.4                           0.001
Great curassow      10.8                       72.0                           0.048
Ocellated turkey    47.0                       0                              0.02
Tinamou             0                          4.9                            0.049

a.
Which species show a statistically significant reduction in mean density near heavily visited ruins? Use a significance level of α = 0.05.
b. Which species show a significant increase in density near heavily visited ruins?
c. Which species provide no significant evidence of a difference in mean density between areas frequented by tourists and those rarely visited?
d. For which species is the evidence strongest for a change in density near tourist sites?

21. Imagine that two researchers independently carry out clinical trials to test the same null hypothesis, that COX-2 selective inhibitors (which are used to treat arthritis) have no effect on the risk of cardiac arrest. They use the same population for their study, but one experimenter uses a sample size of 60 participants, whereas the other uses a sample size of 100. Assume that all other aspects of the studies, including significance levels, are the same between the two studies.
a. Which study has the higher probability of a Type II error, the 60-participant study or the 100-participant study?
b. Which study has higher power?
c. Which study has the higher probability of a Type I error?
d. Should the tests be one-tailed or two-tailed? Explain.

22. A study showed that the sex ratio of children born to families in a native community of Ontario deviated significantly from the continental average ratio. A newspaper report about this study claimed that “there was only a 1-percent probability that the results were due to chance.”4 What do you think this statement refers to? Rewrite this statement so that it is correct.

23. A group of researchers tested whether snakes tend to choose a warm resting site when both a warm site and a cool site are presented to them. Their hypotheses were
H0: Snakes do not prefer the warmer site.
HA: Snakes prefer the warmer site.
They carried out the experiment and with their data calculated a one-tailed P-value of P = 0.03. They rejected their null hypothesis and concluded that snakes prefer the warmer sites.
a. Is a one-tailed test appropriate here? Explain.
b. What would their hypothesis statements have been had they used a two-tailed test instead?
c. What would their P-value have been had they used a two-tailed test instead?

24. Does being told how a suspenseful story will end ruin the experience for the reader? Movie and book critics are careful to avoid giving too much away in their reviews, but the impact of such “spoilers” has only recently been tested scientifically (Leavitt and Christenfeld 2011). Students were asked to read a variety of crime-mystery stories. Some were told the ending before starting to read, whereas others were not told. At the end, readers were asked to rate their enjoyment of the stories on a scale from 1 to 10, with 10 indicating greatest enjoyment. The mean enjoyment score was 7.3 in the group of students told the endings beforehand, while the average enjoyment score was 6.6 in the other group. A test of the statistical null hypothesis of no difference between the means of the two groups yielded P = 0.001.
a. State the appropriate conclusion of the test.
b. Do the data support the idea that knowing the ending reduces the mean enjoyment of the stories?
c. The authors used a two-sided test. Is this appropriate? Explain.

25. Can parents distinguish their own children by smell alone? To investigate, Porter and Moore (1981) gave new T-shirts to children of nine mothers. Each child wore his or her shirt to bed for three consecutive nights.
During the day, from waking until bedtime, the shirts were kept in individually sealed plastic bags. No scented soaps or perfumes were used during the study. Each mother was then given the shirt of her child and that of another, randomly chosen child and asked to identify her own by smell. Eight of nine mothers identified their children correctly. Use this study to answer the following questions, using a two-sided test and a significance level of α = 0.05. a. To carry out a statistical test based on these data, what is the appropriate null hypothesis? b. What is the alternative hypothesis? c. What test statistic should you use? d. The following figure shows the null distribution for the number of mothers out of nine guessing correctly. The probability of each outcome is given above the bars. If the null hypothesis were true, what is the probability of exactly eight correct identifications? e. If the null hypothesis were true, what is the probability of obtaining eight or more correct identifications? f. What is the P-value for the test? g. What is the appropriate conclusion? h. As part of the analysis of these data, why would it be a good idea to calculate a 95% confidence interval for the true proportion of correct identifications? 26. Dondorp et al. (2010) carried out a large clinical trial that compared the effectiveness of two drugs, artesunate and quinine, for treatment of African children with severe malaria. They used the results to test the null hypothesis that the proportion of children dying was the same in the two treatments against the alternative hypothesis that one treatment was better than the other. Which single answer below is correct? Explain your answer. The null distribution used in the test of the null hypothesis … a. describes the probability distribution of possible true benefits of each of the two drug treatments. b. describes the possible probability distribution of true benefits and costs of one drug treatment or the other drug treatment. c. describes the probability distribution of possible observed differences between the treatment groups if there were truly no difference between drug treatments. d. describes the probability distribution of possible observed differences between treatments given that one of the drugs really is better than the other. 27. About 30% of people cannot detect any odor when they sniff the steroid androstenone, but they can become sensitive to its smell if exposed to the chemical repeatedly. Does this change in sensitivity happen in the nose or in the brain? Mainland et al. (2002) exposed one nostril of each of 12 non-detector participants to androstenone for short periods every day for 21 days. The other nostril was plugged and had humidified air flow to prevent androstenone from entering. After the 21 days, the researchers found that 10 of 12 participants had improved androstenone-detection accuracy in the plugged nostril, whereas two had reduced accuracy. This suggested that increases in sensitivity to androstenone happen in the brain rather than the nostril, since the epithelia of the nostrils are not connected. The authors conducted a statistical hypothesis test of whether accuracy in fact did change. Let p refer to the proportion of non-detectors in the population whose accuracy scores improve after 21 days. Under the null hypothesis, p = 0.5 (as many participants should improve as deteriorate in their accuracy after 21 days).
The alternative hypothesis is that p ≠ 0.5 (the proportion of participants whose accuracy improves is either greater or less than 0.5). a. Did the authors carry out a one- or a two-sided test? What justification might they provide? b. The accompanying figure shows the null distribution for the number of participants out of 12 having an improved accuracy score. The probability of each outcome is given above the bars. To what do these probabilities refer? c. What is the test statistic for the test? d. What is the P-value for the test? e. What is the appropriate conclusion? Use significance level α = 0.05. 28. Refer to problem 27. The researchers also found that 11 of 12 participants showed improved accuracy in the exposed nostril after 21 days. a. Carry out the four steps of hypothesis testing, to test whether accuracy scores changed after 21 days in the exposed nostrils of participants. Use significance level α = 0.05. b. Why would it be useful also to provide a confidence interval for the proportion of participants improving? 29. A team of researchers conducted 100 independent hypothesis tests using a significance level of α = 0.05. a. If all 100 null hypotheses were true, what is the probability that the researchers would reject none of them? b. If all 100 null hypotheses were true, how many of these tests on average are expected to reject the null hypothesis? 30. In many animal species, individuals communicate using ultraviolet (UV) signals, such as bright patches of skin or feathers, that are visible to one another but invisible to humans. Secondi et al. (2012) investigated the role of UV colors as sexual signals in two closely related species of Lissotriton newt, by asking whether females find males of their own species more attractive when UV light is present than when UV light is absent. Each trial consisted of confining a male newt to one end of an aquarium and measuring how much time a female of the same species chose to spend near the male (rather than at the opposite end of the aquarium). Each male was tested both under natural light conditions (UV light present) and when UV light was absent (by using a UV filter). The box plots in the following figure (modified from Secondi et al. 2012) show the difference between the two treatments in the time females spent near the males. A positive number indicates that a female spent more time near the male when UV light was present, and a negative number indicates that she spent less time near the male when UV light was present. For both species, the authors tested the null hypothesis that UV light treatment had no effect; that is, H0: Mean change in time females spent close to males = 0. Test results for the two species are as follows, along with 95% confidence intervals for the mean change:

L. vulgaris: X̄ = 50.7, s = 87.3, n = 23, P = 0.011, 12.9 ≤ μ ≤ 88.4.
L. helveticus: X̄ = 199.8, s = 587.0, n = 25, P = 0.102, −42.5 ≤ μ ≤ 442.1.

Using these results, indicate whether the following statements are true or false. Explain your answers. a. The null hypothesis was rejected in the case of L. vulgaris but not in the case of L. helveticus at significance level 0.05. b. UV light treatment affects the attractiveness of male L. vulgaris to females but has no effect on the attractiveness of male L. helveticus. c. The weight of evidence against the null hypothesis was stronger in L. vulgaris than in L. helveticus. d. The magnitude of the effect of UV treatment was greater in L. vulgaris than in L. helveticus.
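Several of the problems above (25, 27, and 28, for example) ask for probabilities from a null distribution in which each of n independent trials has a 50% chance of “success.” These are binomial probabilities, developed fully in Chapter 7. As a brief computational aside, here is a minimal Python sketch (not part of the problems themselves; it assumes SciPy is available, and the function name is our own) showing how such null probabilities and tail sums can be computed:

```python
# Null-distribution probabilities for an experiment with n yes/no trials,
# each with null probability p0 of success (the binomial distribution,
# introduced in Chapter 7).
from scipy.stats import binom

def null_probabilities(n, p0, x_observed):
    """Return Pr[X = x_observed] and Pr[X >= x_observed] under the null."""
    exact = binom.pmf(x_observed, n, p0)          # height of one bar
    upper_tail = binom.sf(x_observed - 1, n, p0)  # sum of bars at or above x
    return exact, upper_tail

# For instance, with n = 9 trials and p0 = 0.5 (as in problem 25):
exact, upper_tail = null_probabilities(9, 0.5, 8)
print(exact)           # Pr[exactly 8 successes] = 9/512, about 0.018
print(2 * upper_tail)  # a two-sided P-value, doubling the upper tail
```

Changing n, p0, and x_observed reproduces the probabilities printed above the bars in the null-distribution figures referred to in problems 25 and 27.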
3 INTERLEAF Why statistical significance is not the same as biological importance

In the early 20th century, the word “significant” had one dominant meaning. If you said that something was “significant,” you meant that it signified or showed something—that it had or conveyed a meaning. When R. A. Fisher said that a result was significant, he meant that the data showed some difference from the null hypothesis. In other words, we were able to learn something from those data. This sense of the word “significant” has persisted in the scientific literature. When discussing data, a “statistically significant” result means that a null hypothesis has been rejected. But languages, like species, evolve. Over the 20th century, the word “significant” came to mean important. Today, outside of statistics, when we say that something is significant, we usually mean that it has value and import—that it ought to be paid attention to. This creates ambiguity when the term is applied now to scientific findings. When a newspaper article describes a new scientific result as “a significant new finding” or “a significant advance in our knowledge,” for example, is that a statistical statement or a value judgment? Sometimes the meaning is blurred. A statistically significant result is not the same as a biologically important result. An important result in biology is one whose effect is large enough to matter in some way. The mean lengths of the third molars may differ between two closely related species of mammals, but this doesn’t by itself mean that the difference is vital. To address the importance of the difference, we need to know its magnitude and how it might matter. The problem is that extremely small, biologically uninteresting effects can be statistically significant, as long as the sample size is sufficiently large. For example, automobile accidents increase during full moons, and this result is statistically significant (Templer et al. 1982). Such results attract media attention because they bring to mind stories of werewolves and vampires wreaking havoc on the human population. But the size of the effect is a trivial 1% increase in the accident rate, far too small to make it worth cautioning you to check the phase of the moon before you go driving. In fact, almost any null hypothesis can be rejected with a large enough sample. There are likely few factors in biology whose effects on humans and other organisms are exactly zero. We should care about the effects only if they are large enough to matter.

Full of sound and fury, signifying nothing. — Shakespeare, Macbeth

We already have some sense that statistically significant does not mean important when we read of scientific studies that showed such earth-shattering facts as “teenagers like to listen to music sometimes,” “driving fast makes the ride feel bumpier,” and “spiders scare some people.” Each of these results was backed by hard data and statistics, but each surprised no one. On the other hand, a result can be important even if it is not statistically significant. Sometimes new data suggest a pattern that, if true, would be very important, provoking further study. For example, the first studies testing whether administering streptokinase reduces deaths after heart attacks did not reject the null hypothesis that this drug had no effect on mortality rate. But they were small studies that showed a suggestive pattern.
As a result, further larger studies were conducted, and streptokinase was eventually shown to be effective in reducing mortality rates from heart attacks. This is why we don’t “accept the null hypothesis”—a small study on a small effect will have a low probability of rejecting a false null hypothesis. At the other extreme, sometimes it is important to show the lack of an effect. A large-scale study of the efficacy of hormone replacement therapy (HRT) showed no statistically significant evidence for a benefit of HRT to postmenopausal women. Moreover, confidence intervals showed that any plausible effect was small. HRT had been in wide use, with substantial money being invested and with some known deleterious side effects. Knowing that it had little benefit saved a great deal of resources and prevented many unnecessary side effects. This result was not “statistically significant,” but it was very important medically. When presenting data, we should always report the estimated magnitude of the effect with a confidence interval, not just the P-value. The confidence interval gives us a plausible range for the size of the effect, and if this interval includes values with greatly varying interpretations, we know that we have to revisit the question with further data. We should look at a graphical presentation of the data to gauge the magnitude of the effect. The importance of a result depends on the value of the question and the size of the effect. Statistical significance tells us merely how confidently we can reject a null hypothesis, but not how big or how important the effect is.

7 Analyzing proportions

[Chapter-opening photo: follicle mites in a human hair follicle]

What proportion of people with Lou Gehrig’s disease will survive at least 10 years after diagnosis? What proportion of the North Carolina red wolf population is female? In what fraction of years does global temperature increase? Each of these questions is about a proportion, the fraction of the population that has a particular characteristic of interest. The proportion of individuals sharing some characteristic in a population is also the probability that an individual randomly sampled from that population will have that attribute. A proportion can range from zero to one. In this chapter, we’ll describe how best to estimate a population proportion using a random sample, including how to calculate its confidence interval. We continue the development, begun in Chapter 4, of one of the major themes of data analysis: estimation with confidence intervals. We also outline the best way to test hypotheses about a population proportion. In Chapter 6, we used a computer to take a vast number of random samples to obtain the null distribution for a proportion. Here in Chapter 7 we show a much quicker method to test hypotheses about a population proportion. This method, called the binomial test, provides exact P-values. The key to estimation and hypothesis testing is an understanding of the sampling distribution for a proportion. Therefore, we begin this chapter by exploring the properties of random samples from a population when each individual can be categorized into one of two types.

The binomial distribution

Consider a measurement made on individuals that divides them into two mutually exclusive groups, such as success or failure, alive or dead, left-handed or right-handed, or diabetic or nondiabetic.
In the population, a fixed proportion p of individuals fall into one of the two groups (call it “success”) and the remaining individuals fall into the other group (call it “failure”). Calling one of the categories “success” and the other “failure” is a convenience, not a value judgment.1 If we take a random sample of n individuals from this population, the sampling distribution for the number of individuals falling into the success category is described by the binomial distribution. The term “binomial” reveals its meaning: there are only two (bi-) possible outcomes, and both are named (-nomial) categories. The binomial distribution provides the probability distribution for the number of “successes” in a fixed number of independent trials, when the probability of success is the same in each trial.

Formula for the binomial distribution

The binomial formula gives the probability of X successes in n trials, where the outcome of any single trial is either success or failure. The binomial distribution assumes that
■ the number of trials (n) is fixed,
■ separate trials are independent, and
■ the probability of success (p) is the same in every trial.
Under these conditions, the probability of getting X successes in n trials is

Pr[X successes] = \binom{n}{X} p^X (1 − p)^{n−X}.

The left side of this equation, Pr[X successes], means the “probability of X successes,” where X is an integer between 0 and n. On the right-hand side, the quantity \binom{n}{X} is read “n choose X.” This represents the number of unique ordered sequences of successes and failures that yield exactly X successes in n trials.2 The term is shorthand for

\binom{n}{X} = \frac{n!}{X!\,(n − X)!},

where n! is called “n factorial” and refers to the product n! = n × (n − 1) × (n − 2) × (n − 3) × … × 2 × 1. Similarly, X! is “X factorial” and (n − X)! is “(n − X) factorial.” By definition, 0! = 1, so \binom{n}{0} and \binom{n}{n} are both equal to 1. Factorials get very large very fast. For example, 5! = 120, but 20! = 2,432,902,008,176,640,000. Even with a reasonably small number of trials, calculating the binomial coefficient can require a good calculator or computer.3

Suppose, for example, that you randomly sample n = 5 wasps from a population (representing five independent trials), where each wasp has probability p = 0.2 of being male. The probability, then, that exactly three of the wasps in your sample are male is

Pr[3 males] = \binom{5}{3} (0.2)^3 (1 − 0.2)^{5−3} = \frac{5 \times 4 \times 3 \times 2 \times 1}{(3 \times 2 \times 1)(2 \times 1)} (0.2)^3 (0.8)^2 = 0.0512.

The binomial distribution describes the sampling distribution for the number of successes in a random sample of n trials, but it also describes the proportion of successes. Because the number of trials is fixed at n, the probability that the sample has the proportion X/n successes is the same as the probability that the sample has X successes. For example, the probability that a proportion 0.6 of the five wasps are male is the same as the probability that exactly three are male—namely, 0.0512.

Number of successes in a random sample

Let’s look at a complete binomial distribution. Imagine, for example, that we are randomly sampling n = 27 individuals from a population in which p = 0.25 of the individuals are successes and 1 − p = 0.75 of the individuals are failures. What is the probability that the sample contains exactly X successes? This scenario is identical to the one stated by the null hypothesis in Example 6.4 (the study of mirror-image flowers). Recall that we sampled a population of mud plantains in which p = 0.25 had left-handed flowers and 1 − p = 0.75 had right-handed flowers.
The process of taking a random sample from this population exactly matches the assumptions of the binomial distribution: n independent random trials, where the probability of success p is the same in every trial. Hence, the binomial distribution can be used to determine the probability of any given number of successes. For example, the probability of getting exactly six left-handed flowers with n = 27 and p = 0.25 is

Pr[6 left-handed flowers] = \binom{27}{6} (0.25)^6 (1 − 0.25)^{27−6}.

The binomial coefficient, “27 choose 6,” is

\binom{27}{6} = \frac{27!}{6!\,(27 − 6)!} = \frac{27!}{6!\,21!} = \frac{27 \times 26 \times 25 \times \cdots \times 3 \times 2 \times 1}{(6 \times 5 \times 4 \times 3 \times 2 \times 1)(21 \times 20 \times 19 \times \cdots \times 3 \times 2 \times 1)} = 296{,}010.

Therefore,

Pr[6 left-handed flowers] = 296,010 (0.25)^6 (0.75)^{21} = 0.1719.

In other words, there is about a 17% chance that exactly six of 27 randomly chosen flowers are left-handed, if the proportion of left-handed flowers in the population is 0.25. We continue in this way to calculate all the probabilities of the sampling distribution (Table 7.1-1). Check some of the probabilities in this table for practice.

TABLE 7.1-1 The probability of obtaining X left-handed flowers out of n = 27 randomly sampled, if the proportion of left-handed plants in the population is 0.25.

Number of left-handed flowers (X)    Pr[X]
 0    4.2 × 10^−4
 1    0.0038
 2    0.0165
 3    0.0459
 4    0.0917
 5    0.1406
 6    0.1719
 7    0.1719
 8    0.1432
 9    0.1008
10    0.0605
11    0.0312
12    0.0138
13    0.0053
14    0.0018
15    5.1 × 10^−4
16    1.3 × 10^−4
17    2.8 × 10^−5
18    5.1 × 10^−6
19    8.1 × 10^−7
20    1.1 × 10^−7
21    1.2 × 10^−8
22    1.1 × 10^−9
23    7.9 × 10^−11
24    4.4 × 10^−12
25    1.8 × 10^−13
26    4.5 × 10^−15
27    5.5 × 10^−17

The probability distribution given in Table 7.1-1 is plotted in Figure 7.1-1. The distribution in Figure 7.1-1 looks the same as the distribution in Figure 6.4-1, which was obtained by using a computer to take a vast number of random samples of n = 27 from the population. In other words, the mathematical formula for the binomial distribution predicts the distribution generated using a computer simulation of the sampling process, but it does so much more rapidly and easily. The binomial distribution also gives exact probabilities. As a result, we can use the binomial distribution in place of repeated sampling on the computer to test null hypotheses.

FIGURE 7.1-1 Plot of the probabilities given in Table 7.1-1 that were calculated from the binomial distribution. The plot shows the probability of obtaining X left-handed flowers out of n = 27 randomly sampled, if the proportion of left-handed plants in the population is 0.25.

Sampling distribution of the proportion

If there are X successes out of n trials in a random sample, then the estimated proportion of successes is p̂:

p̂ = X / n.

(We pronounce p̂ as “p-hat.” Recall that p refers to the proportion in the population, whereas p̂ refers to the sample proportion.) We can use the same hypothetical population of flowers, having a true proportion of p = 0.25 successes, to illustrate the sampling distribution for the sample proportion p̂. The panel on the top in Figure 7.1-2 is the sampling distribution when n = 10 (a relatively small sample size), whereas the panel on the bottom is the distribution for a larger sample size, n = 100. Both are based on binomial distributions, but rather than showing the number of successes X, we have divided X by n to yield p̂.

FIGURE 7.1-2 The sampling distribution for the proportion of successes p̂ for sample size n = 10 (top) and n = 100 (bottom). In both of these graphs, the population proportion is p = 0.25. The distribution is narrower (smaller standard deviation) when n is larger.
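The numbers in Table 7.1-1 and the two sampling distributions in Figure 7.1-2 are easy to reproduce by computer. The following is a minimal Python sketch (not part of the example itself; it assumes NumPy and SciPy are installed) that checks the Pr[6 left-handed flowers] calculation and simulates the sampling distribution of p̂ at the two sample sizes shown in the figure:

```python
import numpy as np
from scipy.stats import binom

# Exact binomial probabilities for the mud-plantain example:
# n = 27 flowers sampled, p = 0.25 left-handed in the population.
print(binom.pmf(6, 27, 0.25))              # 0.1719, as calculated above

# All of Table 7.1-1 in one call: Pr[X] for X = 0, 1, ..., 27.
table = binom.pmf(np.arange(28), 27, 0.25)

# Simulate the sampling distribution of p-hat (compare Figure 7.1-2):
rng = np.random.default_rng(seed=1)
for n in (10, 100):
    p_hat = rng.binomial(n, 0.25, size=100_000) / n
    # The spread of p-hat should match sqrt(p(1 - p)/n).
    print(n, p_hat.std(), np.sqrt(0.25 * 0.75 / n))
```

For n = 10 the simulated standard deviation of p̂ comes out near 0.14, and for n = 100 near 0.04, matching the narrowing visible in the figure and the standard error formula given next.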
The mean of the sampling distribution of p̂, the proportion of successes in a random sample of size n, is p. In other words, the proportion of successes in random samples is the same on average as the proportion of successes in the population. Therefore, p̂ is an unbiased estimate of the population proportion—on average, it gives the right answer. Notice in Figure 7.1-2 how the sample size affects the width of the sampling distribution for p̂. When n is large (bottom panel), the sampling distribution is narrow. This effect is quantified in the formula for the standard error of p̂. (Remember from Section 4.2 that the standard error of an estimate is the standard deviation of its sampling distribution.) The standard error of p̂ (designated by σ_p̂) is

σ_p̂ = \sqrt{\frac{p(1 − p)}{n}}.

The sample size (n) is in the denominator of the standard error equation, so the standard error decreases as the sample size increases. That is why the estimates from samples of size 10 in Figure 7.1-2 (top panel) are more spread out than the estimates based on 100 individuals (bottom panel). Larger samples yield more precise estimates. The improvement in precision as sample size increases is called the law of large numbers.

Testing a proportion: the binomial test

The binomial test applies the binomial sampling distribution to hypothesis testing for a proportion. The types of questions it is suitable for have already been encountered in Chapter 6. The binomial test is used when a variable in a population has two possible states (i.e., “success” and “failure”), and we wish to test whether the relative frequency of successes in the population (p) matches a null expectation (p0). The hypothesis statements look like this:

H0: The relative frequency of successes in the population is p0.
HA: The relative frequency of successes in the population is not p0.

The null expectation (p0) can be any specific proportion between zero and one, inclusive. The binomial test uses data to test whether a population proportion (p) matches a null expectation (p0) for the proportion. Example 7.2 shows how to apply the binomial test to real data.

EXAMPLE 7.2 Sex and the X

A study of 25 genes involved in spermatogenesis (sperm formation) found their locations in the mouse genome. The study was carried out to test a prediction of evolutionary theory that such genes should occur disproportionately often on the X chromosome.4 As it turned out, 10 of the 25 spermatogenesis genes (40%) were on the X chromosome (Wang et al. 2001; see Figure 7.2-1). If genes for spermatogenesis occurred “randomly” throughout the genome, then we would expect only 6.1% of them to fall on the X chromosome, because the X chromosome contains 6.1% of the genes in the genome. Do the results support the hypothesis that spermatogenesis genes occur preferentially on the X chromosome?

FIGURE 7.2-1 Cartoon of the mouse genome. Each vertical line represents one of the mouse chromosomes and indicates its length relative to the others. Each mark on a line indicates a single gene involved in spermatogenesis. Note the abundance of these genes on the X chromosome.

The null hypothesis is that the spermatogenesis genes would be on the X chromosome about 6.1% of the time, if they were randomly spread around the genome. To express this in terms of the binomial distribution, let’s call the placement of each gene in the sample a “trial,” and if the gene is on the X chromosome we’ll call it a success. The null hypothesis is that the probability of success (p) is 0.061.
The more interesting alternative hypothesis is that the probability of success (p) is not 0.061—that is, spermatogenesis genes occur on the X chromosome either more frequently or less frequently than 0.061. We can write these hypotheses more formally as follows:

H0: The probability that a spermatogenesis gene falls on the X chromosome is p = 0.061.
HA: The probability that a spermatogenesis gene falls on the X chromosome is something other than 0.061 (p ≠ 0.061).

Note once again the asymmetry of these two hypotheses. The null hypothesis is very specific, while the alternative hypothesis is not specific, referring to every other possibility. Also note that there are two ways to reject the null hypothesis: there can be an excess of spermatogenesis genes on the X chromosome (i.e., p > 0.061) or there can be too few (i.e., p < 0.061). Too few is not inconceivable, so it should also be included in the alternative hypothesis. Therefore, the test is two-sided. The next step is to identify the test statistic that will be used to compare the observed result with the null expectation. In the case of the binomial test, the test statistic is the observed number of successes. For the data in Example 7.2, that would be 10 spermatogenesis genes on the X chromosome. The null expectation is 0.061 × 25 = 1.525. On average, we expect the fraction 0.061 of the 25 spermatogenesis genes sampled—namely, 1.525—to be located on the X chromosome if H0 is true. Therefore, we know that in the sample more genes were found on the X chromosome than were expected by the null hypothesis. The question now is whether we are likely to get such an excess by chance alone if the null hypothesis were true. To decide this we need the null distribution, the sampling distribution for the test statistic assuming that the null hypothesis is true. As mentioned previously, the sampling distribution for the number of successes X in a random sample of n individuals from a population having the proportion p of successes is described by the binomial distribution. Under the null hypothesis, the proportion is p = 0.061, so, for the above data (where n = 25 genes), the null distribution is given by

Pr[X successes] = \binom{25}{X} (0.061)^X (1 − 0.061)^{25−X}.

This null distribution allows us to calculate the P-value, the probability of getting a result as extreme as, or more extreme than, 10 spermatogenesis genes on the X chromosome when the null expectation is 1.525. Because the test is two-tailed, P is the probability of getting 10 or more genes on the X chromosome plus the probability of similarly extreme results at the other tail of the null distribution, corresponding to too few genes on the X chromosome. We account for all the extreme outcomes by doubling the probability of getting 10 or more:

P = 2 Pr[number of successes ≥ 10].

The probability of getting exactly 10 out of 25 on the X chromosome, when the probability of being on the X chromosome is 0.061, is

Pr[10 successes] = \binom{25}{10} (0.061)^{10} (1 − 0.061)^{15} = 9.07 × 10^−7.

This calculation is listed in Table 7.2-1, along with the probability of more than 10 genes on the X chromosome, if the null hypothesis were true.

TABLE 7.2-1 Probabilities in the right-hand tail of the binomial distribution with n = 25 and p = 0.061.
Number of genes on X    Probability under the null hypothesis
10    9.1 × 10^−7
11    8.0 × 10^−8
12    6.1 × 10^−9
13    4.0 × 10^−10
14    2.2 × 10^−11
15    1.0 × 10^−12
16    4.3 × 10^−14
17    1.5 × 10^−15
18    4.2 × 10^−17
19    1.0 × 10^−18
20    2.0 × 10^−20
21    3.1 × 10^−22
22    3.6 × 10^−24
23    3.1 × 10^−26
24    1.7 × 10^−28
25    4.3 × 10^−31

The probability of getting 10 or more spermatogenesis genes on the X chromosome, assuming the null hypothesis is true, is the sum over all of these mutually exclusive possibilities:

Pr[number of successes ≥ 10] = Pr[10] + Pr[11] + Pr[12] + … + Pr[25] = 9.9 × 10^−7.

The final P-value is

P = 2 Pr[number of successes ≥ 10] = 2(9.9 × 10^−7) = 1.98 × 10^−6.

This P-value5 is well below the conventional significance level of α = 0.05. The probability of getting a result as extreme as, or more extreme than, the observed result is very low if the null hypothesis were true. Therefore, we reject the null hypothesis and conclude that there is a disproportionate number of spermatogenesis genes on the X chromosome. Our best estimate of the proportion of spermatogenesis genes that are located on the mouse X chromosome is

p̂ = 10/25 = 0.40,

which is much greater than 0.061, the proportion stated in the null hypothesis. These results might be stated in a scientific report: “A disproportionately large proportion of spermatogenesis genes occur on the X chromosome (0.40, SE = 0.10; binomial test, n = 25, P < 0.001).” This statement includes the standard error of the proportion, which we show you how to calculate in the next section.

Approximations for the binomial test

The binomial test gives us an exact P-value and can be applied in principle to any data that fit into two categories. But calculating P-values for the binomial test can be tedious without a computer, especially when n is large. Alternatives that are faster to calculate can be used under certain situations. They yield only approximate P-values, but they can save a lot of time when appropriate. We discuss two of them in subsequent chapters, but here we just want to let you know that they exist. The first is the χ2 goodness-of-fit test (Chapter 8), and the second is the normal approximation to the binomial test (Chapter 10).

Estimating proportions

Here we show you how to measure the precision of an estimate of a population proportion. We explain how to estimate a standard error for a sample proportion and how to calculate a confidence interval for a population proportion. We’ll motivate this endeavor with the data from Example 7.3 throughout.

EXAMPLE 7.3 Radiologists’ missing sons

Male radiologists have long suspected that they tend to have fewer sons than daughters. What is the proportion of males among the offspring of radiologists? In a sample of 87 offspring of “highly irradiated” male radiologists, 30 were male (Hama et al. 2001). Assume that this was a random sample. The best estimate of the proportion of male offspring in this population is the sample proportion

p̂ = X/n = 30/87 = 0.345.

Estimating the standard error of a proportion

As we learned in Section 4.2, the standard deviation of the sampling distribution for an estimate is known as the standard error of that estimate. We have seen in Section 7.1 that the standard deviation of p̂ (and therefore its standard error) is

σ_p̂ = \sqrt{\frac{p(1 − p)}{n}}.

We cannot usually calculate σ_p̂ because we don’t know the population parameter p. However, we can approximate this standard error with SE_p̂, which uses the estimate of the proportion:6

SE_p̂ = \sqrt{\frac{p̂(1 − p̂)}{n}}.
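Both the binomial test of Example 7.2 and the proportion calculations of Example 7.3 reduce to a few lines of code. Here is a minimal Python sketch (our own illustration, assuming SciPy is available; the upper tail is doubled by hand because software packages vary in their two-sided conventions):

```python
from math import sqrt
from scipy.stats import binom

# Example 7.2: binomial test of H0: p = 0.061, having observed 10 of 25
# spermatogenesis genes on the X chromosome.
upper_tail = binom.sf(10 - 1, 25, 0.061)  # Pr[X >= 10] = 9.9e-7
print(2 * upper_tail)                     # two-sided P = 1.98e-6

# Example 7.3: sample proportion and its standard error (30 sons of 87).
X, n = 30, 87
p_hat = X / n                             # 0.345
se = sqrt(p_hat * (1 - p_hat) / n)        # 0.051
print(round(p_hat, 3), round(se, 3))

# Agresti-Coull 95% confidence interval (described in the next section):
p_prime = (X + 2) / (n + 4)               # 0.352
half_width = 1.96 * sqrt(p_prime * (1 - p_prime) / (n + 4))
print(round(p_prime - half_width, 3),     # 0.254
      round(p_prime + half_width, 3))     # 0.450
```

The printed values match the hand calculations worked through below.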
In the sample of offspring of radiologists, the standard error of the estimate of the proportion who are male is approximated by

SE_p̂ = \sqrt{\frac{p̂(1 − p̂)}{n}} = \sqrt{\frac{0.345(1 − 0.345)}{87}} = 0.051.

This value tells us how close, on average, our sample estimate (p̂) is likely to be to the population proportion (p).

Confidence intervals for proportions—the Agresti–Coull method

Recall from Section 4.3 that a confidence interval is the range of most-plausible values of the parameter we are trying to estimate, based on the data. The 95% confidence interval of a proportion will enclose the true value of the proportion 95% of the time that it is calculated from new data. There are many methods in the statistical literature for calculating an approximate confidence interval for a proportion. We recommend using the Agresti–Coull method (Agresti and Coull 1998). For the 95% confidence interval, we first calculate a number called p′:

p′ = \frac{X + 2}{n + 4}.

This p′ is just an intermediate calculation needed in the next equation, not an estimate of the proportion. The 95% confidence interval for a proportion is

p′ − 1.96 \sqrt{\frac{p′(1 − p′)}{n + 4}} < p < p′ + 1.96 \sqrt{\frac{p′(1 − p′)}{n + 4}}.

Getting back to the male radiologists and their many daughters, recall that our best estimate of the true proportion of sons among the offspring of highly irradiated radiologists is 0.345. We can calculate the 95% confidence interval for the population proportion using X = 30 and n = 87 in the formula for the Agresti–Coull method:

p′ = \frac{X + 2}{n + 4} = \frac{30 + 2}{87 + 4} = 0.352.

The 95% confidence interval is

0.352 − 1.96 \sqrt{\frac{0.352(1 − 0.352)}{87 + 4}} < p < 0.352 + 1.96 \sqrt{\frac{0.352(1 − 0.352)}{87 + 4}},

which gives

0.254 < p < 0.450.

This interval does not include the value 0.512, which is the proportion of sons typically found in the human population. In other words, 0.512 is not one of the most-plausible values for the proportion of sons of radiologists. We can be reasonably confident, therefore, that the proportion of sons of radiologists is lower than the human average, assuming that the data are indeed a random sample. The reason for so few sons is not known.

Confidence intervals for proportions—the Wald method

The most commonly used method to determine a confidence interval for a proportion is called the Wald method. In fact, most statistics packages for the computer still use the Wald method. We show it here only because it is used so often, but we do not recommend using it because it is not accurate in some commonly encountered situations. The Wald method brackets the population estimate p̂ by a multiple of its standard error. By the Wald method, the 95% confidence interval of the proportion is

p̂ − 1.96 SE_p̂ < p < p̂ + 1.96 SE_p̂.

You can see that this formula is approximately the same as the 95% confidence interval calculated using the 2SE rule (Chapter 4). For the radiologist data in Example 7.3, for instance, the 95% confidence interval calculated using the Wald method is 0.244 < p < 0.445. Unfortunately, the method is accurate only when n is large and the population p is not close to 0 or 1. A 95% confidence interval should bracket the true population parameter in 95% of samples. Unfortunately, when n is small, or when p is close to 0 or 1, the Wald confidence interval for the proportion contains the true proportion less than 95% of the time. We recommend using the Agresti–Coull method instead.

Deriving the binomial distribution

We eventually use several probability distributions in this book. Each of these has been mathematically derived from first principles, and often this derivation is quite challenging.
The binomial distribution, on the other hand, is relatively straightforward to derive. Here in this section, we sketch out how it is done. Imagine that we randomly sample n individuals from a population and that we take them in order, one at a time, from 1 to n. We want to calculate the probability of getting X successes. The first step in calculating this probability is to determine the number of sequences of successes and failures that lead to X successes in total. For most values of X, many different sequences of successes and failures will yield a total of exactly X successes. For example, imagine that we sample five children and we want to know the probability of getting exactly three boys (B) and two girls (G). For a sample this small, it is relatively easy to list all of the possible sequences of n = 5 trials that yield three boys:

BBBGG  BBGBG  BBGGB  BGBBG  BGBGB
BGGBB  GBBBG  GBBGB  GBGBB  GGBBB

There are exactly 10 such sequences. In general, the number of different sequences of n events yielding exactly X successes is given by the binomial coefficient, \binom{n}{X}. In the above example of X = 3 boys out of n = 5 children, the binomial coefficient is

\binom{5}{3} = \frac{5!}{3!\,(5 − 3)!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{(3 \times 2 \times 1)(2 \times 1)} = 10.

The next step in finding the probability of X successes in n random trials relies on the assumption that separate trials are independent. In this case, each success happens with the same probability (p), and each failure happens with probability 1 − p. Thus, from the multiplication rule, the probability of any string of successes and failures is the product of these probabilities for each event. Thus, a single sequence that has X successes and n − X failures has probability p^X (1 − p)^{n−X}. Under independence, each sequence of trials yielding exactly X successes has this same probability of occurring. The last step is to add up the probabilities of all sequences yielding exactly X successes (i.e., we apply the addition rule). Because each of the alternative sequences is mutually exclusive and each has the same probability, we find the overall probability of X successes in n trials by multiplying the probability of any given sequence by the number of sequences:

Pr[X successes] = \binom{n}{X} p^X (1 − p)^{n−X}.

This is the formula for the binomial distribution that we first introduced in Section 7.1 and used thereafter.

Summary

■ The binomial distribution expresses the probability of getting X successes out of n trials, assuming that each trial is independent and has the same probability (p) of a success.
■ The best estimate of a population proportion is the sample proportion.
■ According to the law of large numbers, very large samples will have a proportion of successes that is close to the true proportion in the population.
■ The binomial test compares the observed number of successes in a data set to that expected under a null hypothesis. The null distribution of the number of successes under H0 is the binomial distribution, and so the binomial formula can be used to calculate the P-value for the test.
■ A confidence interval for a proportion can be found using the Agresti–Coull method.

Quick Formula Summary

Binomial distribution
Formula: Pr[X successes] = \binom{n}{X} p^X (1 − p)^{n−X}, where p is the probability of success in each trial, X is the number of successes, and n is the number of trials.

Proportion
Estimate: p̂ = X/n
Standard error: The standard error of a proportion is estimated by SE_p̂ = \sqrt{p̂(1 − p̂)/n}.

Agresti–Coull 95% confidence interval for a proportion
What does it assume? A random sample.
Formula: p′ − 1.96 \sqrt{\frac{p′(1 − p′)}{n + 4}} < p < p′ + 1.96 \sqrt{\frac{p′(1 − p′)}{n + 4}}, where p′ = \frac{X + 2}{n + 4}, X is the number of successes in the sample, and n is the sample size.

Binomial test
What is it for? Tests whether a population proportion (p) matches a null expectation (p0) for the proportion.
What does it assume? A random sample.
Test statistic: The observed number of successes, X.
Formula: P = 2 \sum_{i=X}^{n} Pr[i successes] for X/n > p0, or P = 2 \sum_{i=0}^{X} Pr[i successes] for X/n < p0, where X is the observed number of successes, n is the sample size, and Pr[i successes] is the probability of obtaining i successes from n trials given by the binomial distribution.

PRACTICE PROBLEMS

1. Calculation practice: Binomial probabilities. Enterococcus bacteria are part of the normal intestinal flora of humans, but some strains can cause disease. In U.S. hospitals, 30% of pathogenic isolates are resistant to the antibiotic vancomycin (Wenzel 2004). Assume that seven independent pathogenic isolates have been extracted from patients and tested for resistance. Using the following steps, calculate the probability that five or more of the isolates are resistant to vancomycin: a. What are the assumptions of the binomial distribution? Does this example match those assumptions? b. Using the binomial distribution, what is the probability of success for this example? What is n? c. Calculate the probability of exactly five resistant isolates using the binomial distribution. d. Calculate the probability of exactly six resistant isolates, and calculate the probability of exactly seven resistant isolates. e. Using the addition principle, combine the information from the previous answers to calculate the probability of five or more resistant isolates out of seven. 2. Calculation practice: Binomial test. Do people typically use a particular ear preferentially when listening to strangers? Marzoli and Tomassi (2009) had a researcher approach and speak to strangers in a noisy nightclub. An observer scored whether the person approached turned either the left or right ear toward the questioner. Of 25 participants, 19 turned the right ear toward the questioner and 6 offered the left ear. Is this evidence of a population difference from 50% for each ear? Use the following steps to help answer this question with a binomial test. Assume that the assumptions of the binomial test are met in this study. a. State the null and alternative hypotheses for the binomial test. b. What is the observed value of the test statistic? c. Under the null hypothesis, calculate the probability of getting exactly 19 right ears and six left ears. d. List all possible outcomes in which the number of right ears is greater than the 19 observed. e. Calculate the probability under the null hypothesis of each of the extreme outcomes listed in (d). f. Use the addition rule to calculate the probability of 19 or more right-eared turns under the null hypothesis. g. Give the two-tailed P-value based on your answer to (f). h. Interpret this P-value. What does it indicate? i. State the conclusion from your test. 3. Calculation practice: Confidence interval for a population proportion. In a study in Scotland (as reported by Devlin 2009), researchers left a total of 240 wallets around Edinburgh, as though the wallets were lost. Each contained contact information including an address. Of the wallets, 101 were returned by the people who found them. With the following steps, use the data to estimate the proportion of lost wallets that are returned, and give a 95% confidence interval for this estimate. a.
What is the observed proportion of wallets that were returned? b. Calculate p' to use in the Agresti–Coull method of calculating a 95% confidence interval for the population proportion. c. Calculate the lower bound of the 95% confidence interval. d. Calculate the upper bound of the 95% confidence interval. e. Provide two values for p that lie within the most-plausible range, according to these data, and two that lie outside. f. If the authors had tested the null hypothesis that p was 1/2 at a significance level 0.05, is it likely that they would have rejected the null hypothesis (base your answer only on your results above)? 4. In 1955, John Wayne played Genghis Khan in a movie called The Conqueror. It was not an artistic success. More unfortunately, the movie was filmed downwind of the site of 11 previous aboveground nuclear bomb tests. Of the 220 people who worked on this movie, 91 had been diagnosed with cancer by the early 1980s, including Wayne, his costars, and the director. According to large-scale epidemiological data, only about 14% of people of this age group, on average, should have been stricken with cancer within this time frame.7 We want to know whether there is evidence for an increased cancer risk for people associated with this film. Assume that this probability is the same for each member of the cast. a. What is the best estimate of the probability of a member of the cast or crew getting cancer within the study interval? b. What is the standard error of your estimate? What does this quantity measure? c. What is the 95% confidence interval for this probability estimate? Does this interval bracket the typical cancer rate of 14% for people of the same age group? Interpret this result. 5. In the United States, paper currency often comes into contact with cocaine either directly, during drug deals or usage, or in counting machines where it wears off from one bill to another. A forensic survey collected fifty $1 bills and measured the cocaine content of the bills. Forty-six of the bills had measurable amounts of cocaine on them (Jenkins 2001). Assume that the sample of bills was a random sample. a. From these data, what is the best estimate of the proportion of U.S. $1 bills that have a measurable cocaine content? b. What is a 95% confidence interval for the estimate in part (a)? c. What is the correct interpretation of your 95% confidence interval? 6. For the following scenarios, state whether the binomial distribution would describe the probability distribution of possible outcomes. If not, explain why not. a. The number of red cards out of the first five cards drawn from the top of a regular, randomly shuffled deck of cards. b. The number of red balls out of 10 drawn one by one from a vat of 50 red and blue balls, if the balls are replaced and mixed after each draw. c. The number of red balls out of 10 drawn one by one from a vat of 50 red and blue balls, if the balls are not replaced after each draw. d. The number of red-eyed flies among 200 Drosophila individuals drawn at random from a large population having both red- and black-eyed flies. e. The total number of red-eyed flies in five Drosophila families, each of 40 individuals, with the families chosen at random from a large population. 7. The mite Demodex folliculorum lives in the hair follicles of humans, including the follicles of the eyelashes. (The bluish creatures in the photo on the first page of this chapter are these mites. The yellowish shaft in the photo is a human hair.) 
Having heard that “most people” have these mites living in their skin and eyelashes, we wanted to know what “most” really meant. The only data we could find were in a paper that compared 16 North American women with a skin condition called rosacea to 16 women who did not have this skin condition (call this the “control” group). Of the 16 women in the control group, 15 had these mites. All 16 of the women in the rosacea group had mites (Al Am et al. 1997). a. From these data, a researcher estimated the proportion of people in North America who have these mites as 31/32. What is wrong with this estimate? b. What is your best estimate for the proportion of North Americans without rosacea who have these mites? Assume random sampling. c. What is the 95% confidence interval for this estimate? d. What is the best estimate (with a 95% confidence interval) for the proportion of women with rosacea who have these mites? 8. In the garden spider Araneus diadematus, the female often attempts to eat the male before or after mating, making sex a daunting prospect for the males. (This seemingly bizarre behavior, called sexual cannibalism, is not uncommon in the animal world.) In a series of mating observations, the courting male was captured and eaten by the female in 21 of 52 independent mating trials (Elgar and Nash 1988). a. Based on the sample, estimate the proportion of males that are eaten by females, and give a 95% confidence interval for this estimate. b. Is this estimate and confidence interval consistent with a true proportion of 50% capture of males? Is it consistent with 10% capture? c. If the sample were much larger and the data showed instead that 210 out of 520 matings involved sexual cannibalism, would this change the estimated proportion? Would it change the confidence interval? How? (You don’t need to do the calculations—just answer qualitatively.) 9. As our planet warms, as it has done for the last century or so, the change in temperature will have major effects on life. There are basically three possibilities for what might happen to a species: it can evolve to be better adapted to the new temperature, it can move closer to the poles so that it experiences temperatures closer to what it has experienced in the past, or it can go extinct. There have been a large number of studies of the second possibility. A recent study of the range limits of European butterflies found that, of 24 species that had changed their ranges in the last 100 years, 22 had moved further north and only two had moved further south (Parmesan et al. 1999). Assume that these 24 are a random sample of butterfly species with altered ranges. Test the hypothesis that the fraction of butterfly species moving north is different from the fraction moving south. 10. Imagine that there were two studies of the prevalence of melanism (solid black coat color) in a population of leopards. One estimated that the proportion of black leopards in this population was 52%, with a 95% confidence interval that ranged from 46% to 58%. The other study estimated the proportion to be 64%, with a 95% confidence interval that ranged from 35% to 85%. a. Which study most likely had a larger sample size? b. Which estimate is more believable? 11. Mice have litters of several pups at once. The pups are arranged in a line within the mother’s uterus, so many fetuses lie between two of their siblings.
It has been shown that female fetuses located between two male fetuses (2M) experience higher testosterone levels than those adjacent to no male fetuses (0M), because the hormone is produced by the males and diffuses across the fetal membranes and through the amniotic fluid to adjacent females. The higher fetal testosterone levels are known to have several effects on the females later in life, including increased aggression levels, growth rates, and even which eye opens first. One study wanted to measure whether fetal testosterone levels affected how attractive these female mice were to male mice as adults (vom Saal and Bronson 1980). Twenty-four male mice were given a choice between a female that was 0M and another unrelated female that was 2M. (Both the males and females were randomly chosen from their populations.) Each male was placed on a platform, and he could jump into the cage of whichever female he preferred. Nineteen of the 24 males chose the 0M female. a. Is this evidence that the males prefer one type of female over the other? b. If the two females presented to each male in a given trial had been sisters, would this have been a better or worse experimental design? Why? 12. A giant vat contains large numbers of two types of bacteria, called strain A and strain B. Assume that 30% of the bacteria in the vat are strain A, and the other 70% are strain B. The vat is well mixed. Twenty technicians each collect a random sample of 15 cells from the vat and then determine which strain each cell belongs to. a. What will the proportion of strain A cells be in these 20 samples, on average? b. Each technician counts the number of strain A bacteria cells in his or her sample. To what distribution should the number of strain A bacteria in samples conform? c. Each technician counts the proportion of strain A cells in his or her sample. What should the standard deviation be among samples in this proportion? d. Each technician correctly calculates a 95% confidence interval for the proportion of A cells in the vat. On average, what fraction of technicians will construct an interval that includes the value 0.3? 13. Twelve six-sided dice are rolled. Assume all of the dice are fair. a. How many threes do you expect, on average, out of these 12 rolls? b. What is the probability of rolling no threes out of the 12 rolls? c. What is the probability of rolling exactly 3 threes out of the 12 rolls? d. On average, what is the sum of the number of spots showing on top of the 12 dice? e. What is the probability that all the dice show only ones and sixes (in any proportion)? 14. A common perception is that people suffering from chronic illness may be able to delay their deaths until after a special upcoming event, like Christmas. Out of 12,028 deaths from cancer in either the week before or after Christmas, 6052 happened in the week before (Young and Hade 2004). a. What is the best estimate of the proportion of deaths out of this time interval that occurred in the week before Christmas? b. What is the 95% confidence interval for this estimate? c. Use this confidence interval to ask, “Are these data consistent with a true value of 50% for the percentage of deaths in the week before Christmas?” Do these data support the common perception? 15. In 1964, Ehrlich and Raven proposed that plant chemical defenses against attack by herbivores would spur plant diversification. In a test of this idea (Farrell et al. 
1991), the numbers of species were counted in 16 pairs of sister clades8 in which the clades of each pair differed in their level of chemical protection. In each pair of sister clades, the plants of one clade were defended by a gooey latex or resin that was exuded when the leaf was damaged, whereas the plants of the other clade lacked this defense. In 13 of the 16 pairs, the clade with latex/resin was found to be the more diverse (had more species), whereas in the other three pairs the sister clade lacking latex/resin was found to be the more diverse. a. With these data, test whether the clade having latex/resin and its sister clade lacking it are equally likely to be more diverse. b. Is this a controlled experiment or an observational study? Why? 16. In the same survey of the cocaine content of currency discussed in Practice Problem 5, heroin was detected on seven of the 50 bills. a. What is the best estimate of the proportion of U.S. $1 bills that have heroin on them? b. What is the standard error of the estimate? What does this quantity measure? c. What is the 95% confidence interval for this estimate? d. If the proportion that you estimated from these data were in fact the true proportion in the population, what would be the probability of getting exactly seven bills with heroin when 50 bills are randomly sampled? 17. The Global Amphibian Assessment in 2005 declared that 1856 out of 5743 of the known species of amphibians worldwide are at risk of extinction. (Approximately 122 species have gone extinct already since 1980.) Amphibians are vulnerable to environmental change, so they are thought to be a bellwether of coming changes for other species. a. What is the proportion of amphibian species that are at risk of extinction? b. Would it make sense to calculate a confidence interval for this proportion in the usual way? Why or why not? ASSIGNMENT PROBLEMS 18. We all believe that we see most of what goes on around us, at least the most obvious things. Recently, however, psychologists have identified a phenomenon called “selective looking,” which means that, if our attention is drawn to one aspect of what we see, we can miss even seemingly obvious features presented at the same time. In a striking demonstration of this phenomenon, a series of randomly chosen students was shown a video of six people throwing a basketball around, and the students were asked to count how many times the people in white shirts threw the ball (Simons and Chabris 1999). In the middle of this video,9 a woman dressed as a gorilla walked through the shot, pausing in the center to thump her chest, and then walked out of the shot. Look at the photo, and you will realize that nothing could be more obvious. Or was it? Of the 12 students watching the video, only five noticed the gorilla. a. What is the best estimate from these data of the proportion of students in the population who notice the woman in the gorilla suit? b. What is the 95% confidence interval for the proportion of students in the population who notice the woman in the gorilla suit? c. What is the best estimate from these data of the proportion of students who fail to notice the woman in the gorilla suit? 19. A survey in the U.K. interviewed shoppers encountered in grocery stores about whether they had ever received injuries as a result of food or drink packaging, such as cuts sustained while cleaning up broken glass containers (Winder et al. 2002).
Of the 200 who agreed to participate, 109 had received injuries “over the last few years” (27% of those injuries were significant enough to be treated by a doctor or emergency room). a. What is the best estimate, and 95% confidence interval, for the proportion of shoppers in the population who have injured themselves with their food or drinks? b. Do you think this was a random sample of all U.K. consumers? What factors might have rendered it a non-random sample? 20. A Royal Society for the Prevention of Cruelty to Animals (RSPCA) survey of 200 randomly chosen Australian pet owners found that 10 said that they had met their partner through owning the pet (RSPCA 2005). Find the 95% confidence interval for the proportion of Australian pet owners who find love through their pets. 21. One classical experiment on ESP (extrasensory perception) tests for the ability of an individual to show telepathy—to read the mind of another individual. This test uses five cards with different designs, all known to both participants. In a trial, the “sender” sees a randomly chosen card and concentrates on its design. The “receiver” attempts to guess the identity of the card. Each of the five cards is equally likely to be chosen, and only one card is the correct answer at any point. a. Out of 10 trials, a receiver got four cards correct. What is her success rate? What is her expected rate of success, assuming she is only guessing? b. Is her higher actual success rate reliable evidence that the receiver has telepathic abilities? Carry out the appropriate hypothesis test. c. Assume another (extremely hypothetical) individual tried to guess the ESP cards 1000 times and was correct 350 of those times. This is very significantly different from the chance rate, yet her proportion of successes is lower than that of the individual in part (a). Explain this apparent contradiction. 22. In a test of Murphy’s law, pieces of toast were buttered on one side and then dropped. Murphy’s law predicts that they will land butter-side down. Out of 9821 slices of toast dropped, 6101 landed butter-side down. (Believe it or not, these are real data!10) a. What is a 95% confidence interval for the probability of a piece of toast landing butter-side down? b. Using the results of part (a), is it plausible that there is a 50:50 chance of the toast landing butter-side down or butter-side up? 23. Out of 67,410 surgeries tracked in a study in the U.K., 2832 were followed by surgical site infections (Coello et al. 2005). a. What is a 95% confidence interval for the proportion of surgeries followed by surgical site infection in the U.K.? Assume that the data are a random sample. b. If the study had followed a total of only 674 surgeries from the same population, would the confidence interval have been wider or narrower? 24. Each member of a large genetics class grows 12 pea plants from an independent pea family. Each family is expected to have 3/4 plants with smooth peas and 1/4 plants with wrinkled peas. a. On average, how many wrinkled pea plants will a student see in her 12 plants? b. What is the standard deviation of the proportion of wrinkled pea plants per student? c. What is the variance of the proportion of wrinkled pea plants per student? d. Predict what proportion of the students saw exactly two wrinkled pea plants in their sample. 25. Juvenile long-tailed tits (Aegithalos caudatus), a European relative of the chickadee, “help” adult birds raise offspring, such as by feeding their nestlings.
What is the evolutionary advantage of helping behavior: practice for parenthood; increased chances of inheriting the adults' territory in the future; or indirect genetic benefits via increased success of kin? To investigate, Russell and Hatchwell (2001) monitored the behavior of 17 juveniles, each of which lived equidistant from two nests of adult birds. In each case, one nest was parented by a relative of the helper, and the other was parented by non-kin adults. Sixteen of the juveniles helped at the nest of their kin, whereas one helped at the non-kin nest. Do these results provide evidence for preferential helping at the nests of kin? Conduct the appropriate test.
26. In a blind taste test, do people prefer pâté or dog food? To investigate, Bohannon et al. (2010) presented 18 college-educated adults with unlabeled samples of dog food (Newman's Own Organics Canned Turkey & Chicken) and four meat products meant for humans (duck liver mousse, pork liver pâté, liverwurst, and Spam). Participants were asked to rank their preferences. Two of 18 participants ranked the dog food first, whereas the other 16 participants chose one of the other items. Based on these results, can we conclude that people are less likely to prefer dog food over all human food than would be expected by chance?
27. Mood variation is related to photoperiod in some people, and the likelihood of depression increases in the winter months. As a result, people often assume that suicide rates increase in winter. A study in Finland (Valtonen et al. 2006) divided the year 1997 into equal halves and compared the number of suicides in "winter" (24 September to 19 March) and "summer" (remainder of year). Out of a total of 1636 suicides, 766 were in winter and 870 were in the summer. Based on these data, estimate the proportion of suicides that occurred in winter, assuming that the suicides were independent. Are the data compatible with a greater suicide rate in winter than summer, based on a 95% confidence interval?
28. Biff and Dilara were having an argument over what fraction of people would likely go out of their way to drive over a live organism if it were standing innocently by the side of the road. Dilara, whose heart is pure, guessed that fewer than 2% of people would behave that badly, roughly the proportion of people who score as psychopaths in standard testing. Biff, who isn't revealing what he knows, guessed that the fraction would be higher, perhaps 5%. To settle the debate, they analyzed data from an experiment in which a rubber facsimile of a turtle, a tarantula spider, a snake, or a leaf was placed on the paved shoulder of a two-way road (Rober 2012).11 Of 1000 vehicles observed to drive by, 60 swerved onto the shoulder in an effort to drive over the rubber organism. Let's assume (perhaps unrealistically) that each vehicle represents an independent trial and that the probability of someone attempting to flatten the rubber organism is the same for each organism. Are these data consistent with a fraction of 2%? Are they consistent with a fraction of 5%?
29. Refer to Practice Problem 1.
a. Produce a graph illustrating the probability distribution of the number of vancomycin-resistant isolates when seven isolates are randomly sampled from the U.S. hospital Enterococcus population.
b. Does the distribution in (a) represent an ordinary sampling distribution or, more specifically, a null sampling distribution? Explain your answer.
30.
Biff and Dilara were arguing over a news feed about a population of catfish (Silurus glanis) in France that had figured out how to hunt feral pigeons. A submerged catfish lying in wait would rush into shallow water (like a killer whale beaching itself when hunting seals), grab a pigeon that had been drinking or bathing at the water's edge, drag it into deeper water, and swallow it. Biff was skeptical and suggested that if the story had merit, the chance of a successful capture would surely be low: less than 10%. Dilara suspected that for the strategy to be profitable, the success rate would need to be fairly high, perhaps as high as 40%. To investigate, they tracked down the original article (Cucherousset et al. 2012), which reported that 15 out of 54 attempts were successful. Are these data consistent with Dilara's conjecture, Biff's argument, both, or neither?
31. Refer to Practice Problem 15.
a. Plot the null probability distribution for this case.
b. What is the meaning of this distribution?

INTERLEAF 4 Correlation does not require causation

Science is aimed at understanding how the world works. Identifying the causes of events is a key part of the process of science. A first step in that process is the discovery of patterns, which usually involves noticing associations between events. When two events are associated, the possibility is raised that one is the cause of the other. For example, our ancestors noticed that chewing on willow bark made sufferers of headaches and fevers feel better. We now know that the association between bark-chewing and pain relief is explained by the presence of salicylic acid in the bark. The acid blocks the release of prostaglandins in the body, which mediate pain and inflammation. This association led to the invention of aspirin.
An association (or correlation) between variables is a powerful clue that there may be a causal relationship between those variables. Smoking is associated with lung cancer; drinking is associated with fatal automobile accidents; taking streptomycin is associated with reduced bacterial infection. These associations exist because one thing causes the other, and the causal relationship was discovered because someone noticed their correlation.
The problem is that two variables can be correlated without one being the cause of the other. A correlation between two variables can result instead from a common cause. For example, the number of violent crimes tends to increase when ice cream sales increase. Does this mean that violence instills a deep need for ice cream? Or does it mean that squabbles over who ate the last of the Chunky Monkey® escalate to violence? Perhaps, but the more likely explanation for this association is that they share a common cause: hot weather encourages both ice cream consumption and bad behavior. Ice cream sales and violence are correlated, but this doesn't mean that one causes the other.
If we plot the mean life expectancy of the people of a country against the number of televisions per person in that country, we see quite a strong relationship (Rossman 1994). But the magical healing powers of the TV have yet to be demonstrated. Instead, it is likely that both televisions per capita and life expectancy have a common cause in the overall wealth of the citizens of the country. These examples demonstrate the problems posed by confounding variables.
A confounding variable is an unmeasured variable that changes in tandem with one or more of the measured variables; this gives the false appearance of a causal relationship between the measured variables. The apparent relationship between violence and ice cream sales is probably the result of the confounding variable temperature. Overall wealth, or something related to it, is probably the confounding variable in the correlation between the number of televisions and life expectancy.
Even more confusingly, a correlation between two variables may actually result from reverse causation. That is, the variable identified as the effect by the researcher may actually be the cause. For example, studies have repeatedly shown that babies who are breast-fed grow slightly more slowly than babies fed with formula. This has been interpreted as evidence that breast-feeding causes slow growth, but the truth turns out to be the reverse. Babies who grow rapidly are more likely to feed more, to be more demanding, and to be moved off the breast onto formula to give the poor mothers a break. Experimental studies confirm that exclusive breast-feeding has no measurable effect on infant growth (Kramer and Kakuma 2002).
These examples demonstrate the limitations of observational studies, which illuminate patterns but are unable to fully disentangle the effects of measured explanatory variables and unmeasured confounding variables. The main purpose of experimental studies is to disentangle such effects. In an experiment, the researcher is able to assign participants randomly to different treatment groups. Random assignment breaks the association between the confounding variable and the explanatory variable, allowing the causal relationship between treatment and response variables to be assessed.
Sir Ronald Fisher, our hero in many other respects, never believed that smoking caused lung cancer; instead, he thought that smoking may be caused by a genetic predisposition and that this genetic effect might also predispose one to cancer.1 In other words, he thought that the genotype of an individual was a confounding factor. Fisher himself invented the experimental design that would make it possible to test his claim. In theory, one could assign participants randomly to smoking and nonsmoking treatments, and any underlying correlation with genetics would be broken, because, on average, equal numbers of participants genetically predisposed to cancer would be assigned to each treatment. If Fisher's hypothesis were correct, the smokers would not have an increased cancer rate. Such an experiment is not ethically possible with humans, but it has been done with other species.
Finding correlations and associations between variables is the first step in developing a scientific view of the world. The next step is determining whether these relationships are causal or coincidental. This requires careful experimentation.

8 Fitting probability models to frequency data

Fossil marine diatom, Actinoptychus heliopelta

The binomial test, introduced in the preceding chapter, is an example of a "goodness-of-fit test." A goodness-of-fit test is a method for comparing an observed frequency distribution with the frequency distribution that would be expected under a simple probability model governing the occurrence of different outcomes. In Chapter 6, for example, we tested the simplest probability model imaginable: whether left- and right-handed toads occur with equal probability in a population.
In Chapter 7, we tested whether the frequency of sperm genes on the X chromosome is proportional to the size of that chromosome. Rejecting the null hypothesis in both cases confirmed real patterns in nature. The binomial test, however, is limited to categorical variables with only two possible outcomes. In this chapter, we introduce a more general goodness-of-fit test that allows us to handle categorical and discrete numerical variables having more than two outcomes. It also allows us to assess the fit of more complex probability models. We also show how to test the fit between observed frequency data and the frequencies predicted by simple probability models and how to interpret the results if the null hypothesis is rejected.

Example of a probability model: the proportional model

The proportional model is a simple probability model in which the frequency of occurrence of events is proportional to the number of opportunities. We encountered the proportional model in Example 7.2 when we tested whether the frequency of sperm genes on the mouse X chromosome was proportional to the size of that chromosome. Here we will explore the χ2 goodness-of-fit test on a proportional model by using the data from Example 8.1.

EXAMPLE 8.1 No weekend getaway

The U.S. National Center for Health Statistics records information on each new baby born, such as time and date of birth, weight, and sex (Ventura et al. 2001). One bit of information available from these data is the day of the week on which each baby was born. Under the proportional model, we would expect that babies should be born at the same frequency on all seven days of the week. But is this true? Table 8.1-1 lists the number of babies born on each day of the week in a random sample of 350 births from 1999. Figure 8.1-1 displays these data in a bar graph.

TABLE 8.1-1 Day of the week for 350 births in the U.S. in 1999.

Day          Number of births
Sunday       33
Monday       41
Tuesday      63
Wednesday    63
Thursday     47
Friday       56
Saturday     47
Total        350

FIGURE 8.1-1 Bar graph of the day of the week for 350 births in the U.S. in 1999.

The data show a lot of variation in the number of births from one day to the next during the week, from a low of 33 on Sundays to a high of 63 on Tuesdays and Wednesdays. Under the proportional model, which will be the null hypothesis, the number of births on Monday should be proportional to the number of Mondays in 1999, except for chance differences. The same should be true for the other days of the week. Does the variation among days evident in Table 8.1-1 represent only chance variation? We can test the fit of the proportional model to the data with a χ2 goodness-of-fit test.

χ2 goodness-of-fit test

The χ2 goodness-of-fit test uses a test statistic called χ2 to measure the discrepancy between an observed frequency distribution and the frequencies expected under a simple probability model serving as the null hypothesis. (χ is the Greek letter "chi" and is usually pronounced "kye" in English.) The simple model is rejected if the discrepancy, χ2, is too large.

The χ2 goodness-of-fit test compares frequency data to a probability model stated by the null hypothesis.

Null and alternative hypotheses

Under the proportional model, each day of the week should have the same probability of a birth, that is, 1/7 (see Example 8.1). This is the simplest possible model, so it's our null hypothesis:

H0: The probability of birth is the same on every day of the week.
HA: The probability of birth is not the same on every day of the week.
Once again, the null (H0) and alternative (HA) hypotheses are statements about the population from which the data are a random sample. The null hypothesis is very specific, describing the expectation under the simple probability model. The alternative hypothesis is not specific, because it includes every other possibility.

Observed and expected frequencies

Because the proportional model is the null hypothesis, we use it to generate the expected frequency of births on each day of the week. We expect the accumulated number of births on each day of the week to reflect the number of times each day of the week occurred in 1999. It turns out that in 1999 every day of the week occurred 52 times, except Friday, which occurred 53 times. In Table 8.2-1, we divide these numbers by 365, the total number of days in 1999, yielding proportions.

TABLE 8.2-1 Expected frequency of births on each day of the week in 1999 under the proportional model.

Day          Number of days in 1999   Proportion of days in 1999   Expected frequency of births
Sunday       52                       52/365                       49.863
Monday       52                       52/365                       49.863
Tuesday      52                       52/365                       49.863
Wednesday    52                       52/365                       49.863
Thursday     52                       52/365                       49.863
Friday       53                       53/365                       50.822
Saturday     52                       52/365                       49.863
Sum          365                      1                            350

We can now use these proportions to calculate the expected frequencies of births for each day of the week under the proportional model. For example, there were 350 total births in the data set, and under H0 the fraction 52/365 of them should have occurred on Sundays. The expected frequency of births for Sunday is therefore

Expected = 350 × (52/365) = 49.863.

Note that expected frequencies can have fractional components, even if, in any given case, the number of individuals per category will be an integer. The expected frequencies are the average values expected with the null model. The sum of the expected values should be the same as the sum of the observed values (i.e., 350, except for rounding error). If this isn't the case, you need to check your calculations for errors.

The χ2 test statistic

The χ2 statistic measures the discrepancy between the observed and expected frequencies. It is calculated by the following sum:

χ2 = Σ_i (Observed_i − Expected_i)^2 / Expected_i.

Observed_i is the frequency of individuals observed in the i-th category, and Expected_i is the frequency expected in that category under the null hypothesis. The numerator of this quantity is a difference between the data and what was expected, which is squared so that positive and negative deviations are treated equally. When we divide this squared deviation by the expected value, the deviation of the observed and expected is scaled to the expected value.

The χ2 statistic measures the discrepancy between observed frequencies from the data and expected frequencies from the null hypothesis.

It's important to notice that the χ2 calculations use the absolute frequencies (i.e., counts) for the observed and expected frequencies, not proportions or relative frequencies. Using proportions in the calculation of χ2 will give the wrong answer.
In the Example 8.1 data set, i can take on the values 1 through 7, where Sunday = 1, Monday = 2, and so on. If the observed frequencies in all categories exactly matched the expected frequencies under the null hypothesis, χ2 would be zero. The larger χ2 is, the greater is the discrepancy between the data and the frequencies expected under the null hypothesis. To determine χ2, we must calculate (Observed_i − Expected_i)^2 / Expected_i for each day of the week.
For Sundays, for example, i = 1 and

(Observed_1 − Expected_1)^2 / Expected_1 = (33 − 49.863)^2 / 49.863 = 5.70.

Repeating this calculation for the rest of the days, we obtain the values shown in the last column of Table 8.2-2. (Make sure you can obtain these same values for yourself.) Note that this discrepancy is largest for Sundays, which has the largest difference between the number of observed births and the number of expected births.

TABLE 8.2-2 Observed and expected numbers of births on each day of the week under the proportional model.

Day          Observed number of births   Expected number of births   (Observed − Expected)^2/Expected
Sunday       33                          49.863                      5.70
Monday       41                          49.863                      1.58
Tuesday      63                          49.863                      3.46
Wednesday    63                          49.863                      3.46
Thursday     47                          49.863                      0.16
Friday       56                          50.822                      0.53
Saturday     47                          49.863                      0.16
Sum          350                         350                         15.05

Adding these up, we get

χ2 = 5.70 + 1.58 + 3.46 + 3.46 + 0.16 + 0.53 + 0.16 = 15.05.

The χ2 statistic is the test statistic for the χ2 goodness-of-fit test, the quantity measuring the level of agreement between the data and the null hypothesis. All we need now is the sampling distribution of the χ2 test statistic under H0. This will allow us to decide whether χ2 = 15.05 is large enough to warrant rejection of the null hypothesis.

The sampling distribution of χ2 under the null hypothesis

Recall from Chapter 6 that we can determine the sampling distribution for a test statistic under the null hypothesis by computer simulation.1 This approach is tedious and not recommended, but we ran the simulation for these data just to show you what the approximate null distribution for the χ2 statistic looks like (see the histogram in Figure 8.2-1).

FIGURE 8.2-1 A histogram showing the approximate sampling distribution of χ2 values under the null hypothesis that births in 1999 occurred with the same probability on each day of the week (Example 8.1). The solid black curve shows the theoretical χ2 probability distribution with six degrees of freedom. The curve provides an excellent approximation of the sampling distribution of the χ2 test statistic under the null hypothesis.

Fortunately, there's an easier way to obtain the null sampling distribution for the χ2 statistic. The null distribution is well approximated by the theoretical χ2 distribution, which has a known mathematical form. (See the solid curve superimposed on the histogram in Figure 8.2-1.) Happily, the key features of this theoretical χ2 distribution (hereafter referred to as "the χ2 distribution") have been compiled in tables that are easy and quick to use. We demonstrate the use of the tables in the next two subsections of Section 8.2.
The χ2 distribution is a mathematical function, and to use it we need to specify a number called the degrees of freedom (df, for short). The number of degrees of freedom of a χ2 statistic specifies which χ2 distribution to use as the null distribution. The degrees of freedom for a χ2 goodness-of-fit test is calculated using the following formula.2

df = (Number of categories) − 1 − (Number of parameters estimated from the data).

Ignore the last term of this formula (i.e., "Number of parameters estimated from the data") for now, because it is zero for the birth data. We explain what this term means in Sections 8.5 and 8.6, where we first use it. The birth data have seven categories (one for each day of the week), so the number of degrees of freedom is df = 7 − 1 = 6. This tells us that we need to compare our χ2 value calculated from the birth data (χ2 = 15.05) to the χ2 distribution with six degrees of freedom, written χ2_6 (the subscript gives the degrees of freedom).
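The simulation described above is easy to reproduce, and doing so gives a concrete check that the χ2_6 distribution matches the null sampling distribution. Below is a minimal sketch in Python (the book itself prescribes no software; numpy and scipy are one option, and the variable names are ours, not the book's):

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)

# Null probabilities for Sunday..Saturday in 1999 (52/365 each, 53/365 for Friday)
p_null = np.array([52, 52, 52, 52, 52, 53, 52]) / 365
expected = 350 * p_null

# Simulate many samples of 350 births under H0 and compute chi^2 for each
sims = rng.multinomial(350, p_null, size=100_000)
chi2_sim = ((sims - expected) ** 2 / expected).sum(axis=1)

# The simulated null distribution closely matches the theoretical chi^2 with 6 df
print((chi2_sim >= 15.05).mean())   # fraction of simulated chi^2 >= observed value
print(chi2.sf(15.05, df=6))         # theoretical tail area beyond 15.05

In a run of this sketch, the simulated tail fraction and the theoretical tail area agree closely, which is exactly the point of Figure 8.2-1.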
The χ2 distribution with six degrees of freedom is shown as the black curve in Figure 8.2-1. Note how similar the χ2 distribution (the solid line) is to the simulated distribution (the histogram), only smoother. The smallest possible value for χ2 is zero. On the right, the χ2 distribution extends to positive infinity. This distribution is the one we will use to calculate the P-value for our test.

Calculating the P-value

The P-value for a test is the probability of getting a result as extreme as, or more extreme than, the result observed if the null hypothesis were true. For the χ2 goodness-of-fit test, the P-value is the probability of getting a χ2 value greater than the observed χ2 value calculated from the data (χ2 = 15.05 for the birth data).
Remember that if the data exactly matched the expectation of the null hypothesis, χ2 would be zero. A deviation in either direction between an observed frequency and the expected frequency causes χ2 to be greater than zero. Greater deviations from the null expectation result in a higher value of χ2. As a result, we use only the right tail of the χ2 distribution to calculate P.
The χ2 distribution is a continuous probability distribution, so probability is measured by the area under the curve, not the height of the curve (Chapter 5). The probability of getting a χ2 value greater than or equal to a single specified value, which is what we need to calculate a P-value, is equal to the area under the curve to the right of that value extending to positive infinity. The data from Example 8.1 yielded χ2 = 15.05. The probability of getting a χ2 value of 15.05 or greater is equal to the area under the χ2_6 curve beyond 15.05, as shown by the region highlighted in red in Figure 8.2-2.

FIGURE 8.2-2 The χ2 distribution with six degrees of freedom. The red area shows the probability of getting a χ2 value greater than or equal to 15.05.

How do we go about finding this area beyond the measured χ2 value? Two options are available. First, most computer statistical packages will provide the P-value directly: P = 0.0199. At the standard significance level of α = 0.05, such a small P-value leads us to reject our null hypothesis. That is, these data provide evidence that births are not "randomly" distributed over the days of the week. The variation among days of the week in number of births listed in Table 8.1-1 is simply too large to be explained by chance. The second option for assessing the P-value uses critical values, as we discuss in the next subsection.

Critical values for the χ2 distribution

The second way to calculate the P-value for a χ2 statistic does not require a computer. This method uses tables of critical values to set bounds on the P-value. A critical value is the value of a test statistic that marks the boundary of a specified area in the tail (or tails) of the sampling distribution under H0. If we want a significance level of α = 0.05, for example, we would need to know the value of χ2 for which the area under the curve to its right is 0.05. This value of χ2 is called the "critical value corresponding to α = 0.05."

A critical value is the value of a test statistic that marks the boundary of a specified area in the tail (or tails) of the sampling distribution under H0.

Statistical Table A at the back of this book (p. 703) gives critical values for the χ2 distribution. An excerpt from this table is shown in Table 8.2-3.
To read the table, first find the column corresponding to the significance level of interest (we will use the standard α = 0.05). Then find the row corresponding to the number of degrees of freedom for the test statistic (df = 6 for Example 8.1). The number in the corresponding cell of the table is the critical value, written χ2_0.05,6 = 12.59 (the subscripts give the significance level and the degrees of freedom). Under the null hypothesis, the probability of obtaining a χ2 value as extreme as, or more extreme than, 12.59 is 0.05:

Pr[χ2_6 ≥ 12.59] = 0.05.

TABLE 8.2-3 An excerpt from the table of χ2 critical values (Statistical Table A). Numbers down the left side are the number of degrees of freedom (df). Numbers across the top are significance levels (α). The critical value for a χ2 distribution with df = 6 and α = 0.05 is 12.59 (indicated in red).

df    0.999      0.995     0.99      0.975     0.95      0.05     0.025    0.01     0.005    0.001
1     0.000002   0.00004   0.00016   0.00098   0.00393   3.84     5.02     6.63     7.88     10.83
2     0.002      0.01      0.02      0.05      0.10      5.99     7.38     9.21     10.60    13.82
3     0.02       0.07      0.11      0.22      0.35      7.81     9.35     11.34    12.84    16.27
4     0.09       0.21      0.30      0.48      0.71      9.49     11.14    13.28    14.86    18.47
5     0.21       0.41      0.55      0.83      1.15      11.07    12.83    15.09    16.75    20.52
6     0.38       0.68      0.87      1.24      1.64      12.59    14.45    16.81    18.55    22.46
7     0.60       0.99      1.24      1.69      2.17      14.07    16.01    18.48    20.28    24.32
8     0.86       1.34      1.65      2.18      2.73      15.51    17.53    20.09    21.95    26.12

Figure 8.2-3 illustrates the area under the curve to the right of 12.59.

FIGURE 8.2-3 The χ2 distribution with six degrees of freedom. The area under the right tail of the curve in red is 5% of the total area under the curve. This is the region to the right of χ2 = 12.59. Under the null hypothesis, χ2_6 will be greater than 12.59 with probability 0.05.

Because our observed χ2 value (15.05) is greater than 12.59 (i.e., further out in the right tail of the distribution), χ2 values of 15.05 or greater occur more rarely under the null hypothesis than 5% of the time. Therefore, our P-value must be less than 0.05,

P = Pr[χ2_6 ≥ 15.05] < 0.05,

so we reject the null hypothesis.
We can use Statistical Table A to get closer to the true P-value. Note that Statistical Table A also includes columns of critical values for other values of α. Our observed χ2 test statistic (15.05) falls between the critical value for α = 0.025 (χ2_0.025,6 = 14.45) and that for α = 0.01 (χ2_0.01,6 = 16.81). Our observed test statistic is greater than 14.45 but less than 16.81. Thus, Statistical Table A makes it possible to bound the P-value as 0.01 < P < 0.025. This would be reported as just P < 0.025 in a scientific paper. Note that this conclusion is consistent with the P-value of 0.0199 calculated by a computer statistics package. Based on this analysis, we can conclude that births are not equitably distributed over the days of the week.3

Assumptions of the χ2 goodness-of-fit test

The χ2 goodness-of-fit test assumes that the individuals in the data set are a random sample from the whole population. This means that each individual was chosen independently of all of the others and that each member of the population was equally likely to find its way into the sample. This is an assumption of every test described in this book.
The sampling distribution of the χ2 statistic follows a χ2 distribution only approximately. The approximation is excellent, as long as the following rules4 are obeyed:
■ None of the categories should have an expected frequency less than one.
■ No more than 20% of the categories should have expected frequencies less than five.
Notice that these restrictions refer to the expected frequencies, not to the observed frequencies.
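For the birth data, the entire test takes only a few lines in a statistics package. Here is a minimal sketch in Python using scipy's chisquare function (the variable names are ours); it also checks the two rules of thumb just listed:

from scipy.stats import chisquare

observed = [33, 41, 63, 63, 47, 56, 47]                      # Sunday..Saturday births
expected = [350 * d / 365 for d in (52, 52, 52, 52, 52, 53, 52)]

# Rules of thumb: no expected count < 1, and at most 20% of cells < 5
assert min(expected) >= 1
assert sum(e < 5 for e in expected) <= 0.2 * len(expected)

result = chisquare(f_obs=observed, f_exp=expected)  # df = 7 - 1 = 6 by default
print(result.statistic, result.pvalue)              # ~15.05, P ~ 0.0199

Both rules are comfortably satisfied for the birth data; the next paragraph describes the options when they are not.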
If these conditions are not met, the test becomes unreliable, and we have two options. One option, if possible, is to combine some of the categories having small expected frequencies to yield fewer categories having larger expected frequencies (remember to change the degrees of freedom accordingly). We can do this only if the new combined categories make biological sense. We'll see examples of this approach in Section 8.6. A second option is to find an alternative to the χ2 goodness-of-fit test, perhaps making use of computer simulation (Chapter 19).

Goodness-of-fit tests when there are only two categories

The χ2 goodness-of-fit test works even when there are only two categories, so it's a quick substitute for the binomial test (Chapter 7), provided that the assumptions of the χ2 test are met. The calculations are much quicker than those required for the binomial test, although they are less exact. We demonstrate these calculations using Example 8.4.

EXAMPLE 8.4 Gene content of the human X chromosome

The sex chromosomes are inherited in a very different pattern from that of the other chromosomes, which is known to affect their evolution in many ways. Are they unusual in other ways? For example, are there as many genes on the human X chromosome as we would expect from its size? The Human Genome Project has found 781 genes on the human X chromosome, out of 20,290 genes found so far in the entire genome.5 The X chromosome represents 5.2% of the DNA content of the whole human genome, so under the proportional model we would expect 5.2% of the genes to be on the X chromosome. Is this what we observe?
The null and alternative hypotheses are

H0: The percentage of human genes on the X chromosome is 5.2%.
HA: The percentage of human genes on the X chromosome is not 5.2%.

Observed frequencies and the frequencies expected under H0 are listed in Table 8.4-1. The expected number of genes on the X chromosome, under the null hypothesis, is 20,290 × 0.052 = 1055. We observed only 781, which is substantially fewer. What is the probability of a result as extreme as, or more extreme than, the result observed assuming the null hypothesis?

TABLE 8.4-1 Numbers of genes on the human X chromosome and on the rest of the genome.

Chromosome   Observed   Expected
X            781        1,055
Not X        19,509     19,235
Total        20,290     20,290

It would be a challenge to calculate the P-value using the binomial test. We would need to calculate

P = 2 × Pr[X ≤ 781].

When the number of trials (genes) is n = 20,290 and the probability of a gene being on the X chromosome is p = 0.052, this number P would be calculated as

P = 2 × (Pr[X = 0] + Pr[X = 1] + Pr[X = 2] + ... + Pr[X = 781]).

The tedium of this sum causes the mind to boggle.6 It would be much faster to calculate the P-value using the χ2 goodness-of-fit test. This procedure yields

χ2 = (781 − 1055)^2 / 1055 + (19,509 − 19,235)^2 / 19,235 = 75.1.

This test statistic has two categories and, therefore, only one degree of freedom: df = 2 − 1 = 1. From Statistical Table A, we see that the critical value of the χ2 distribution with one degree of freedom for a significance level α = 0.05 is 3.84. Because our observed χ2 = 75.1 is greater than 3.84, we know that P < 0.05, and we reject the null hypothesis. In fact, we can use Statistical Table A to be even more precise. Because our calculated χ2 is greater than the largest critical value given for one degree of freedom (χ2_0.001,1 = 10.83), we can say that P < 0.001. Thus, there are significantly fewer genes on the X chromosome in humans than would be expected from its size.
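On a computer, the exact binomial calculation that boggles the mind by hand is painless. The sketch below (Python with scipy; scipy.stats.binomtest computes the exact two-sided binomial P-value) runs both tests side by side:

from scipy.stats import binomtest, chisquare

n_genes, n_x, p0 = 20290, 781, 0.052

# chi^2 goodness-of-fit with two categories (X vs. not X), df = 1
obs = [n_x, n_genes - n_x]
exp = [n_genes * p0, n_genes * (1 - p0)]
print(chisquare(obs, exp))                 # chi^2 ~ 75.1, P far below 0.001

# Exact binomial test -- no tedious sum required
print(binomtest(n_x, n_genes, p0).pvalue)

Both approaches reject H0 decisively; with expected frequencies this large, the χ2 test is a very close, fast approximation to the exact test.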
Remember, the P-value reflects the weight of evidence against the null hypothesis, not how big the difference is between the true proportion and the null expectation of 0.052. We can estimate the true proportion of genes on the X chromosome as p̂ = 781/20,290 = 0.038, yielding a 95% confidence interval of 0.036 < p < 0.041. This reveals that the proportion of genes on the X chromosome is modestly smaller than the expectation of 0.052.
When there are only two categories, the binomial test is the best option when n is small and when the expected frequencies are too low to meet the assumptions of the χ2 goodness-of-fit test. Even when n is large, though, the binomial test is preferred when a computer is available, because it yields an exact P-value.7

Fitting the binomial distribution

The proportional model is not the only probability model that can be tested using a goodness-of-fit approach. Biologists often fit their data to other probability distributions that also represent simple models for how nature behaves. By "model" we mean a mathematical description that mimics how we think a natural process works, or at least how it would work in the absence of complications. For this reason, probability distributions are used as null hypotheses in many branches of biology.

EXAMPLE 8.5 Designer two-child families?

In Chapter 5, we claimed that the sex of consecutive children is independent in humans. For example, having had one boy already does not change the probability that the next child will also be a boy. In the absence of complications, then, we expect the numbers of sons and daughters in families containing two children to match a binomial distribution with n = 2 and p equal to the probability of having a son in any single trial. Is this what we see? Rodgers and Doughty (2001) tested this hypothesis using data from the National Longitudinal Survey of Youth, which compiles data on the sex of children in a random sample of families of different sizes. Table 8.5-1 lists the number of sons in 2444 two-child families.

TABLE 8.5-1 The frequency distribution of the number of boys in families with two children.

Number of boys   Observed number of families
0                530
1                1332
2                582
Total            2444

There are three possible outcomes for families containing exactly two children: zero, one, or two boys. We can test the fit of the binomial distribution to the data in Table 8.5-1 using the χ2 goodness-of-fit test. The null and alternative hypotheses are as follows:

H0: The number of boys in two-child families has a binomial distribution.
HA: The number of boys in two-child families does not have a binomial distribution.

Here we are testing the fit of a distribution to the data on multiple families. We are not testing a hypothesis about the true proportion of boys. When testing the fit to a binomial distribution, we are fitting the results of multiple sets of trials and comparing the frequencies of sets having different numbers of successes to the expectation of the binomial distribution. This is different from using the binomial distribution to test a null hypothesis about a proportion. In a binomial test, we have only one set of trials.
Notice that in this case our null hypothesis does not specify p, the probability that an individual offspring is a boy. This complicates our task slightly, because we must first estimate p from the data before we can calculate the expected frequencies. Here's how we estimate p from the data.
There are 4888 children in the study, a value obtained by multiplying the number of families (2444) by the family size (2). The total number of sons is (2 × 582) + 1332 = 2496. Thus, we can estimate the probability of a child being a boy as

p̂ = 2496/4888 = 0.5106.

Next, we use this estimate of p and the binomial distribution with n = 2 to calculate the expected frequencies under the null hypothesis. For example, the expected fraction of two-child families having exactly one boy is

Pr[1 boy] = (2 choose 1)(0.5106)(1 − 0.5106) = 0.49977.

Thus, the expected frequency of 2444 two-child families having exactly one boy is

Expected[1 boy] = 2444 × 0.49977 = 1221.4.

Table 8.5-2 lists the expected frequencies for all possible outcomes, and Figure 8.5-1 shows expected values alongside the data. Surprisingly, the observed frequencies don't seem to match the frequencies expected under the binomial distribution. There is a shortage of two-child families having either no boys or two boys compared with the expectation, and an excess of families having exactly one boy. The differences between observed and expected frequencies are not huge; but can they be explained by chance, or must the null hypothesis be rejected?

TABLE 8.5-2 Observed and expected number of boys in two-child families.

Number of boys   Observed number of families   Expected number of families
0                530                           585.3
1                1332                          1221.4
2                582                           637.3
Total            2444                          2444.0

FIGURE 8.5-1 The observed number of two-child families with a given number of boys (red) compared with the frequency expected from a binomial distribution (gold). Compared with expected frequencies, there is an excess of two-child families with exactly one boy and a shortage of families with no boys or two boys.

The formula to calculate χ2, first introduced in Section 8.2, gives us

χ2 = (530 − 585.3)^2 / 585.3 + (1332 − 1221.4)^2 / 1221.4 + (582 − 637.3)^2 / 637.3 = 20.04.

Next, we need to calculate the number of degrees of freedom for our test. There are three categories, which would ordinarily leave us with two degrees of freedom. However, we needed to estimate one parameter from the data to generate the expected frequencies (the probability of boys, p). Using this estimate costs us an extra degree of freedom.8 As a result, the number of degrees of freedom is

df = 3 − 1 − 1 = 1.

The critical value for the χ2 distribution having one degree of freedom and a significance level α = 0.05 is 3.84 (see Statistical Table A). Because 20.04 is further into the tail of the distribution than 3.84, we know that P < 0.05; therefore, we reject the null hypothesis. If we probe Statistical Table A further, we find that 20.04 is greater even than the critical value corresponding to α = 0.001, so P < 0.001. A statistics package on the computer gave us a more exact value, P = 7.7 × 10^−6.
These data show that the frequency distribution of the number of boys and girls in two-child families does not match the binomial distribution. This means that one of the assumptions of the binomial distribution is not met in these data. Either the probability of having a son varies from one family to the next, or the individuals within a family are not independent of each other, or both.
What is the reason for the poor fit of the binomial distribution to the number of boys in two-child families? Is the sex of the second child not independent of that of the first, as we've assumed? Are parents manipulating the sex of their children?
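Before taking up those questions, here is a sketch of the whole fit in Python (scipy supplies the binomial probabilities and the tail area; the variable names are ours). Note how the estimated parameter p costs one degree of freedom:

from scipy.stats import binom, chi2

observed = [530, 1332, 582]                     # families with 0, 1, 2 boys
n_fam = sum(observed)                           # 2444 families
p_hat = (1 * 1332 + 2 * 582) / (2 * n_fam)      # estimated Pr[boy] = 0.5106

expected = [n_fam * binom.pmf(k, 2, p_hat) for k in (0, 1, 2)]
chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = 3 - 1 - 1                                  # one parameter (p) estimated from the data
print(chi2_stat, chi2.sf(chi2_stat, df))        # ~20.04, P ~ 7.7e-06

The script confirms the poor fit; the remaining question is biological, not computational.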
One likely explanation is that many parents of two-child families having either no boys or two boys are unsatisfied with not having at least one child of each sex and decide to have a third child, thus "removing" their family from the set of two-child families.

Random in space or time: the Poisson distribution

When the dust settled after the 1980 explosion of Mount St. Helens, spiders were among the first organisms to recolonize the moonscape-like terrain. They dropped out of the airstream and grew fat on insects that arrived in the same way. Let's imagine the frequency distribution of spider landings across the landscape. What would it look like if spider landings were completely "random" in space? The assumptions we need are as follows:
■ The probability that a spider lands at a given point on the continuous landscape is the same everywhere (i.e., they aren't more likely to land some places than others).
■ Whether a spider lands at a given point on the landscape is independent of landings everywhere else (i.e., spiders don't clump together or repel one another).
To count spiders, let's place a large grid across the landscape, breaking it up into equal-sized blocks. (The block size doesn't matter as long as the blocks are large enough to accommodate many potential landing sites.) If both assumptions listed previously are met, then the frequency distribution of the number of spiders landing in blocks will follow a Poisson distribution.

The Poisson distribution describes the number of successes in blocks of time or space, when successes happen independently of each other and occur with equal probability at every instant in time or point in space.

The Poisson distribution is a useful tool for asking whether events or objects occur randomly in continuous time and space. A Poisson distribution is a reasonable expectation for certain biological counts, such as the number of mutations carried by each individual in a population, the number of salmon caught on a given day by each sport fisher, or the number of seeds successfully germinated by each mother plant. For the biologist, the Poisson distribution is just a model for how successes may be distributed in time and space in nature. Life gets interesting when the model doesn't fit the data, because then we learn that one or more of the main assumptions is false, hinting at the existence of interesting biological processes. (For example, some individuals may actually be more prone to mutations than others, some fishers may be better catchers than others, or some plants may produce better-quality seeds.)
The alternative to the Poisson distribution is that successes are distributed in some nonrandom way in time or space. Successes can be clumped, for example, in which case they occur closer together than expected by chance (see the left panel in Figure 8.6-1), or successes can be dispersed, meaning they are spread out more evenly than expected by chance (see the right panel in Figure 8.6-1). A clumped distribution may arise when the presence of one success increases the probability of other successes occurring nearby. Outbreaks of contagious disease, for example, often lead to a clumped spatial distribution of cases, because individuals catch the disease from their neighbors. A dispersed distribution happens when the presence of one success decreases the probability of another success occurring nearby. Territorial animals are often more dispersed in space than would be expected by chance, for example, because individuals chase each other away.
Deviations from the random pattern can therefore help us identify interesting biological processes that create the patterns.

FIGURE 8.6-1 Distributions of points in space that follow a clumped distribution (left), a random distribution (center), or a dispersed distribution (right). In the "random" distribution, each point is independent and has an equal probability of appearing anywhere in the space. If the random graph were divided into a grid of equal-sized squares, the number of points per square would follow a Poisson probability distribution.

Formula for the Poisson distribution

The Poisson distribution was derived by Siméon Denis Poisson,9 a French mathematical physicist. He showed that the probability of X successes occurring in any given block of time or space is

Pr[X successes] = e^(−μ) μ^X / X!,

where μ is the mean number of independent successes in time or space (expressed as a count per unit time or a count per unit space). Here e, the base of the natural log, is a constant approximately equal to 2.718, and X! is X factorial.

Testing randomness with the Poisson distribution

The main use of the Poisson distribution in biology is to provide a null hypothesis to test whether successes occur "randomly" in time or space. In practice, we usually don't know the exact rate at which successes may occur. So, to make predictions about the probability of different outcomes from a Poisson distribution, we must first estimate the rate from the data.

EXAMPLE 8.6 Mass extinctions

Do extinctions occur randomly through the long fossil record of Earth's history, or are there periods in which extinction rates are unusually high ("mass extinctions") compared with background rates? The best record of extinctions through Earth's history comes from fossil marine invertebrates, because they have hard shells and therefore tend to preserve well. Table 8.6-1 lists the number of recorded extinctions of marine invertebrate families in 76 blocks of time of similar duration through the fossil record (Raup and Sepkoski 1982).

TABLE 8.6-1 The frequency of time blocks in the fossil record in which an observed number of marine invertebrate families went extinct.

Number of extinctions (X)   Frequency
0                           0
1                           13
2                           15
3                           16
4                           7
5                           10
6                           4
7                           2
8                           1
9                           2
10                          1
11                          1
12                          0
13                          0
14                          1
15                          0
16                          2
17                          0
18                          0
19                          0
20                          1
>20                         0
Total                       76

If the occurrence of family extinctions is "random" in time through the fossil record, then the number of extinctions per block of time should follow a Poisson distribution. Departures from the Poisson distribution could indicate that extinctions tend to be clumped in time and occur in bursts (mass extinctions). Another possibility is that extinctions are more evenly spread over time than we would expect if they occurred randomly.
The easiest way to test the randomness of family extinctions is to compare the frequency distribution of extinctions to that expected from a Poisson distribution using a χ2 goodness-of-fit test. Our hypotheses are as follows:

H0: The number of extinctions per time interval has a Poisson distribution.
HA: The number of extinctions per time interval does not have a Poisson distribution.

To begin the test, we need to estimate μ, the mean number of extinctions per time interval. This is obtained using the sample mean,

X̄ = [(0 × 0) + (1 × 13) + (2 × 15) + ...] / 76 = 4.21.

(See Section 3.1 to review how to calculate a mean from a frequency table. Remember that there are n = 76 separate data points here, not the smaller number indicated by the number of rows in Table 8.6-1.)
This sample mean (X̄) is used in place of μ in the formula for the Poisson distribution to generate the expected frequencies. We show the calculations next, but first look at the result graphically in Figure 8.6-2.

FIGURE 8.6-2 The frequency distribution of the number of extinctions (histogram) compared with the frequencies expected from the Poisson distribution having the same mean (curve).

The histogram in Figure 8.6-2 gives the observed frequency distribution of extinctions per time interval, whereas the line connects the expected frequencies under the null hypothesis (the Poisson distribution). If you look closely, there is a discrepancy. Compared with the Poisson distribution, the fossil record shows too many time intervals with large numbers of extinctions and too many intervals having very few extinctions. But is the discrepancy between the observed and expected distributions greater than expected by chance? We will use the χ2 goodness-of-fit test to find out.
In Table 8.6-2, we have tabulated the observed and expected frequencies. The expected frequency for all but the last category of extinctions is computed by applying the formula for the Poisson distribution to get the expected probability and then multiplying this probability by the total number of intervals (76) to yield the expected frequency. For example, the expected probability of three extinctions in a time interval is

Pr[3 extinctions] = e^(−X̄) X̄^3 / 3! = e^(−4.21) (4.21)^3 / 3! = 0.1846.

TABLE 8.6-2 The observed frequency distribution of extinctions of marine invertebrate families compared with the number expected under the Poisson distribution.

Number of extinctions (X)   Observed frequency of time intervals   Expected frequency of time intervals
0                           0                                      1.13
1                           13                                     4.75
2                           15                                     10.00
3                           16                                     14.03
4                           7                                      14.77
5                           10                                     12.44
6                           4                                      8.72
7                           2                                      5.24
8                           1                                      2.76
9                           2                                      1.29
≥10                         6                                      0.86
Total                       76                                     76

Multiplying this result by the total number of intervals in the data set (76), we find the expected number of intervals with three extinctions:

Expected[3 extinctions] = 76 × 0.1846 = 14.03.

You may want to try to calculate some of the other expected values in Table 8.6-2 for practice. We have grouped all X ≥ 10 extinctions into the last category because the expected frequency of larger numbers is getting very small. This grouping makes sense, because these are all large values with a similar biological meaning. The expected frequency for the final category is computed by subtracting the sum of all the previous expected values from 76, the total number of time intervals.
Unfortunately, the expected frequencies fail to meet the assumptions of the χ2 test: one of them is less than one, and more than 20% are less than five. When this happens, we can group categories and try again. For example, combining X = 0 and X = 1 into a single class and combining all classes with X ≥ 8 make sense, because the classes grouped are similar. The resulting data, listed in Table 8.6-3, have eight categories.

TABLE 8.6-3 The observed and expected frequencies of time intervals with a given number of extinctions of marine invertebrate families.

Number of extinctions (X)   Observed frequency of time intervals   Expected frequency of time intervals
0 or 1                      13                                     5.88
2                           15                                     10.00
3                           16                                     14.03
4                           7                                      14.77
5                           10                                     12.44
6                           4                                      8.72
7                           2                                      5.24
≥8                          9                                      4.91
Total                       76                                     76

Using the standard formula for the χ2 statistic, we compute

χ2 = (13 − 5.88)^2 / 5.88 + (15 − 10.00)^2 / 10.00 + (16 − 14.03)^2 / 14.03 + ... = 23.93.
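The expected frequencies in Tables 8.6-2 and 8.6-3 and the χ2 sum can be reproduced with a short script. Here is a minimal sketch in Python with scipy (poisson.pmf supplies the Poisson probabilities; the grouping copies Table 8.6-3, and the variable names are ours):

from scipy.stats import poisson

counts = {1: 13, 2: 15, 3: 16, 4: 7, 5: 10, 6: 4, 7: 2,
          8: 1, 9: 2, 10: 1, 11: 1, 14: 1, 16: 2, 20: 1}    # nonzero rows of Table 8.6-1
n = sum(counts.values())                                    # 76 time intervals
mean = sum(x * f for x, f in counts.items()) / n            # ~4.21 extinctions per interval

observed = [counts.get(0, 0) + counts.get(1, 0)]            # grouped class: 0 or 1
observed += [counts.get(x, 0) for x in range(2, 8)]         # classes 2 through 7
observed.append(n - sum(observed))                          # grouped class: >= 8

expected = [n * (poisson.pmf(0, mean) + poisson.pmf(1, mean))]
expected += [n * poisson.pmf(x, mean) for x in range(2, 8)]
expected.append(n - sum(expected))                          # expected >= 8, by subtraction

chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2_stat, 2))                                  # ~23.93

The sketch stops at the test statistic; the degrees of freedom and P-value are worked out in the text that follows.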
We have six degrees of freedom for this test, accounting for the one parameter, μ, that we had to estimate from the data:

df = (Number of categories) − 1 − (Number of parameters estimated from the data) = 8 − 1 − 1 = 6.

The critical value χ2_0.05,6 is 12.59 (see Statistical Table A). Our χ2 statistic of 23.93 is further into the tail of the distribution than this critical value is, so our P-value is less than 0.05. More specifically, we can see that P < 0.001, because 23.93 is also greater than 22.46, the critical value corresponding to α = 0.001. We reject the null hypothesis, therefore, and conclude that extinctions in this fossil record do not fit a Poisson distribution.

Comparing the variance to the mean

How can we describe the way that a pattern deviates from the Poisson distribution? One unusual property of the Poisson distribution is that the variance in the number of successes per block of time (the square of the standard deviation) is equal to the mean (μ). In an observed frequency distribution, if the variance is greater than the mean, then the distribution is clumped. There are more blocks with many successes, and more with few successes, than expected from the Poisson distribution. If the variance is less than the mean, then the distribution is dispersed. The ratio of the variance to the mean number of successes is therefore a measure of clumping or dispersion.
For the extinction data, the sample mean number of extinctions is 4.21. The sample variance in the number of extinctions is

s^2 = [(0 − 4.21)^2 (0) + (1 − 4.21)^2 (13) + (2 − 4.21)^2 (15) + ...] / (76 − 1) = 13.72.

Because the sample variance (13.72) greatly exceeds the sample mean (4.21), the ratio of the variance to the mean (3.26) is greater than 1, and we conclude that the distribution of extinction events in time is highly clumped. That is, extinctions tend to occur in bursts (mass extinctions) rather than randomly or evenly in time.

Summary

■ The χ2 goodness-of-fit test compares the frequency distribution of a discrete or categorical variable with the frequencies expected from a probability model.
■ The χ2 goodness-of-fit test is more general than the binomial test because it can handle more than two categories. It is also easier to compute, even when there are only two categories.
■ Goodness of fit is measured with the χ2 test statistic.
■ The χ2 test statistic has a null distribution that is approximated by the theoretical χ2 distribution. The approximation is excellent as long as no expected frequencies are less than one and no more than 20% of the expected frequencies are less than five. It may be necessary to combine categories to meet these criteria.
■ The theoretical χ2 distribution is a continuous distribution. Probability is measured by the area under the curve.
■ The null hypothesis is rejected at significance level α if the observed χ2 statistic exceeds the critical value of the χ2 distribution corresponding to α.
■ Under the proportional probability model, events fall in different categories in proportion to the number of opportunities. Rejecting H0 implies that the probabilities are not proportional.
■ If trials are independent, and the probability p of a success is the same for each trial, then the frequency distribution of the number of successes should follow a binomial distribution. Rejecting the null hypothesis that the number of successes follows a binomial distribution implies that trials are not independent or that the probability of success is not the same for all trials.
■ The Poisson distribution describes the frequency distribution of successes in blocks of time or space when successes happen independently and with equal probability over time or space. Rejecting a null hypothesis of a Poisson distribution of successes implies that successes are not independent or that the probability of a success occurring is not constant over time or space.
■ Comparing the variance of the number of successes per block of time or space to the mean number of successes measures the direction of departure from randomness in time or space. If the variance is greater than the mean, the successes are clumped; if the variance is less than the mean, successes are more evenly distributed than expected by the Poisson distribution.

Quick Formula Summary

χ2 Goodness-of-fit test
What is it for? Compares observed frequencies in categories of a single variable to the expected frequencies under a random model.
What does it assume? Random samples. Also that the expected count of each cell is greater than one and that no more than 20% of the cells have expected counts less than five.
Test statistic: χ2
Distribution under the null hypothesis: χ2 distributed with df = (Number of categories) − 1 − (Number of parameters estimated from the data).
Formula: χ2 = Σ_i (Observed_i − Expected_i)^2 / Expected_i.

Poisson distribution
What is it for? Describes the number of independent events that occur per unit of time or space.
Formula: Pr[X events] = e^(−μ) μ^X / X!, where X is the number of events and μ is the mean number of events per unit time or space.

PRACTICE PROBLEMS
1. Calculation problem: χ2 goodness-of-fit test to a Poisson distribution. Your friend is writing a computer program to place individuals randomly on a spatial landscape, where every individual is placed independently of all the others and probability is equal everywhere. He finds that many of the individuals land near each other and many other areas are empty, and he becomes concerned that the program is not behaving as intended. You offer to check his results against the Poisson distribution, which is the expected distribution for the number of individuals in equal-area blocks placed over the landscape according to his assumptions. The following data show the number of individuals placed by the program into 200 such blocks. We've ordered the numbers for your convenience.
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3
a. Explain why the Poisson distribution is the appropriate distribution to compare these results to.
b. Write the null and alternative hypotheses for this test.
c. Make a frequency table for the data.
d. Calculate the mean number of individuals per area in the data.
e. Using this mean, calculate the probability of 0, 1, 2, and 3 individuals per block assuming a Poisson distribution.
f. Calculate the expected numbers of blocks with zero to three individuals.
g. Are the expected values from part (f) suitable for the χ2 goodness-of-fit test?
Consider the requirements of the χ2 test for the minimum expected values.
h. Combine categories (if necessary) to meet the minimum expected value requirements for the χ2 test.
i. How many degrees of freedom will this test have?
j. Calculate the χ2 test statistic with the observed and expected values.
k. Find the critical value and determine an approximate P-value for this test. (Here and always, provide an exact P if you are using a computer to answer this question.)
l. Interpret this result in terms of the original hypotheses and the question asked of the data.
2. The parasitic nematode Camallanus oxycephalus infects many freshwater fish, including shad. The following table gives the number of nematodes per fish (Shaw et al. 1998).

Number of nematodes   Number of fish
0                     103
1                     72
2                     44
3                     14
4                     3
5                     1
6                     1

a. Produce a graph of the data. What type of graph is most appropriate?
b. Calculate the frequencies expected if nematodes infect fish "at random" (i.e., independently and with equal probability).
c. Overlay the expected frequencies onto your graph. What are the noticeable differences?
d. Is there evidence that nematodes do not worm their way into the fish at random? Here and always, show all four steps of hypothesis testing.
3. Luijckx et al. (2012) discovered that resistance to the bacterial parasite Pasteuria ramosa is genetically variable in the common freshwater crustacean Daphnia magna. To investigate the genetic basis of this variation, they crossed a completely resistant lineage to a completely susceptible lineage. All the F1 offspring were resistant. These offspring, when mature, were then crossed to each other to produce an F2 generation. If resistance is the result of only a single gene with two forms (alleles), then resistant and susceptible F2 offspring should occur in a 3:1 ratio. Of 71 F2's tested, 57 were resistant and 14 were susceptible.
a. With these data, calculate the range of most-plausible values for the proportion of resistant offspring. Does the plausible range include the proportion predicted if resistance is determined by a single gene?
b. Give two other values for the proportion that are also consistent with the data.
c. Test the genetic hypothesis. Are the results compatible with your findings in part (a)?
d. On the basis of these results, is it correct to conclude that a single gene in Daphnia magna underlies resistance to the bacterium?
4. Soccer reaches its apex every four years at the World Cup, attracting worldwide attention and fanatic devotion. The World Cup is widely thought to be the event that decides the best soccer team in the world. But how much do skill differences determine the outcome? If the probability of a goal is the same for all teams and games, and if goals are independent, then we would expect the frequency distribution of goals per game to approximate a Poisson distribution. In contrast, if skill differences really matter, then we would expect more high scores and more low scores than predicted from the Poisson distribution. The following table tallies the number of goals scored by one of the two teams (randomly chosen) in every game of the knockout round of the World Cup from 1986 through 2010.

Number of goals   Frequency
0                 37
1                 44
2                 21
3                 10
4                 4
5                 1
>5                0
Total             112

a. Plot the frequency distribution of goals per team using the data in the table.
b. What is the mean number of goals per game?
c.
c. Using the Poisson distribution, calculate the expected frequency of games in which the randomly chosen team scored 0, 1, 2, …, 5, or more than 5 goals, assuming independence and equal probability of scoring.
d. Overlay the expected frequencies calculated in part (c) on the graph you created in part (a). Do they appear similar?
e. If skill differences do not matter, would you expect the variance in the number of goals per team per game to be less than, equal to, or greater than the mean number of goals? Calculate the variance in the number of goals per team per game. How similar is it to the mean?

5. Each of the following examples could be addressed with a goodness-of-fit test. From the information given, how many categories and how many degrees of freedom would each test have? Explain your answers.
a. A die is rolled 50 times to test whether it is fair—whether it has a 1/6 chance of coming up on each of its six different sides.
b. A set of 10 coins is flipped, and the number of heads is recorded. This experiment is repeated with the same coins 1000 times. The test compares the frequency of heads to a binomial distribution with p = 0.5.
c. The scenario is the same as in part (b), except now the question is whether the frequency of heads follows a binomial distribution (p not specified).
d. A food-protection agency counts the number of insect heads found per 100-gram batch of wheat flour. The researchers have 500 batches, and they want to know whether the frequency of insect heads in batches follows a Poisson distribution. The 500 batches included at least five batches having each of 0, 1, 2, 3, or 4 insect heads. No batches had more than four heads.

6. One thousand coins were each flipped eight times, and the number of heads was recorded for each coin. The results are as follows:

Number of heads    Number of coins
0                    6
1                   32
2                  105
3                  186
4                  236
5                  201
6                   98
7                   33
8                  103

a. Test whether the distribution of coin flips matches the expected frequencies from a binomial distribution, assuming all fair coins. (A coin is fair if the probability of heads per flip is 0.5.)
b. If the binomial distribution is a poor fit to the data, identify in what way the distribution does not match the expectation.
c. Some two-headed coins (which always show heads on every flip) were mixed in with the fair coins. Can you say approximately how many two-headed coins there might have been out of this 1000?

7. Practice Problem 4 from Chapter 7 gave data about the death rates of people working on the movie The Conqueror. Test whether the cancer rates in this group were different from the expected rate of 14%.

8. Imagine that a small hospital’s emergency room has an average of 20 admissions per Saturday night. If you were a doctor working overtime on such a Saturday night, you might want to know the probability of having a quiet night, one that would let you catch up on some much-needed sleep. Let’s call a quiet night one in which five or fewer admissions take place. What is the chance that you get some sleep? Assume that admissions are independent of one another and are just as likely to occur at one instant in time as another on a Saturday night.

9. The following list gives the number of degrees of freedom and the χ² test statistic for several goodness-of-fit tests. Find the P-value for each test as specifically as possible from Statistical Table A. If you can, find the P-values more exactly by using a computer program.

Degrees of freedom    χ²
1                      4.12
4                      1.02
2                      9.50
10                    12.40
1                      2.48

10. Practice Problem 14 from Chapter 7 gave data about death rates from cancer before and after Christmas. Use these data to test whether the holiday affects death rates.
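Problems like Problem 8 call for summing a series of Poisson probabilities. A short script (a sketch, using the same mean of 20 admissions as in the problem) can be used to check such sums done by hand:

```python
# Checking a cumulative Poisson probability by summing terms: with a mean
# of 20 admissions, what is the chance of 5 or fewer on a given night?
from scipy import stats

mu = 20
p_quiet = sum(stats.poisson.pmf(k, mu) for k in range(6))  # Pr[0] + ... + Pr[5]
print(p_quiet)                    # equivalently, stats.poisson.cdf(5, mu)
```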
ASSIGNMENT PROBLEMS

11. If each “success” happens independently of all other successes and with the same probability, what probability distribution is expected for each of the following?
a. Number of flowers in square-meter blocks in an alpine field
b. Number of heads out of 10 flips of a coin
c. Number of bombs landing in city blocks in London in World War II
d. Daily number of hits on a website
e. Annual number of elephant attacks on humans in Serengeti National Park
f. Number of red flowers in sets of 100 flowers in a field of multiple types of flowers

12. The white “Spirit” black bear (or Kermode), Ursus americanus kermodei, differs from the ordinary black bear by a single amino acid change in the melanocortin 1 receptor gene (MC1R). In this population, the gene has two forms (or alleles): the “white” allele b and the “black” allele B. The trait is recessive: white bears have two copies of the white allele of this gene (bb), whereas a bear is black if it has one or two copies of the black allele (Bb or BB). Both color morphs and all three genotypes are found together in the bear population of the northwest coast of British Columbia. If possessing the white allele has no effect on growth, survival, reproductive success, or mating patterns of individual bears, then the frequency of individuals with 0, 1, or 2 copies of the white allele (b) in the population will follow a binomial distribution. To investigate, Hedrick and Ritland (2011) sampled and genotyped 87 bears from the northwest coast: 42 were BB, 24 were Bb, and 21 were bb. Assume that this is a random sample of bears.
a. Calculate the fraction of b alleles in the population (remember, each bear has two copies of the gene).
b. With your estimate of the fraction of b alleles, and assuming a binomial distribution, calculate the expected frequency of bears with 0, 1, and 2 copies.
c. Compare the observed and expected frequencies in a graph. Describe how they differ.

13. Refer to Assignment Problem 12. A formal hypothesis test was carried out to compare the observed and expected frequencies of genotypes. The procedure obtained P = 0.0001. Answer the following questions:
a. What are the null and alternative hypotheses?
b. What are the degrees of freedom for the test statistic?
Say whether each of the following statements is true or false solely on the basis of these results.
c. The observed frequencies are compatible with a binomial distribution.
d. The difference between the observed and expected frequencies is statistically significant.
e. The test statistic exceeds the critical value corresponding to α = 0.05.
f. The test statistic exceeds the critical value corresponding to α = 0.01.
g. The difference is large between the true genotype frequencies in the bear population and those expected under the binomial distribution.

14. In North America, between 100 million and 1 billion birds die each year by crashing into windows on buildings, more than any other human-related cause. This figure represents up to 5% of all birds in the area. One possible solution is to construct windows angled downward slightly, so that they reflect the ground rather than an image of the sky to a flying bird. An experiment by Klem et al. (2004) compared the number of birds that died as a result of vertical windows, windows angled 20 degrees off vertical, and windows angled 40 degrees off vertical.
The angles were randomly assigned with equal probability to six windows and changed daily; assume for this exercise that windows and window locations were identical in every respect except angle. Over the course of the experiment, 30 birds were killed by windows in the vertical orientation, 15 were killed by windows set at 20 degrees off vertical, and 8 were killed by windows set at 40 degrees off vertical.
a. Clearly state an appropriate null hypothesis and an alternative hypothesis.
b. What proportion of deaths occurred while the windows were set at a vertical orientation?
c. What statistical test would you use to test the null hypothesis?
d. Carry out the statistical test from part (c). Is there evidence that window angle affects the mortality rates of birds?

15. In the 19th century, cavalries were still an important part of the European military complex. While horses have many wonderful qualities, they can be dangerous beasts, especially if poorly treated. The Prussian army kept track of the number of fatalities caused by horse kicks to members of 10 of their cavalry regiments over a 20-year time span. If these fatalities occurred independently and with equal probability for each regiment, then the number of deaths by horse kick per regiment per year should follow a Poisson distribution. On the other hand, if some regiments during some years consisted of particularly bad horsemen, then the events would not occur with equal probability, in which case we would expect a frequency distribution different from the Poisson distribution. The following table shows the data, expressed as the number of fatalities per regiment-year (Bortkiewicz 1898).

Number of deaths (X)    Number of regiment-years
0                       109
1                        65
2                        22
3                         3
4                         1
>4                        0
Total                   200

a. What is the mean number of deaths from horse kicks per regiment-year?
b. Test whether a Poisson distribution fits these data.

16. Are the outcomes of hospital care different on weekends than on weekdays? In a random sample of 500 patients who experienced severe medical complications after admission to acute care wards in three U.S. states from 1999 to 2001, 119 had been admitted on a weekend and 381 had been admitted on a weekday. This compares with a large population of people at risk for such complications in which 14.8% are admitted on weekends and 85.2% are admitted on weekdays (Bendavid et al. 2007).
a. In the 500 sampled patients with severe complications, what fraction had been admitted on weekends? Is this higher or lower than the fraction of all at-risk patients admitted on weekends?
b. Name two statistical methods that could be used to test whether the probability of severe complications in at-risk patients admitted to hospitals differs between weekend and weekday. State the advantages and disadvantages of both.
c. State the null and alternative hypotheses for such a test.
d. Test the hypotheses. State your conclusions clearly.

17. Truffles are a great delicacy, sending thousands of mushroom hunters into the forest each fall to find them. A set of plots of equal size in an old-growth forest in Northern California was surveyed to count the number of truffles (Waters et al. 1997). The resulting distribution is presented in the following table. Are truffles randomly located around the forest? If not, are they clumped or dispersed? How can you tell? (The mean number of truffles per plot, calculated from these data, is 0.60.)

Number of truffles per plot    Frequency
0                              203
1                               39
2                               18
3                               13
>3                              15
18. The anemonefish Amphiprion akallopisos lives in small groups that defend territories consisting of a cluster of sea anemones, among the tentacles of which the anemonefish live (think Nemo). In a field study of the species at Aldabra Atoll in the Indian Ocean, Fricke (1979) noticed that each territory tended to have several males but just one female. Based on his counts, 20 territories of exactly four adult fish would have the following frequency distribution of female numbers.

Number of females    Number of males    Number of territories
0                    4                    0
1                    3                   20
>1                   <3                   0
Total                                    20

a. Estimate the mean number of females per territory having four fish. Provide a standard error for this estimate.
b. Does the number of females in territories having four fish have a binomial distribution? Show all steps in carrying out your test.
c. If the number of females in territories does not have a binomial distribution, what is the likely statistical explanation (i.e., what assumption of the binomial distribution is likely violated)?
d. Optional: Can you suggest a biological explanation for a non-binomial pattern?

19. Hurricanes hit the United States often and hard, causing some loss of life and enormous economic costs. They are ranked in severity by the Saffir–Simpson scale, which ranges from Category 1 to Category 5, with Category 5 being the worst. In some years, as many as three hurricanes that rate a Category 3 or higher hit the U.S. coastline. In other years, no hurricane of this severity hits the United States. The following table lists the number of years that had 0, 1, 2, 3, or more hurricanes of at least Category 3 in severity, over the 100 years of the 20th century (Blake et al. 2005):

Number of hurricanes of Category 3 or higher    Number of years
0                                               50
1                                               39
2                                                7
3                                                4
>3                                               0

a. What is the mean number of severe hurricanes to hit the United States per year?
b. What model would describe the distribution of hurricanes per year, if they were to hit independently of each other and if the probability of a hurricane were the same in every year?
c. Test the fit of the model from part (b) to the data.

20. In snapdragons, variation in flower color is determined by a single gene (Hartl and Jones 2005). RR individuals are red, Rr (heterozygous) individuals are pink, and rr individuals are white. In a cross between heterozygous individuals, the expected ratio of red-flowered : pink-flowered : white-flowered offspring is 1:2:1.
a. The results of such a cross were 10 red-, 21 pink-, and 9 white-flowered offspring. Do these results differ significantly (at a 5% level) from the expected frequencies?
b. In another, larger experiment, you count 100 times as many flowers as in the experiment in part (a) and get 1000 red, 2100 pink, and 900 white. Do these results differ significantly from the expected 1:2:1 ratio?
c. Do the proportions observed in the two experiments [i.e., in parts (a) and (b)] differ? Did the results of the two hypothesis tests differ? Why or why not?

21. A more recent study of Feline High-Rise Syndrome (FHRS) (see Chapter 1, Example 1.2) included data on the month in which each of 119 cats fell (Vnuk et al. 2004). The data are in the accompanying table. Can we infer that the rate of cat falling varies between months of the year?

Month        Number fallen
January        4
February       6
March          8
April         10
May            9
June          14
July          19
August        13
September     12
October       12
November       7
December       5

22. Consider an isolated population of humans in which some individuals are infected with a specific parasite species (e.g., malaria or a filarial worm).
Think of two biological hypotheses for why the number of parasite individuals per person may not be well described by a Poisson distribution. Which assumptions of the Poisson distribution are likely violated by the process you propose, and how would the frequency distribution likely be affected?

23. Spot the flaw. Tabershaw and Lamm (1977) compared the observed and expected numbers of different leukemia types in a study group of workers who had been exposed to benzene during their employment. They were testing a previous suggestion that exposure to benzene increases the probability of acute leukemia while not changing the occurrence of other leukemia types. Their expected numbers are based on the relative frequencies of these diagnoses in the population as a whole. The numbers are presented in the following table.

Type of leukemia        Observed    Expected
Chronic lymphocytic     0           2
Chronic myelogenous     1           1
Monocytic               2           1
Acute                   4           3

The researchers applied a χ² goodness-of-fit test to the data and calculated a χ² value of 3.33, with df = 3. From the information given, what is the largest error made in this analysis?

24. Seedlings of the parasitic plant Cuscuta pentagona (dodder) hunt by directing growth preferentially toward nearby host plants. Lacking eyes, or even a nervous system, how do they detect their victims? To investigate the possibility that the parasite detects volatile chemicals produced by host plants, Runyon et al. (2006) placed individual dodder seedlings into a vial of water at the center of a circular paper disc. A chamber containing volatile extracts from tomato (a host plant) in solvent was placed at one edge of the disc, whereas a control chamber containing only solvent was placed at the opposite end. The researchers divided the disc into equal-area quadrats to record in which direction the seedlings grew. Of 30 dodder plants tested, 17 seedlings grew toward the volatiles, 2 grew away (toward the solvent), 7 grew toward the left side, and 4 grew toward the right side.
a. Graph the relative frequency distribution for these results. What type of graph is ideal?
b. What are the relative frequencies expected if the parasite is unable to detect the plant volatiles or any other cues present? Add these expected relative frequencies to your graph in part (a).
c. Using these data, calculate the fraction of seedlings that grow toward the volatiles. What does this fraction estimate?
d. Provide a standard error for your estimate. What does this standard error represent?
e. Calculate the range of most-plausible values for the fraction of dodder seedlings that grow toward the volatiles under these experimental conditions. Does it include or exclude the fraction expected if the parasite is unable to detect plant volatiles or other cues present?

25. Refer to Assignment Problem 24. The researchers used gas chromatographic analysis to extract and identify eight major volatile chemicals present in the host plants (tomato). They tested each of these chemicals separately using the same experimental design to determine whether dodder seedlings would preferentially orient their growth. Of 34 dodder plants tested with one of these chemicals, α-pinene (also present in pine resin, as its name suggests), 11 seedlings grew toward the volatiles, 6 grew away, 8 grew toward the left side, and 9 grew toward the right side. The authors compared these observed frequencies to those expected under the null hypothesis. Their test statistic was χ² = 1.53. They obtained P = 0.68. Answer the following questions:
a. What are the null and alternative hypotheses?
b. How many degrees of freedom does their χ² test statistic have?
Based on these results, state whether each of the following statements is true or false:
c. The observed frequencies are compatible with the proportional probability model.
d. The difference between the observed and expected frequencies is statistically significant.
e. The test statistic exceeds the critical value corresponding to α = 0.05.
f. Dodder plants do not orient their growth toward the plant volatile α-pinene.
g. There is no evidence that dodder plants orient their growth toward α-pinene.
h. The difference between the proportion of individuals that grow toward α-pinene and the proportion that grow away from α-pinene in the dodder population is small.

INTERLEAF 5
Making a plan

Too often, experimenters do not carefully consider statistical issues until after the study is completed and the data are in hand. Sometimes a flaw in the experimental design becomes obvious only then, when they try to analyze the data. As Fisher once said, “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.” To ensure that your experiment is given a statistical clean bill of health and not a toe tag, it’s important to plan the experiment carefully with statistics in mind and to follow that plan throughout the data collection. Here are some guidelines to avoid a few common pitfalls. Chapter 14 delves into some of these issues in more detail than is possible here. For now, we list a few sensible procedures to help get you started.

1. Develop a clear statement of the research question. This needs to be as specific as possible. What is the scientific hypothesis? Is the question interesting? Has it already been addressed sufficiently in the literature? Identify clear objectives for the experiment.

2. List the possible outcomes of your experiment. Once you have a preliminary plan for the treatments you want to compare, think of the outcomes you might obtain. Can you draw firm conclusions no matter what the outcome? Do these conclusions answer the questions? If not, then modify your design.

3. Develop an experimental plan. Write it down. Let it sit for a few days and then review it again.

4. Keep the design of your experiment as simple as possible. Do you really need 12 different treatments, or will two suffice? Simplifying the design will make it easier to keep track of your objectives, and it will avoid the need for complex statistical analyses.

5. Check for common design problems. Is there replication of treatments? Are these replicates truly independent? Will your sampling method yield random samples? Can you identify any confounding variables that will complicate the interpretation of the results?

6. Is the sample size large enough? Avoid getting to the end of an experiment before discovering that your sample size is only large enough to demonstrate an unrealistically large effect. Is the sample size sufficient to produce a confidence interval narrow enough to permit conclusions, regardless of the size of the treatment effect? Chapter 14 discusses some methods to help in this planning, and a rough calculation of the kind sketched at the end of this interleaf can flag a hopelessly small study early.

7. Discuss the design with other people. Many brains think better than one, and others will often see a problem (and hopefully a solution) that wasn’t obvious to you. It is better to get that feedback before doing all the work than to be told after the fact, when it’s too late to do anything about it. Science is a social process, so take advantage of the brainpower you have around you.

By carefully considering these guidelines before starting the experiment, you can avoid a lot of wasted effort.
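To make point 6 concrete, here is one rough back-of-the-envelope calculation, written as a short Python sketch. It asks how many observations are needed for a 95% confidence interval of a proportion to reach a planned width; the guessed proportion and target width are hypothetical placeholders, and Chapter 14 treats planning more carefully.

```python
# Rough planning sketch: sample size needed so that a 95% confidence
# interval for a proportion has a chosen half-width. Both inputs are
# hypothetical placeholders for planning purposes.
from math import ceil

p_guess = 0.3       # guess at the proportion, from a pilot study or literature
half_width = 0.05   # desired half-width of the 95% confidence interval

# Approximate 95% CI: estimate +/- 1.96 * sqrt(p(1 - p)/n). Solving the
# half-width equation for n:
n = ceil(1.96**2 * p_guess * (1 - p_guess) / half_width**2)
print(n)            # about 323 observations for these planning values
```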
9
Contingency analysis: associations between categorical variables

Gobiodon erythrospilus

Biologists are keenly interested in associations between variables and differences between groups. Contingency tables (see Section 2.3) display how the frequencies of different values for one variable depend on the value of another variable when both are categorical. In this chapter, we analyze sample data for two categorical variables to infer associations between those variables in populations. We want to determine to what extent one variable is “contingent” on the other.

Analysis of contingency data can be used to answer questions such as the following:
■ Do bright and drab butterflies differ in the probability of being eaten?
■ How much more likely are smokers to drink than nonsmokers?
■ Are heart attacks less likely among people who take aspirin daily?

In experimental studies, contingency data can help us establish whether the probability of living or dying differs between medical treatments. We can estimate the differences in these probabilities with odds ratios and with relative risk, which are explained in Sections 9.2 and 9.3. We can test hypotheses about differences in the probabilities using a contingency test. Contingency analysis allows us to determine whether, and to what degree, two (or more) categorical variables are associated. In other words, a contingency analysis helps us to decide whether the proportion of individuals falling into different categories of a response variable is the same for all groups.

Contingency analysis estimates and tests for an association between two or more categorical variables.

At the heart of contingency analysis is the investigation of the independence of variables. If two variables are independent, then the state of one variable tells us nothing about the probability of the different values of the other variable.

Associating two categorical variables

An association between two categorical variables implies that the two variables are not independent. During the RMS Titanic disaster, for example, women had a lower probability of death than men. Sex and death were not independent; the sex of an individual predicts to some extent his or her probability of death. The mosaic plot on the left in Figure 9.1-1 shows the relationship between sex and death. If death had been independent of sex, then the probability of death would have been equal for both sexes, and the resulting mosaic plot would look like the one on the right in Figure 9.1-1.

FIGURE 9.1-1 Left: Mosaic plot depicting the death of adult men and women passengers following the shipwreck of the Titanic. Survivors are represented by the gold areas and those that died by the red areas. The area of each box is proportional to the number of individuals in the sample with those attributes; n = 2092 individuals from data in Dawson (1995). Right: This is what the mosaic plot would have looked like if death and sex on the Titanic were perfectly independent. In reality, the probability of death differed between the sexes, so the two variables are not independent.
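Independence is easy to check informally in a table of counts: divide each cell by its group total and compare the conditional proportions across groups. A small sketch (with invented counts, not the Titanic data):

```python
# Under independence, the conditional proportion of the outcome is the same
# in every group. The counts below are invented for illustration.
table = {"group A": {"died": 30, "survived": 70},
         "group B": {"died": 60, "survived": 140}}

for group, cells in table.items():
    total = cells["died"] + cells["survived"]
    print(group, cells["died"] / total)   # both print 0.3: no association
```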
Estimating association in 2 × 2 tables: odds ratio

The odds ratio measures the magnitude of association between two categorical variables when each variable has only two categories. One of the variables is the response variable—let’s call its two categories “success” and “failure,” where success just refers to the focal outcome of interest. The other variable is the explanatory variable, whose two categories identify the two groups whose probability of success is being compared. The odds ratio compares the proportion of successes and failures between the two groups.

Odds

Consider a variable for which a single random trial yields one of two possible outcomes: success or failure. The probability of success is p and the probability of failure is 1 − p. The odds of success (O) are the probability of success divided by the probability of failure:

O = p / (1 − p).

The odds of success are the probability of success divided by the probability of failure.

If O = 1 (sometimes written as 1:1 or “the odds are one to one”), then one success occurs for every failure. If the odds are 10 (sometimes written as 10:1), then 10 trials result in success for every one that results in failure. The odds are estimated from a random sample of trials using the observed proportion of successes (p̂) as follows:

Ô = p̂ / (1 − p̂).

Example 9.2 shows how to estimate odds from sample data.

EXAMPLE 9.2 Take two aspirin and call me in the morning?

Aspirin, the medicine commonly used for headache and fever, has been shown to reduce the risk of stroke and heart attack in susceptible people. Observational studies have suggested that aspirin may also reduce the risk of cancer. A large, carefully designed experimental study was conducted to test this possibility (Cook et al. 2005). A total of 39,876 women were randomly assigned one of two different treatments. Of these, 19,934 women received 100 mg of aspirin every other day. The other 19,942 women received a placebo, a chemically inert pill that gives the patient the experience of taking the medication without the chemical effects. The women did not know which treatment they received. The women were monitored for 10 years. During the course of the study, 1438 of the women on aspirin and 1427 of those receiving the placebo were diagnosed with invasive cancer (Table 9.2-1). The results are depicted in a mosaic plot in Figure 9.2-1.

TABLE 9.2-1 2 × 2 contingency table for the aspirin and cancer experiment.

             Aspirin    Placebo
Cancer         1438       1427
No cancer    18,496     18,515

FIGURE 9.2-1 A mosaic plot showing the results of the study comparing cancer rates in women who took aspirin with women who did not take aspirin; n = 39,876.

A glance at the mosaic plot suggests that cancer rates did not change much, if at all, as a result of taking aspirin. Let’s focus on the outcome “getting cancer” and estimate the odds of this outcome in the two groups of women. In the aspirin group (group 1), the estimated proportion that got cancer is

p̂1 = 1438/19,934 = 0.0721.

We have added the subscript “1” to identify the group. The estimated proportion of women in this group who did not get cancer is

1 − p̂1 = 1 − 0.0721 = 0.9279.

So, the estimated odds of developing cancer while taking aspirin are

Ô1 = p̂1 / (1 − p̂1) = 0.0721/0.9279 = 0.0777.

The odds of getting cancer while on aspirin are about 0.08:1, or about 1:13. In common speech, we would say that the odds are 13 to 1 that a woman who took aspirin would not get cancer in the next 10 years. Similarly, the estimated probability that a woman on the placebo developed cancer is

p̂2 = 1427/19,942 = 0.0716.

So, the odds of a woman on the placebo getting cancer are

Ô2 = p̂2 / (1 − p̂2) = 0.0716/0.9284 = 0.0771,

which is also about 1:13, similar to that in the aspirin group.
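These odds calculations are easily reproduced in a few lines of code; a sketch in Python:

```python
# Estimated odds of cancer in the two groups of Example 9.2.
p1 = 1438 / 19934              # proportion with cancer, aspirin group
p2 = 1427 / 19942              # proportion with cancer, placebo group

odds_aspirin = p1 / (1 - p1)   # about 0.0777
odds_placebo = p2 / (1 - p2)   # about 0.0771
print(odds_aspirin, odds_placebo)
```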
Odds ratio

We can use the odds ratio to quantify the difference between the odds of women developing cancer on aspirin and on the placebo. The odds ratio (OR) is just what it sounds like: the ratio of the odds of success between two groups. If O1 is the odds of success in one group and O2 is the odds in the other group, then the odds ratio is

OR = O1/O2.

The odds ratio is the odds of success in one group divided by the odds of success in a second group.

If the odds ratio is equal to one, then the odds of success in the response variable are independent of treatment; the odds of success are the same for both groups. If the odds ratio is greater than one, then the event has higher odds in the first group than in the second group. Alternatively, if the odds ratio is less than one, then the odds are higher in the second group. The odds ratio is commonly calculated in medical research, where it is used to measure the change in the odds for a response variable resulting from medical intervention compared with a control treatment (the explanatory variable).

For the cancer/aspirin study described in Example 9.2, the estimated odds ratio is given by

ÔR = Ô1/Ô2 = 0.0777/0.0771 = 1.008.

(The “hat” on ÔR in the preceding equation indicates that it is an estimate of the population OR.) The odds of developing cancer while taking aspirin were about the same as the odds while taking the placebo. The estimated odds ratio is slightly greater than one, which means that in the data, the odds of getting cancer were slightly higher in the aspirin group than in the placebo group.

The following is a shortcut formula:

ÔR = (a/c)/(b/d) = ad/(bc),

where the symbols a, b, c, and d refer to the observed frequencies in the cells of the contingency table:

                                 Treatment    Control
Success (focal outcome)              a           b
Failure (alternative outcome)        c           d

When we apply this shortcut to the cancer data,

             Aspirin        Placebo
Cancer       a = 1438       b = 1427
No cancer    c = 18,496     d = 18,515

we get

ÔR = ad/(bc) = (1438)(18,515) / [(1427)(18,496)] = 1.009.

Our answer (1.009) here is slightly different from the one calculated earlier (1.008) because the shortcut reduces round-off error in the calculation of the odds ratio.

Standard error and confidence interval for odds ratio

The sampling distribution for the odds ratio is highly skewed, and so we convert the odds ratio to its natural log, ln(ÔR). We can calculate the standard error of the log-odds ratio as

SE[ln(ÔR)] = √(1/a + 1/b + 1/c + 1/d).

The symbols a, b, c, and d in this equation refer to the observed frequencies in the cells of the contingency table shown earlier. An approximate 100(1 − α)% confidence interval for the log-odds ratio is then given by

ln(ÔR) − Z × SE[ln(ÔR)] < ln(OR) < ln(ÔR) + Z × SE[ln(ÔR)],

where Z = 1.96 for a 95% confidence interval and Z = 2.58 for a 99% confidence interval. This formula for the confidence interval is an approximation that assumes the sample size is fairly large. To find the confidence interval for the odds ratio, rather than the log-odds, we must take the antilog of the upper and lower limits of the interval for the log-odds ratio.

Let’s calculate the 95% confidence interval for the aspirin data. We’ve already calculated ÔR = 1.009, so ln(ÔR) = 0.00896. The standard error of this estimate is

SE[ln(ÔR)] = √(1/1438 + 1/1427 + 1/18,496 + 1/18,515) = 0.03878.

With this standard error, we can calculate the 95% confidence interval.
Using Z = 1.96 for a 95% confidence interval, we get

0.00896 − 1.96(0.03878) < ln(OR) < 0.00896 + 1.96(0.03878).

The 95% confidence interval for the population log-odds ratio is therefore

−0.067 < ln(OR) < 0.085.

To convert this to a confidence interval for the odds ratio, we must take the antilog of the limits of this interval by raising e to the power of each number:

e^(−0.067) < OR < e^(0.085),

or

0.93 < OR < 1.09.

The confidence interval for the odds ratio is tightly bounded around 1.00, so the data provide evidence that aspirin has little or no effect on the probability of developing cancer. The data are plausibly consistent with a small beneficial effect, a small deleterious effect, or no effect at all.

Estimating association in 2 × 2 tables: relative risk

Relative risk is another commonly used measure of the association between two categorical variables when both have just two categories. It is especially appropriate when comparing the probability (risk) of a focal outcome between two treatments or groups. As the name implies, the focal outcome is usually the rarer and less desirable outcome. For example, we might use relative risk to estimate and compare the probabilities of sudden infant death syndrome between infants sleeping facedown and infants sleeping on their backs.

The relative risk is the probability of the focal outcome in the treatment group divided by the probability in the control group. If p̂1 and p̂2 are the estimates of the probability of the undesirable outcome in group 1 (treatment) and group 2 (control), respectively, then we calculate the relative risk, RR, as the ratio

R̂R = p̂1/p̂2.

Relative risk is the probability of an undesired outcome in the treatment group divided by the probability of the same outcome in a control group.

Let’s use the data in Example 9.2 to calculate the relative risk of cancer for women who take supplemental aspirin compared to those who do not. For women on the aspirin treatment, we calculated the estimated proportion of women with cancer as p̂1 = 0.0721. The proportion of women getting cancer in the placebo group was estimated as p̂2 = 0.0716. Putting the treatment group (with aspirin) in the numerator, we calculate the relative risk as

R̂R = 0.0721/0.0716 = 1.007.

The estimated probability of getting cancer is very similar in these two groups, hence the relative risk is close to one. The estimate is that the cancer rate is only slightly higher in the aspirin treatment group than in the control group; therefore, the relative risk is slightly greater than one. If aspirin had reduced the risk of cancer, we would have seen a relative risk less than one. Calculations for standard errors and confidence intervals for relative risk are provided in the Quick Formula Summary on p. 256.

Odds ratio vs. relative risk

Which method is best for measuring association between two categorical variables: odds ratio or relative risk? Both methods provide a measure of the magnitude of the difference between two groups in the probability of a focal outcome, and both are used frequently in analyses of biological data. Relative risk, being simply the ratio of two proportions, is considered by many to be more intuitive than the odds ratio. But you may have noticed that, when applied to the cancer and aspirin data, the values for R̂R and ÔR were almost identical (1.007 and 1.009, respectively). The values for odds ratio and relative risk will be similar whenever the focal outcome is rare, as is the probability of cancer in women in the aspirin study.
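The whole calculation—odds ratio, standard error, confidence interval, and relative risk—can be scripted directly from the formulas above. A sketch for the aspirin data (the point is the arithmetic, not any particular statistics package):

```python
# Reproducing the odds-ratio confidence interval and relative risk for the
# aspirin data of Example 9.2, using the shortcut a, b, c, d notation.
from math import log, sqrt, exp

a, b = 1438, 1427          # cancer: aspirin, placebo
c, d = 18496, 18515        # no cancer: aspirin, placebo

or_hat = (a * d) / (b * c)                 # odds ratio, about 1.009
se_ln_or = sqrt(1/a + 1/b + 1/c + 1/d)     # SE of ln(OR), about 0.0388

lo = exp(log(or_hat) - 1.96 * se_ln_or)    # about 0.93
hi = exp(log(or_hat) + 1.96 * se_ln_or)    # about 1.09
print(or_hat, lo, hi)

# Relative risk: about 1.008 here (the text's 1.007 comes from using the
# rounded proportions 0.0721 and 0.0716).
rr_hat = (a / (a + c)) / (b / (b + d))
print(rr_hat)
```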
One advantage of the odds ratio is that it can be applied to data from case-control studies. A case-control study is a method of observational study in which a sample of individuals having a disease or other focal condition (the “cases”) is compared to a second sample of individuals who do not have the condition (the “controls”) but are otherwise similar in other characteristics that might also influence the results. The study allows investigators to examine whether the two samples differ in their exposure to one or more possible causal factors. In a case-control study, the total numbers of cases and controls in the samples are chosen by the experimenter rather than by sampling at random in the population, and thus the numbers of individuals with and without the disease or focal condition are not necessarily proportional to the frequency of the disease in the population. As a result, we cannot estimate risk. We can, however, ask whether the focal outcome is associated with another variable by using an odds ratio.

EXAMPLE 9.3 Your litter box and your brain

Toxoplasma gondii is a protozoan parasite that can infect the brains of many birds and mammals, including humans. Toxoplasma’s main hosts are cats, and humans may acquire the parasite via contact with cat feces. Roughly a quarter of all humans are infected. Because Toxoplasma tends to infect the brain of its victims, it seems likely that it affects the behavior of the host as well. In humans, toxoplasmosis may be associated with some mental illnesses, and it may be associated with risky behavior. Yereli et al. (2006) compared the prevalence of Toxoplasma gondii in a sample of 185 drivers between 21 and 40 years old who had been involved in a driving accident (cases) with a sample of 185 drivers of similar age and sex who had not had accidents (controls). The researchers were interested in whether Toxoplasma infection may cause a change in the probability of an accident. Their data are shown in Table 9.3-1 and Figure 9.3-1.

TABLE 9.3-1 The frequency of Toxoplasma gondii infection in a sample of drivers involved in driving accidents (cases) compared with a sample of drivers with no accidents (controls). From Yereli et al. (2006).

              Drivers with accidents    Drivers without accidents
Infected               61                         16
Uninfected            124                        169

FIGURE 9.3-1 Mosaic plot illustrating the relative frequency of Toxoplasma gondii infection in a sample of drivers involved in a driving accident (cases) compared with a sample of drivers with no accidents (controls).

In this example, Toxoplasma infection is the explanatory variable. Occurrence of an accident (cases vs. controls) is the response variable. The association between the two variables is illustrated with a mosaic plot in Figure 9.3-1. Notice that in Figure 9.3-1 we’ve illustrated separate rows for cases (accidents) and controls (no accidents), and divided each row according to the frequency of infected and uninfected individuals. This slightly different arrangement from previous mosaic plots takes into account the unusual sampling design of the case-control study, whereby individuals are sampled according to their value for the response variable (here, accidents vs. no accidents) and subsequently measured for their exposure to the explanatory variable (here, infected vs. uninfected). However, we have maintained the explanatory variable on the horizontal axis. We recommend this graphical convention for displaying case-control frequency data using mosaic plots, and we adopt it in the rest of this book.
Unfortunately, these data cannot be used to estimate the relative risk of an accident, comparing groups with and without infection. This is because the calculation of relative risk requires an unbiased estimate of p1, the probability that an infected individual has an accident. It also requires an unbiased estimate of p2, the probability that an uninfected individual has an accident. Such estimates are unavailable with this kind of data because of the case-control study design. Within each group, we do not have a random sample of drivers to use in estimating the probability of an accident. The data are enriched with drivers who have had accidents, compared to what we would see in a random sample of drivers from the population. As a result, the proportion of infected individuals in the study who were also in an accident is likely to be a severely biased estimate of the probability that an infected individual in the population has an accident. We are unable to calculate risk, and therefore we cannot calculate relative risk. (In contrast, if the researchers had first obtained two random samples of people with and without toxoplasmosis and then asked whether they had been in a car accident, we would be able to calculate the risk for both groups and then the relative risk.)

A case-control study is a type of observational study in which a sample of individuals with a focal condition (cases) is compared to a sample of subjects lacking the condition (controls).

Nevertheless, we can estimate the odds ratio of an accident, comparing infected and uninfected drivers. With odds ratios, the overall proportions of cases and controls cancel out in the ratio. As a result, the odds ratio is unaffected by having a sample of cases and controls that doesn’t match the population proportions of cases and controls. Thus, a study with a case-control design can be analyzed even if the total numbers of cases and controls in the samples are chosen by the experimenter rather than by sampling at random from the population. To finish this example, let’s calculate the odds ratio with the usual formula:

ÔR = ad/(bc) = (61 × 169)/(124 × 16) = 5.20.

The odds of an accident are estimated to be about five times higher for drivers infected with Toxoplasma than for uninfected drivers. Recall that if the focal event is rare in the population, then relative risk and odds ratio are approximately the same magnitude. Hence, if driving accidents are rare in the population, the relative risk is also about fivefold. However, if accidents were common in this population, this interpretation would be inaccurate.
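A sketch of this case-control odds ratio in Python, with the confidence interval from the formula of the previous section added as a check:

```python
# Odds ratio for the case-control data of Example 9.3 (Toxoplasma infection
# and driving accidents), with an approximate 95% confidence interval.
from math import log, sqrt, exp

a, b = 61, 124     # had an accident: infected, uninfected
c, d = 16, 169     # no accident:     infected, uninfected

or_hat = (a * d) / (b * c)               # about 5.20
se = sqrt(1/a + 1/b + 1/c + 1/d)
lo = exp(log(or_hat) - 1.96 * se)        # roughly 2.9
hi = exp(log(or_hat) + 1.96 * se)        # roughly 9.4
print(or_hat, lo, hi)
```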
The χ² contingency test

Relative risk and odds ratio allow us to estimate the magnitude of association between two categorical variables. However, they do not directly test whether an association may be caused by chance alone. The χ² contingency test is the most commonly used test of association between two categorical variables. It tests the goodness of fit to the data of the null model of independence of variables.

The χ² contingency test is the most commonly used test of association between two categorical variables.

Example 9.4 illustrates how the method works.

EXAMPLE 9.4 The gnarly worm gets the bird

Many parasites have more than one species of host, so the individual parasite must get from one host to another to complete its life cycle. Trematodes of the species Euhaplorchis californiensis use three hosts during their life cycle. Worms mature in birds and lay eggs that pass out of the bird in its feces. The horn snail Cerithidea californica eats these eggs, which hatch and grow to another life stage in the snail, sterilizing the snail in the process. When an infected snail is eaten by the California killifish Fundulus parvipinnis, the parasite develops to the next life stage and encysts in the fish’s braincase. Finally, when the killifish is eaten by a bird, the worm becomes a mature adult and starts the cycle again.

Researchers have observed that infected fish spend excessive time near the water surface, where they may be more vulnerable to bird predation. This would certainly be to the worm’s advantage, as it would increase its chances of being ingested by a bird, its next host. Lafferty and Morris (1996) tested the hypothesis that infection influences risk of predation by birds. A large outdoor tank was stocked with three kinds of killifish: unparasitized, lightly infected, and heavily infected. This tank was left open to foraging by birds, especially great egrets, great blue herons (pictured), and snowy egrets. Table 9.4-1 lists the numbers of fish eaten according to their levels of parasitism.

TABLE 9.4-1 Observed frequencies of fish eaten or not eaten by birds, according to trematode infection level.

                     Uninfected    Lightly infected    Highly infected    Row total
Eaten by birds            1              10                  37               48
Not eaten by birds       49              35                   9               93
Column total             50              45                  46              141

We can use a mosaic plot to visualize the pattern in the data (Figure 9.4-1). Only 2% of the uninfected fish were eaten, while 22% and 80% of the lightly and heavily infected fish, respectively, died from predation.

FIGURE 9.4-1 A mosaic plot for bird predation on killifish having different levels of trematode parasitism. The red areas represent the relative frequency of fish eaten by birds, and the gold areas are fish that escaped bird predation. A total of n = 141 fish are represented in these data.

Hypotheses

We want to test whether the probability of being eaten by birds differs according to infection status. In other words, we want to test whether the categorical variables “infection level” and “being eaten” are independent. The null and alternative hypotheses are as follows:

H0: Parasite infection and being eaten are independent.
HA: Parasite infection and being eaten are not independent.

To carry out the χ² contingency test, we need to calculate the expected frequencies for each of the cells in Table 9.4-1 under the null hypothesis of independence.

Expected frequencies assuming independence

Recall from Section 5.6 that, if two events are independent, then, by definition, the probability of both occurring is equal to the probability of one event occurring times the probability of the other event occurring (this is the multiplication rule). We use the multiplication rule to calculate the expected proportion of individual fish under each combination of events and then the expected frequencies under the null hypothesis. For example, if the infection status of a fish in our sample is independent of whether it’s eaten, then

Pr[uninfected and eaten] = Pr[uninfected] × Pr[eaten].

To calculate the expected fraction of fish both uninfected and eaten, though, we still need to estimate the probability that a fish in the sample is uninfected and the chance that a fish was eaten. We can estimate these probabilities from the data in Table 9.4-1. The estimated probability that a fish was uninfected is the total number of uninfected fish in the sample (50) divided by the total number of fish (141):

P̂r[uninfected] = 50/141 = 0.3546.
We’ve marked the probability estimate with a “hat” (ˆ) to indicate that it is an estimate and not the true value. We can estimate the probability of being eaten in the same way, by dividing the number of fish eaten (48) by the total number of fish (141):

P̂r[eaten] = 48/141 = 0.3404.

Under the null hypothesis of independence, therefore, the probability of a fish being uninfected and eaten is expected to be

P̂r[uninfected and eaten] = 0.3546 × 0.3404 = 0.1207.

This means that the expected frequency of fish both uninfected and eaten is this probability (0.1207) times the total number of individuals in the data set (141):

Expected[uninfected and eaten] = 0.1207 × 141 = 17.0.

Repeating the preceding procedure for the other cells in Table 9.4-1 gives the expected frequencies of all combinations of outcomes. These values are listed in Table 9.4-2, but you should make sure that you can calculate them on your own.

TABLE 9.4-2 Expected frequencies of fish eaten and not eaten by birds, according to trematode infection status.

                     Uninfected    Lightly infected    Highly infected    Row total
Eaten by birds          17.0            15.3                15.7              48
Not eaten by birds      33.0            29.7                30.3              93
Column total            50              45                  46               141

Note that the row and column totals in Table 9.4-2 match the totals in the actual data (Table 9.4-1). This must be true, because we used the proportions in the data themselves to generate our expected frequencies. If the row and column totals are not the same for the observed and expected frequencies (within rounding error), a calculation error has been made. Moreover, the expected frequencies don’t have to be integers. Remember that we use “expected” in the sense of “on average,” so we don’t necessarily expect an integer.

The χ² statistic

The observed frequencies in Table 9.4-1 are quite different from the expected frequencies in Table 9.4-2. From this point on, the χ² contingency analysis is just a special case of the χ² goodness-of-fit test. We calculate the χ² statistic to test whether these discrepancies are greater than expected by chance. Using r to represent the number of rows and c to represent the number of columns, we get

χ² = Σ over all r rows and c columns of [Observed(row, column) − Expected(row, column)]² / Expected(row, column).

This χ² calculation simply adds across all cells of the contingency table. When we apply it to the data, we get

χ² = (1 − 17.0)²/17.0 + (49 − 33.0)²/33.0 + (10 − 15.3)²/15.3 + (35 − 29.7)²/29.7 + (37 − 15.7)²/15.7 + (9 − 30.3)²/30.3 = 69.5.

Degrees of freedom

The sampling distribution of the χ² test statistic under the null hypothesis of independence is approximated by the theoretical χ² distribution. To calculate the degrees of freedom for the χ² distribution, we count the number of rows (r) and the number of columns (c) in the data table. The number of degrees of freedom, then, is given by

df = (r − 1)(c − 1).

For Example 9.4, Table 9.4-1 has two rows and three columns, so there are two degrees of freedom:

df = (2 − 1)(3 − 1) = 2.

P-value and conclusion

The critical value for the χ² distribution with df = 2 and significance level α = 0.05 is 5.99 (see Statistical Table A). Our observed value (χ² = 69.5) is far out in the tail of the distribution, much greater than the critical value of 5.99. We therefore reject the null hypothesis that infection level and probability of being eaten are independent in killifish. Instead, the probability of being eaten is contingent upon whether the fish was parasitized. Trematode parasitism in these killifish was associated with higher rates of predation by birds.
We reach the same conclusion when we use a computer to calculate the P-value for a χ² value of 69.5 having two degrees of freedom: P < 10⁻¹⁰. How do we explain this result? If differences in infection are truly the cause of the differences in predation risk, then the most likely explanation is that the worms modify fish behavior or hinder their ability to escape, increasing their chances of being eaten by birds and thus completing the last transition in the worm’s life cycle.

A shortcut for calculating the expected frequencies

Here’s a shortcut formula for calculating the expected frequencies in fewer keystrokes on your calculator. The expected cell value for a given row and column is

Expected[row i, column j] = (Row i total) × (Column j total) / Grand total.

For the killifish parasite data, for example, the expected frequency for the top left cell in Table 9.4-2 (uninfected fish that were eaten) can be calculated by multiplying the total for its row (48) by the total for its column (50) and dividing by the grand total (141):

Expected[row 1, column 1] = (48 × 50)/141 = 17.0.

The last cell in a row or column can also be computed by subtraction, because the sum of the expected frequencies for a given row or column is the same as the sum of the observed values. Thus, the expected frequency of the top right cell in Table 9.4-2 is 48 − 17.0 − 15.3 = 15.7. (The number of cells that we cannot calculate by subtraction is the number of degrees of freedom for the test. The expected values of cells that are calculated by subtraction are fixed and not free to vary.)

This shortcut comes from the definition of independence and the way we estimate the probability of belonging to each row or column:

Expected = Pr[row] × Pr[column] × Grand total = (Row total / Grand total) × (Column total / Grand total) × Grand total.

Canceling terms allows you to take the shortcut.

The χ² contingency test is a special case of the χ² goodness-of-fit test

You may have noticed that, once we specified the expected values, the χ² contingency test was very similar to the χ² goodness-of-fit test introduced in Chapter 8. This resemblance is no accident, because the χ² contingency test is a special application of the more general goodness-of-fit test for which the probability model being tested is the independence of variables. The number of degrees of freedom for the contingency test obeys the same rules as those for the goodness-of-fit test.

Assumptions of the χ² contingency test

The χ² contingency test makes the same assumptions as the χ² goodness-of-fit test. No more than 20% of the cells can have an expected frequency less than five, and no cell can have an expected frequency less than one. If these rules are not met, at least three options are available. First, if the table is bigger than 2 × 2, then two or more row categories (or two or more column categories) can be combined to produce larger expected frequencies. This should be done carefully, though, so that the resulting categories are still meaningful. (For example, the three categories of infection in the trematode predation experiment could have been collapsed into two categories—namely, “uninfected” and “infected”—if necessary.) Second, if the table is 2 × 2, then Fisher’s exact test should be used instead. Fisher’s exact test is summarized in Section 9.5. Finally, a permutation test may be used instead of the χ² test, an approach that we discuss further in Chapter 13.
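In practice, the entire test is usually run in software. A sketch using Python's scipy library, applied to the killifish counts of Table 9.4-1:

```python
# The chi-square contingency test of Example 9.4, done by computer.
# Rows: eaten / not eaten; columns: uninfected, lightly, highly infected.
from scipy.stats import chi2_contingency

table = [[1, 10, 37],
         [49, 35, 9]]

chi2, p, df, expected = chi2_contingency(table)
print(chi2)       # about 69.5
print(df)         # 2
print(p)          # far below 0.001
print(expected)   # matches Table 9.4-2 (17.0, 15.3, 15.7, ...)
```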
Correction for continuity

Some statisticians recommend a modified formula to calculate the χ² test statistic in the case of a 2 × 2 contingency table. The modification is known as the Yates correction for continuity:

χ² = Σ over both rows and both columns of (|Observed(row, column) − Expected(row, column)| − 1/2)² / Expected(row, column).

All other steps in the Yates-corrected test are the same as in an ordinary χ² contingency test. We mention the Yates correction here because you will encounter it in the biological literature. However, we don’t recommend that you use it. The correction makes the χ² contingency test too conservative (Maxwell 1976). That is, the Yates-corrected test overestimates the correct P-value, with the result that the power of the test is reduced—it is less likely to reject a false null hypothesis.
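Be aware that some software applies this correction silently. In Python's scipy, for example, chi2_contingency applies the Yates correction to 2 × 2 tables by default; a sketch, using the aspirin table of Example 9.2:

```python
# Chi-square statistic for a 2 x 2 table with and without the Yates
# correction (scipy applies the correction to 2 x 2 tables by default).
from scipy.stats import chi2_contingency

aspirin = [[1438, 1427],      # cancer: aspirin, placebo
           [18496, 18515]]    # no cancer: aspirin, placebo

print(chi2_contingency(aspirin, correction=True)[0])   # Yates-corrected
print(chi2_contingency(aspirin, correction=False)[0])  # uncorrected
```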
Fisher’s exact test

Fisher’s exact test, named after Sir Ronald A. Fisher, provides an exact P-value for a test of association in a 2 × 2 contingency table. The test is an improvement over the χ² contingency test in cases where the expected cell frequencies are too low to meet the rules demanded by the χ² approximation.

Fisher’s exact test examines the independence of two categorical variables, even with small expected values.

The calculation of the P-value in Fisher’s exact test is cumbersome and is best done with a computer statistical package. Therefore, we do not detail the calculations here, but we instead focus on what the test can do and when it is appropriate.

EXAMPLE 9.5 The feeding habits of vampire bats

In Costa Rica the common vampire bat, Desmodus rotundus, commonly feeds on the blood of domestic cattle. The bat prefers cows to bulls, which suggests that the bats might respond to a hormonal signal. To explore this behavior further, a researcher compared vampire bat attacks on cows in estrus (“in heat”) with attacks on cows not in estrus on a particular night (Turner 1975). The results are presented in Table 9.5-1. Do cows in estrus differ from cows not in estrus in their chance of being attacked?

TABLE 9.5-1 Numbers of cattle by estrus status and by vampire bat bite status.

                             Cows in estrus    Cows not in estrus    Row totals
Bitten by vampire bat              15                   6                21
Not bitten by vampire bat           7                 322               329
Column totals                      22                 328               350

The null and alternative hypotheses are as follows:

H0: State of estrus and vampire bat attack are independent.
HA: State of estrus and vampire bat attack are not independent.

We are tempted to analyze these data with a χ² contingency test. However, if we calculate the expected values in the usual way, we find that the expected frequency for cows in estrus that were bitten by vampire bats is too low (Table 9.5-2).

TABLE 9.5-2 The expected frequency values for the vampire bat study.

                             Cows in estrus    Cows not in estrus    Row totals
Bitten by vampire bat              1.3                19.7               21
Not bitten by vampire bat         20.7               308.3              329
Column totals                      22                 328               350

According to the null hypothesis, we expect to see only 1.3 cows that were both in heat and bitten by a vampire bat. This means that one out of the four cells (25%) has an expectation less than five, whereas the rule for a χ² test is that no more than 20% of cells should have expectations that low. Because this is a 2 × 2 contingency table, we can turn to Fisher’s exact test. The null and alternative hypotheses remain the same as in the χ² test.

Fisher’s test proceeds by listing all 2 × 2 tables that are as extreme as or more extreme than the observed table of numbers under the null hypothesis of independence. For example, the more extreme tables in one tail are those in which even more of the bitten cows were in estrus. Row and column totals remain the same in the alternative tables, so we just change one of the values and adjust the others to match. Focus on the top right corner of each table:

16   5      17   4      18   3      19   2      20   1      21   0
 6 323       5 324       4 325       3 326       2 327       1 328

The P-value for Fisher’s exact test is the sum of the probabilities of all such extreme tables under the null hypothesis of independence, including equally or more extreme tables in the other tail of the distribution of tables. There are also confidence intervals for odds ratios based on small samples, using the same logic as Fisher’s exact test (see Agresti 2002).

We applied Fisher’s exact test to the data in Table 9.5-1, using a statistical program on the computer, and we found that P < 0.0001. Thus, we can reject the null hypothesis of independence. Vampire bats evidently preferred the cows in estrus. The reasons for this are not clear.

G-tests

The G-test is another contingency test often seen in the literature. The G-test is almost the same as the χ² test except that the following test statistic is used:

G = 2 Σ over all r rows and c columns of Observed(row, column) × ln[Observed(row, column) / Expected(row, column)],

where “ln” refers to the natural logarithm. Under the null hypothesis of independence, the sampling distribution of the G-test statistic is approximately χ² with (r − 1)(c − 1) degrees of freedom. For the data in Example 9.4 on fish infection and bird predation, for example,

G = 2(1 × ln[1/17.0] + 49 × ln[49/33.0] + 10 × ln[10/15.3] + 35 × ln[35/29.7] + 37 × ln[37/15.7] + 9 × ln[9/30.3]) = 77.6.

This G-test statistic would be compared to the χ² distribution with df = (r − 1)(c − 1) = 2. Again, we would strongly reject the null hypothesis of independence.

The G-test is derived from principles of likelihood (see Chapter 20) and can be applied across a wider range of circumstances than the χ² contingency test. While the G-test is preferred by some statisticians, it has been shown to be less accurate for small sample sizes (Agresti 2002), and it is used somewhat less often than the χ² contingency test in the biological literature. The G-test has advantages, though, when analyzing complicated experimental designs involving multiple explanatory variables (see Sokal and Rohlf 1995 or Agresti 2002).
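Both tests are short in most statistical software. A sketch with scipy, using the vampire bat table for Fisher's exact test and the killifish table for the G statistic:

```python
# Fisher's exact test for the vampire bat data (Table 9.5-1), and a G-test
# statistic computed from the same expected frequencies as the chi-square test.
from math import log
from scipy.stats import fisher_exact, chi2, chi2_contingency

bats = [[15, 6],       # bitten: estrus, not estrus
        [7, 322]]      # not bitten: estrus, not estrus
odds_ratio, p = fisher_exact(bats)
print(p)               # far below 0.0001, as in Example 9.5

# G-test on the killifish data of Example 9.4:
fish = [[1, 10, 37], [49, 35, 9]]
expected = chi2_contingency(fish)[3]
g = 2 * sum(o * log(o / e)
            for row_o, row_e in zip(fish, expected)
            for o, e in zip(row_o, row_e))
print(g)                          # about 77.6
print(chi2.sf(g, 2))              # P-value from the chi-square distribution
```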
Summary

■ The odds of success are the probability of success divided by the probability of failure, where “success” refers to the outcome of interest.
■ The odds ratio is the odds of success in one of two groups (the treatment group, if one is present) divided by the odds of success in the second group (the control group, if one is present). The odds ratio is used to quantify the magnitude of association between two categorical variables, each of which has two categories.
■ Risk is the probability of an undesired event. Relative risk is the risk in a treatment group divided by the risk in a control group. If relative risk is less than one, then the treatment is associated with reduced risk.
■ The χ2 contingency test makes it possible to test the null hypothesis that two categorical variables are independent.
■ The sampling distribution of the χ2 statistic under the null hypothesis is approximately χ2 distributed with (r − 1)(c − 1) degrees of freedom. The χ2 approximation works well, provided that two rules are met: no more than 20% of the expected frequencies can be less than five, and none can be less than one.
■ Fisher’s exact test calculates an exact P-value for the test of independence of two variables in a 2 × 2 table. The test is especially useful when the rules for the χ2 approximation are not met.
■ The G-test is an alternative method for testing the null hypothesis of independence with contingency analysis.

Quick Formula Summary

Confidence interval for odds ratio
What does it assume? Random samples.
Formula: ln(OR^) − Z SE[ln(OR^)] < ln(OR) < ln(OR^) + Z SE[ln(OR^)], where ln(OR^) is the natural logarithm of the estimate of the odds ratio, OR^ = ad/(bc), where a and b are the observed frequencies of the focal outcome (“success”) in the two treatment groups, and c and d are the frequencies of the second category of the response variable (see table on p. 240). SE[ln(OR^)] is the standard error of the log-odds ratio, SE[ln(OR^)] = sqrt(1/a + 1/b + 1/c + 1/d), and Z = 1.96 for a 95% confidence interval. The confidence interval for OR is found by taking the antilog of the limits of the confidence interval for ln(OR). When a, b, c, or d is zero, then add 1/2 to all four values before calculating the estimate of the odds ratio and its confidence interval (Sweeting et al. 2004).

Confidence interval for relative risk
What does it assume? Random samples.
Formula: ln(RR^) − Z SE[ln(RR^)] < ln(RR) < ln(RR^) + Z SE[ln(RR^)], where ln(RR^) is the natural logarithm of the estimate of relative risk, ln[RR^] = ln[p^1 / p^2], and p^1 and p^2 are the estimated proportions of the undesired outcome (risk) for the two groups, such that p^1 = a/(a + c) and p^2 = b/(b + d). When a, b, c, or d is zero, then add 1/2 to all four values before calculating the estimate of the relative risk and its confidence interval. SE[ln(RR^)] is the standard error of the log-relative risk, SE[ln(RR^)] = sqrt(1/a + 1/b − 1/(a + c) − 1/(b + d)), and Z = 1.96 for a 95% confidence interval. The confidence interval for RR is found by taking the antilog of the lower and upper limits of the confidence interval for ln(RR).

The χ2 contingency test
What is it for? Testing the null hypothesis of no association between two or more categorical variables.
What does it assume? Random samples; the expected frequency of each cell is greater than one; no more than 20% of the cells have expected frequencies less than five.
Test statistic: χ2
Sampling distribution under H0: χ2 distribution with (r − 1)(c − 1) degrees of freedom, where r and c are the numbers of rows and columns, respectively.
Formula: χ2 = Σ(row=1 to r) Σ(column=1 to c) [Observed(row, column) − Expected(row, column)]^2 / Expected(row, column).

Fisher’s exact test
What is it for? Testing the null hypothesis of no association between two categorical variables, each having two categories. Appropriate with small expected values.
What does it assume? Random samples.
Formula: P = 2 Σ(all equally or more extreme tables) R1! R2! C1! C2! / (a! b! c! d! n!), where R1 and R2 are the row totals and C1 and C2 are the column totals; a, b, c, and d are the cell values for each of the cells; and n is the total sample size. The summation is over all tables, including the observed table and any tables with the same row and column totals more different from H0 than the observed table.

G-test
What is it for? Testing the null hypothesis of no association between two or more categorical variables.
What does it assume? Random samples; no more than 20% of cells have expected frequencies less than five.
Test statistic: G
Sampling distribution under H0: χ2 distribution with (r − 1)(c − 1) degrees of freedom, where r and c are the numbers of rows and columns, respectively.
Formula: G = 2 Σ(row=1 to r) Σ(column=1 to c) Observed(row, column) × ln[ Observed(row, column) / Expected(row, column) ].
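The two confidence-interval formulas above translate directly into code. The sketch below is an editorial supplement, not part of the book; it assumes Python, and the counts a, b, c, d are made-up values chosen only to illustrate the arithmetic.

    import math

    a, b, c, d = 30, 11, 70, 89    # hypothetical 2 x 2 counts, for illustration

    if 0 in (a, b, c, d):          # zero-cell adjustment (Sweeting et al. 2004)
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))

    Z = 1.96                       # for a 95% confidence interval

    # Odds ratio and its confidence interval
    ln_OR = math.log((a * d) / (b * c))
    SE_OR = math.sqrt(1/a + 1/b + 1/c + 1/d)
    CI_OR = (math.exp(ln_OR - Z * SE_OR), math.exp(ln_OR + Z * SE_OR))

    # Relative risk and its confidence interval
    p1, p2 = a / (a + c), b / (b + d)
    ln_RR = math.log(p1 / p2)
    SE_RR = math.sqrt(1/a + 1/b - 1/(a + c) - 1/(b + d))
    CI_RR = (math.exp(ln_RR - Z * SE_RR), math.exp(ln_RR + Z * SE_RR))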
PRACTICE PROBLEMS

1. Calculation practice: Odds ratio and relative risk. Wilson et al. (2011) followed a set of male health professionals for 20 years. Of all the men in the study, 7890 drank no coffee and 2492 drank on average more than 6 cups per day. In the “no coffee” group, 122 developed advanced prostate cancer during the course of the study, and 19 in the “high coffee” group did. a. Create a contingency table for these data. Follow the convention recommended in Chapter 2: the explanatory variable is in the columns and the response variable is in the rows. What association is suggested? b. What is the estimated probability of advanced prostate cancer for the high-coffee group (i.e., what is the risk of advanced prostate cancer for men who drink more than six cups of coffee per day)? c. What is the estimated probability of advanced prostate cancer for the no-coffee group (i.e., what is the risk of advanced prostate cancer for non-coffee drinkers)? d. What is the relative risk of advanced prostate cancer, comparing the treatment (high-coffee) and control (no-coffee) groups? e. What are the odds of advanced prostate cancer for high-coffee consumers? f. What are the odds of advanced prostate cancer for non-coffee drinkers? g. What is the odds ratio of advanced prostate cancer, comparing these two groups? h. What is the log odds ratio comparing these two groups? i. What is the standard error of the log odds ratio in this case? j. What is the 95% confidence interval for the log odds ratio? k. What is the 95% confidence interval for the odds ratio? l. Interpret this confidence interval for the odds ratio. Is it consistent with the possibility that coffee drinking and developing advanced prostate cancer are independent? Does coffee consumption tend to be associated with an increased or decreased probability of developing advanced prostate cancer?

2. Calculation practice: χ2 contingency analysis. Married couples often split up after one member is diagnosed with a catastrophic disease, such as terminal cancer or a brain tumor. Does the frequency of breakup depend on which member is diagnosed? Glantz et al. (2009) tallied divorces after such serious diagnoses in 515 patients in opposite-sex marriages. Of the 261 couples in which the man was ill, 7 divorced soon after diagnosis. Of 254 couples in which the woman was the patient, 53 divorced. Test for a difference between the two types of couples in the proportion divorcing after diagnosis. a. Summarize the data in a contingency table and examine the frequencies. Do divorce frequencies appear similar between the two types of patients? If they differ, for which sex does diagnosis seem more often to lead to divorce? b. State the null and alternative hypotheses for the test. c. In what proportion of couples was the man the diagnosed patient? d. What proportion of couples divorced? e. Under the null hypothesis, what is the expected proportion of each of the four possible combinations of outcomes? f. What is the total number of observations in the study? g. Under the null hypothesis, what is the expected number (frequency) of observations in each combination? h. Examine the expected numbers. Is it legitimate to use a χ2 contingency test with these data? Why? i. Calculate the test statistic, χ2, for these data. j. How many degrees of freedom does the χ2 test statistic have? k. What is the critical value for this test corresponding to significance level α = 0.05? l. Calculate the P-value for the test. m. What is your conclusion? Are sex of the diagnosed patient and divorce independent?
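For checking answers to calculation-practice problems like these, the following sketch (an editorial supplement assuming Python with numpy and scipy; it uses made-up counts rather than the data above, so that it does not give the answers away) mirrors steps (f) through (l) of Problem 2.

    import numpy as np
    from scipy.stats import chi2

    obs = np.array([[12, 30],
                    [88, 70]])   # hypothetical 2 x 2 table of counts

    n = obs.sum()                                               # part (f)
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n   # part (g)
    chi2_stat = ((obs - expected) ** 2 / expected).sum()        # part (i)
    df = (obs.shape[0] - 1) * (obs.shape[1] - 1)                # part (j)
    critical = chi2.ppf(0.95, df)                               # part (k)
    P = chi2.sf(chi2_stat, df)                                  # part (l)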
3. The hypothetical plots at the top of the next page show the relative frequencies of subjects assigned to two experimental groups (treatment and control). The frequencies of negative outcomes (red) and positive outcomes (gold) are illustrated. For each plot (a), (b), and (c), identify the correct value of relative risk from the following list of choices: 0.1, 0.5, 1, 2, or 9.

4. The common pigeon found in most American cities is derived from a domesticated European species released in North America. As a result, the pigeons in North America have variations in coloration caused by genes previously selected by pigeon fanciers. An example is the rump, whose feathers are white in wild European pigeons but blue in many pigeons in North America. It has been hypothesized that the white rump of pigeons serves to distract predators like peregrine falcons, and therefore it may be an adaptation to reduce predation. To test this, researchers followed the fates of 203 pigeons, 101 with white rumps and 102 with blue rumps. Nine of the white-rumped birds were captured and killed by falcons, while 92 of the blue-rumped birds were killed (Palleroni et al. 2005). a. Show the results in a frequency table. Follow the convention recommended in Chapter 2: put the explanatory variable in the columns and the response variable in the rows. What association is suggested? b. Do the two kinds of pigeons differ in their rate of capture by falcons? Carry out an appropriate test. c. What is the estimated odds ratio for capture of the two groups of pigeons? What is the 95% confidence interval for the odds ratio?

5. Malaria, which kills more than a million people each year worldwide, is caused by Plasmodium parasites that spread between hosts via infected mosquitoes. The more people bitten by each infected mosquito, the higher the transmission rate of malaria. Does infection by Plasmodium cause a mosquito to bite more people? To test this, researchers captured 262 mosquitoes that had human blood in their guts (Koella et al. 1998). They measured two attributes: whether mosquitoes were infected with malaria, and whether they had fed on the blood of more than one person (assessed by DNA fingerprinting of blood in mosquito guts). Of 173 uninfected mosquitoes, 16 had taken multiple blood meals. Of 89 infected mosquitoes, 20 had fed multiple times. a. Illustrate these results with a graph. What association is suggested? b. Do these data support the idea that infected mosquitoes behave differently than uninfected mosquitoes?

6. Female Australian redback spiders, Latrodectus hasselti, are about 50 times larger than males and often eat the males during mating. This might sound like a horrible accident, but males could gain some indirect advantage by being cannibalized in this way. Perhaps a female is more likely to accept the sperm of a male that she has eaten than that of a male that has escaped. Researchers watched the mating behavior of 32 virgin female redback spiders, recording whether each female ate her first mate and then whether she rejected advances from a second male later placed in her vicinity (Andrade 1996). The results were as follows:

                     2nd male accepted   2nd male rejected
1st male eaten               3                   22
1st male escapes             6                    1

a. How does cannibalism affect the odds that the second male is accepted? b. What method would you use to test the association between these two variables? Why?

7. Fires are a common and important part of many ecosystems. Many species have evolved mechanisms for dealing with fire.
Reed frogs, a species living in West Africa, have been observed hopping away from grass fires long before the heat of the fire reached the area they were in. This finding led to the hypothesis that the frogs might hear the fire and respond well before the fire reaches them. To test this hypothesis, researchers played three types of sound to samples of reed frogs and recorded their response (Grafe et al. 2002). Twenty frogs were exposed to the sound of fire, 20 were exposed to the sound of fire played backward (to control for the range of sound frequencies present in the real sound), and 20 were exposed to equally loud white noise. Of these 60 frogs, 18 hopped away from the sound of fire, 6 hopped away from the sound of fire played backward, and 0 hopped away from the white noise. a. Illustrate these data with a frequency table. What association is suggested? b. Do the data provide evidence that reed frogs change their behavior in response to the sound of fire?

8. Some fish are famously able to develop into either sex, depending on social circumstances. One study of a coral reef fish, the goby Gobiodon erythrospilus, placed juvenile fish with either an adult male or an adult female (Hobbs et al. 2004). Of the 12 juveniles placed with a male, 11 became female. Of the 10 juveniles placed with an adult female, six became male. What method can we use to test whether the social context of the juvenile fish affects which sex they become?

9. Between 20 and 25 violent acts are portrayed per hour in children’s television programming. A study (Johnson et al. 2002) of the possible link between TV viewing and aggression followed the TV viewing habits of children between 1 and 10 years old. Of these children, 88 watched less than one hour of TV per day, 386 watched 1–3 hours per day, and 233 watched more than three hours per day. Eight years later, researchers evaluated the kids to see if they had a police record or had assaulted and injured another person. The numbers of aggressive individuals from the three TV-watching groups were 5, 87, and 67, respectively. a. Estimate the proportion of kids in each TV-watching category who subsequently became violent. Give 95% confidence intervals for these estimates. b. Is there evidence that childhood TV viewing is associated with future violence? Carry out an appropriate statistical test. c. Does this prove that TV watching causes increased aggression in kids? Why or why not?

10. A study by Doll et al. (1994) examined the relationship between moderate intake of alcohol and the risk of heart disease. In all, 410 men (209 “abstainers” and 201 “moderate drinkers”) were observed over a period of 10 years, and the number experiencing cardiac arrest over this period was recorded and compared with drinking habits. All men were 40 years of age at the start of the experiment. By the end of the experiment, 12 abstainers had experienced cardiac arrest and nine moderate drinkers had experienced cardiac arrest. a. Test whether the relative frequency of cardiac arrest was different in the two groups of men. b. Assume that you were unable to reject the null hypothesis in part (a). Would this imply that drinking has no effect on the risk of cardiac arrest? Why or why not?

11. Postnatal depression affects approximately 8–15% of new mothers. One theory about the onset of postnatal depression predicts that it may result from the stress of a complicated delivery. If so, then the rates of postnatal depression could be affected by the type of delivery.
A study (Patel et al. 2005) of 10,934 women compared the rates of postnatal depression in mothers who delivered vaginally to those who had voluntary cesarean sections (C-sections). Of the 10,545 women who delivered vaginally, 1025 suffered significant postnatal depression. Of the 389 who delivered by voluntary C-section, 48 developed postnatal depression. a. Draw a graph of the association between postnatal depression and type of delivery. What is the pattern in these data? b. How different are the odds of depression under the two procedures? Calculate the odds ratio of developing depression, comparing vaginal birth to C-section. c. Calculate a 95% confidence interval for the odds ratio. d. Based on your result in part (c), would the null hypothesis that postpartum depression is independent of the type of delivery likely be rejected if tested? e. What is the relative risk of postpartum depression under the two procedures? Compare your estimate to the odds ratio calculated in part (b).

12. Migraine with aura is a potentially debilitating condition, yet little is known about its causes. A case-control study compared 93 people who suffer from chronic migraine with aura (cases) to a sample of 93 healthy patients (controls; Schwerzmann et al. 2005). The researchers used transesophageal echocardiography to look for cardiac shunts in all of these patients. (A cardiac shunt is a heart defect that causes blood to flow from the right to the left in the heart, causing poor oxygenation.) Forty-four of the migraine patients were found to have a cardiac shunt, while only 16 of the people without migraine symptoms had this heart defect. a. Is this an observational or experimental study? b. Show the association between migraines and cardiac shunts with a mosaic plot. c. How strong is the association between migraines and cardiac shunts? Calculate the odds ratio for migraine, comparing the patients with and without cardiac shunts. d. What is the 95% confidence interval for this odds ratio?

13. Spot the flaw. Since 1953, when Tenzing Norgay and Edmund Hillary reached the summit of Mount Everest, many climbers have attempted to scale the world’s two highest mountains, Everest and K2. Norgay and Hillary aided their climb by bringing supplemental oxygen in tanks, and some later groups have attempted to outdo the originals by trying the ascent without supplemental oxygen. From 1978 to 1999, in fact, 159 teams comprising 1173 team members attempted to climb either Everest or K2. The numbers of individuals who either survived or died during those attempts are given in the following table (data from Huey and Eguskitza 2000):

                   Supplemental O2   No supplemental O2   Row totals
Survived                1045                 88               1133
Did not survive           32                  8                 40
Column totals           1077                 96               1173

A χ2 contingency test on these data calculated χ2 = 7.694 with one degree of freedom, which corresponds to P = 0.0055. The null hypothesis is that oxygen use has no effect on survivorship during these expeditions. What’s wrong with this analysis?

14. A “Mediterranean diet” (high in fish, olive oil, red wine, etc.) has been touted as a key to a long life. A study by Trichopoulou et al. (2005) looked at the death rates of people according to whether their diet had a low component, a medium component, or a high component of foods that characterize a Mediterranean diet. In these kinds of studies, it is important to look for other confounding variables, such as smoking, that might be correlated with the main variable under study.
For each person in the study, therefore, the team also recorded whether the person was a current smoker, a former smoker, or had never smoked. We want to know if there is an association between diet and smoking. The data for men in the study are as follows, where the numbers represent the number of men in each category:

                   Low Med. diet   Medium Med. diet   High Med. diet
Never smoked           2516              2920              2417
Former smoker          3657              4653              3449
Current smoker         2012              1627              1294

a. Draw a mosaic plot of these data. What is the pattern? b. Test whether there is an association between diet and smoking. c. Comment on how the relationship you found in part (b) would affect the interpretation of a study that looked for health effects of switching to a Mediterranean diet.

ASSIGNMENT PROBLEMS

15. The hypothetical plots at the bottom of this page show the relative frequencies of subjects assigned to two experimental groups (treatment and control). The frequencies of negative outcomes (red) and positive outcomes (gold) are illustrated. For each plot (a), (b), and (c), identify the correct value of the odds ratio: 0.1, 0.5, 1, 2, or 9.

FIGURE FOR PROBLEM 15

16. In animals without paternal care, the number of offspring sired by a male increases as the number of females he mates with increases. This fact has driven the evolution of multiple matings in the males of many species. It is less obvious why females mate multiple times, because it would seem that the number of offspring that a female has would be limited by her resources and not by the number of her mates, as long as she has at least one mate. To look for advantages of multiple mating, a study of the Gunnison’s prairie dog followed females to find out how many times they mated (Hoogland 1998). They then followed the same females to discover whether they gave birth later. The results are compiled in the following table:

Number of times female mated:     1    2    3    4    5
Number who gave birth:           81   85   61   17    5
Number who didn’t give birth:     6    8    0    0    0

Did the number of times that a female mated affect her probability of giving birth? a. Calculate expected frequencies for a contingency test. b. Examine the expected frequencies. Do they meet the assumptions of the χ2 contingency test? If not, what steps could you take to meet the assumptions and make a test? c. An appropriate test shows that the number of mates of the female prairie dogs is associated with giving birth. Does this mean that mating with more males increases the probability of giving birth? Can you think of an alternative explanation?

17. Some people feel that they have good intuition about when others are lying, while others do not feel they have this ability. Are the more “intuitive” people better able to detect lies? Each of 100 people who thought they had good intuitive abilities was shown a video clip of a person stating the name of his or her favorite movie (Young 2002). The person was truthful in some of the clips shown, whereas in others the person was lying. Another 100 people who claimed not to have intuitive abilities were shown similar video clips. Fifty-nine of the 100 “intuitive” subjects correctly identified whether the person in the video was lying, whereas 69 of the 100 “nonintuitive” subjects correctly identified whether the person in the video was lying. a. Draw a graph that best presents these data. What association is suggested? b. Test whether the success rates of the two groups were different. c. Are “intuitive” people better at detecting lies than “nonintuitive” people?
Calculate an odds ratio and confidence interval for your answer. Interpret your result.

18. Spot the flaw. Scottish researchers compared rates of depression between 94 undergraduates who regularly kept diaries and 41 students who did not. They found that people who kept diaries were more likely to have depression than those who did not. The researchers said, “We expected diary-keepers to have some benefit, or be the same, but they were the worst off. You are probably much better off if you don’t write anything at all.” Why is this an incorrect interpretation of these results?

19. It is well known, and scientifically documented, that yawning is contagious. When we see someone else yawn, or even think about someone yawning, we are very likely to yawn ourselves. (In fact, we predict that you are starting to want to yawn right now.) In a study of yawning contagion, researchers showed participants one of several pictures, including a picture of a man yawning, the same man smiling, a yawning man with his mouth covered, or a yawning man with his eyes obscured (Provine et al. 1989). Participants yawned much more often when shown the yawner than the smiler, but surprisingly an identical number also yawned when shown the picture with the mouth obscured. This suggests that something besides the mouth is an important trigger. What about the eyes? Seventeen of 30 participants yawned when confronted with a picture of a yawning man, while 11 of 30 independent participants yawned when shown a picture of a yawning man with his eyes covered. Is there evidence in these data that covering the yawning man’s eyes in an image changes the occurrence of contagious yawns?

20. Day care centers expose children to a wider variety of germs than the children would be exposed to if they stayed at home more often. This has the obvious downside of more frequent colds and other illnesses, but it also serves to challenge the immune system of children at a critical stage in their development. A study by Gilham et al. (2005) tested whether social activity outside the house in young children affected their probability of later developing the white blood cell disease acute lymphoblastic leukemia (ALL), the most common cancer of children. They compared 1272 children with ALL to 6238 children without ALL. Of the ALL kids, 1020 had significant social activity outside the home (including day care) when younger. Of the kids without ALL, 5343 had significant social activity outside the home. The rest of the children in both groups lacked regular contact with children who were not in their immediate families. a. Is this an experimental or observational study? b. What are the proportions of children with significant social activity in children with and without ALL? c. What is the odds ratio for ALL, comparing the groups with and without significant social activity? d. What is the 95% confidence interval for this odds ratio? e. Does this confidence interval indicate that amount of social activity is associated with ALL? If so, did the children with more social activity have a higher or lower occurrence of ALL? f. The researchers interpreted their study results in terms of the differing immune system exposure of the children, but gave several alternative explanations for the pattern. Can you think of any possible confounding variables?

21. Aging workers of the Neotropical termite, Neocapritermes taracua, develop blue crystal-containing glands (“backpacks”) on their backs.
When they fight intruding termites and are hampered, these “blue” termites explode, and the glands spew a sticky liquid (Šobotník et al. 2012). The following data are from an experiment that measured the toxicity of the blue substance. A single drop of the liquid extracted from blue termites was placed on individuals of a second termite species, Labiotermes labralis, and the number that were immobilized (dead or paralyzed) within 60 minutes was recorded. The frequency of this outcome was compared with a control treatment in which liquid from glands of “white” termites lacking the blue crystals was dropped instead. Is the blue liquid toxic compared to liquid from white termites?

Liquid source     Unharmed   Immobilized
Blue workers          3           37
White workers        31            9

22. Keenan et al. (2001) used anesthesia to investigate which brain hemisphere is involved in self-recognition. Ten subjects were randomly assigned to two groups. The left hemisphere was anesthetized in one group, whereas the right hemisphere was anesthetized in the other group. Each subject was then shown a picture generated by averaging (“morphing”) images of the face of a famous celebrity (e.g., Marilyn Monroe) and their own face, and told to remember the picture. After recovery from anesthesia, patients were presented with two pictures and asked to choose the one they had been shown earlier while under anesthesia. The two pictures were the original two images from which the morphed image had been generated (i.e., “self” and “celebrity,” but separate this time). All five patients whose left hemisphere had been inactivated chose the picture of self. Four of the five patients whose right hemisphere had been anesthetized chose the celebrity picture, instead (the fifth chose self). State what test you would use to determine whether the treatment (left vs. right hemisphere anesthesia) influenced recognition of self versus celebrity. Explain why you would choose this test (don’t carry out the test, just name it and justify your answer).

23. Eggebeen et al. (2010) found that men who at some point in their lives are fathers are more likely to have altruistic social relationships and be involved in community service organizations. This result was reported in the popular press (Jacobs 2009) as “Fatherhood… alters a man’s neurochemistry, increasing his ability to cope with stress and generally making him a better mate. Just-published research suggests the benefits of this transformation extend far beyond one’s immediate family and remain robust as the years go by.” Are the conclusions drawn by the popular press article defensible? Why or why not?

24. Male Drosophila become sterile when exposed to moderately high temperatures, because sperm are damaged by heat at much lower temperatures than other cells. Rohmer et al. (2004) asked whether flies from warmer climates are adapted to higher temperatures. They collected flies from France (where it is relatively cool) and India (relatively warm) to test the effects of temperature on sterility. In one procedure, they raised male flies from both locations at a high temperature, 30.5°C, and recorded whether the flies were sterile or fertile. Thirty-two out of 50 flies from France were sterile at this temperature, whereas 20 of 50 flies from India were sterile. a. Is this an observational or experimental study? b. Draw a graph to illustrate the association between sterility and source location (India vs. France). What association is suggested?
c. Is there evidence that the populations of flies from these two locations differ in their probability of sterility at this temperature? Do the appropriate hypothesis test. d. Estimate the relative risk of sterility at this temperature in the Indian population compared to the population in France (consider the Indian population to be the treatment group for this analysis). Include a 95% confidence interval.

25. Vampire bats, as their name implies, feed almost exclusively on blood. A bat must feed every day or it will starve to death, but bats are not always successful at finding a blood meal. Perhaps for this reason, they roost during the day in communal groups and sometimes share blood by regurgitative feeding. Researchers measured whether hungry bats were more likely to receive regurgitated blood than were partially fed bats (Wilkinson 1984). Eight bats were captured in the evening before they had fed and were held without feeding until the next morning. As a control, six bats were captured after naturally feeding at night, and they were also held until the following morning. At that time, the bats were returned to their groups. Five of the eight hungry bats were given regurgitated blood meals by groupmates, but none of the six well-fed bats were given a blood meal by other bats. What statistical test would you use to address the question, “Does the probability of being fed by roost-mates vary according to hunger status?”

26. We are often happy to do favors for other people when they have a particular need. For example, we are more willing to let someone use a photocopier when they ask, “Can I go in front of you, because I am in a rush?” than when they give no reason: “Can I go in front of you?” Some researchers believe that simply giving a reason, using the word “because,” may be enough to trigger this giving behavior, even when the reason is not a very good one. An experiment was done in which 60 people who were about to use a Xerox photocopy machine were approached (Langer et al. 1978). In 30 cases (randomly assigned; call these the “request only” group), an investigator asked, “May I use the Xerox machine?” Eighteen of these people allowed the investigator to go first. In the other group of 30 people (the “bad reason” group) an investigator said, “May I use the Xerox machine, because I have to make copies?” Of the 30, 28 allowed the investigator to go first. Test whether the bad-reason approach is better or worse than the request-only approach.

27. It is common wisdom that death of a spouse can lead to health deterioration of the partner left behind. Is common wisdom right or wrong in this case? To investigate, Maddison and Viola (1968) measured the degree of health deterioration of 132 widows in the Boston area, all of whose husbands had died at the age of 45–60 within a fixed six-month period before the study. A total of 28 of the 132 widows had experienced a marked deterioration in health, 47 had seen a moderate deterioration, and 57 had seen no deterioration in health. Of 98 control women with similar characteristics who had not lost their husbands, 7 saw a marked deterioration in health over the same time period, 31 experienced a moderate deterioration of health, and 60 saw no deterioration. Test whether the pattern of health deterioration was different between the two groups of women. Give the P-value as precisely as possible from the statistical tables, and interpret your result in words.

28. Spot the flaw. After golden monkeys fight with each other, opponents seem to reconcile.
Ren et al. (1991) tested whether the behavior of golden monkeys after spontaneous conflicts differed from behavior at normal times. They recorded a large number of behaviors between individuals in two troops with a total of nine monkeys, both after aggressive interactions and not after aggressive interactions (“control” periods). The observed frequencies of behaviors are given in the accompanying table. The team carried out a contingency test on these frequencies as given and rejected the null hypothesis that behaviors occurred at similar frequencies after conflicts compared to control periods. The expected values in this test are not large enough to justify a χ2 test, but can you identify another problem with this analysis?

Behavior       Post conflict   Control periods
Open-mouth           46               1
Embrace              39              19
Groom                33              20
Contact sit          18              26
Hold-hand             8               3
Hold-lumbar           7               0
Crouch               21               0

29. Psychologists were interested in whether there is a “denomination effect”: are people more or less likely to spend money if that money is in large bills or in lots of small bills? Such research may help give advice to people with problems controlling their spending, for example. The researchers stopped 50 people at a U.S. convenience store and asked them three simple questions; as a reward, they gave each person $5, either as a $5 bill or as five $1 bills. Of the 25 people who were given the larger bill, four spent some money in the convenience store. Of the 25 people who were given five $1 bills, six spent some money. a. Is there a significant difference in the probability of spending money depending on the denomination of the bills received? b. What is the relative risk of spending money for small bills compared to large bills? Provide a 95% confidence interval for the relative risk. What step would you take next in this research if you wanted to produce a narrower confidence interval for the odds ratio?

30. Brent et al. (1993) carried out a study to investigate the possible association between firearms in the home and adolescent suicide. How big is the risk? They obtained information on 67 adolescent suicide victims, 70% of whom had died by firearms. The researchers compared this group to a control sample of 67 adolescents of similar race, age, gender, childhood behavior scores, and socioeconomic status and living in the same cities. They found that in 51 of the suicide cases, guns were kept in the home, whereas this was true of 16 of the controls. a. Graph the association between suicide and the presence of guns in the home. What trend is suggested? b. What is the estimated odds ratio of suicide with guns in the home, compared to homes without guns? c. Provide a 95% confidence interval for the population odds ratio. d. Under what circumstances can this estimate of odds ratio be used to approximate the relative risk of suicide with guns in the home, compared to homes without guns?

31. Darwin suggested that plants pollinated by long-tongued insects would benefit by having long flowers, because greater length would cause the long-tongued insects to press themselves further into the flower to reach the nectar, increasing the chance that pollen is removed and that pollen from other flowers is received. Some plants have evolved “nectar spurs”: a long projection off the back of the flower that has the nectar at the base.
Several populations of the South African orchid, Disa draconis, evolved longer nectar spurs after switching pollinators from relatively short-tongued horseflies (tongue length about 30 mm) to long-tongued tanglewing flies (57 mm). To measure the advantage of the long spurs, Johnson and Steiner (1997) randomly selected 59 of 118 long-spurred flowers and, using yarn, shortened the length of their spurs to that found in populations pollinated by horseflies. The remaining 59 flowers retained their spurs at full length, but yarn was tied around the stigma as a control for any other effects of the yarn. One week later, the numbers of flowers that had or had not received pollen on their stigmas were recorded. Ten of the 59 flowers with shortened spurs had received pollen on their stigmas, whereas 27 of the 59 control flowers had received pollen. a. Illustrate these results in a graph. b. What is the estimated odds ratio of not receiving pollen after experimental shortening, as compared to control flowers? Provide a confidence interval for the population odds ratio.

32. Refer to Assignment Problem 31. If we carry out a χ2 contingency test to compare the proportion of flowers in the two treatments that received pollen on the stigma, we obtain the following results: χ2 = 11.38, df = 1, P = 0.0007. But the researchers also tested whether there was an effect of treatment on the proportion of plants that had pollen removed by pollinating insects. The results are as follows: χ2 = 6.38, df = 1, P = 0.012. Is the following statement a true or false interpretation of these findings? “The effect of spur shortening on pollen receipt is stronger than the effect on pollinia removal.” Explain your answer.

33. Kuru is a prion disease of the Fore people of highland New Guinea. Kuru is similar to Creutzfeldt–Jakob disease. It was once transmitted by the consumption of deceased relatives at mortuary feasts, a ritual that was ended by about 1960. Using archived tissue samples, Mead et al. (2009) investigated genetic variants that might confer resistance to kuru. The data in the accompanying table are genotypes at codon 129 of the prion protein gene of young and elderly individuals all having the disease. Since the elderly individuals have survived long exposure to kuru, unusually common genotypes in this group might indicate resistant genotypes.

           Genotypes at codon 129
Age        MM    MV    VV
Elderly    13    77    14
Young      22    12    14

a. Illustrate these data with a grouped bar graph. Which genotype(s) are especially prevalent in the elderly compared with young individuals? b. Test whether genotype frequencies differ between the two age groups.

Review Problems 1

1. A scientist sets up an experiment with Drosophila that requires her to make a measurement on each fly every day until all the flies are dead. She knows that flies in her stock typically die at a rate of 3% per day, and she is willing to assume that probability of death is the same for each fly and is constant throughout its life. She is also willing to assume that the flies die independently of each other. She starts the experiment with 50 flies, 80 days before she plans to go on vacation. a. What is the probability that an individual fly survives a given day? b. What is the probability that an individual fly survives for 80 days? c. What is the probability that the experiment will be finished (i.e., that all 50 flies will have died) before her vacation?
2. A study of 6,839,854 births in the United States found a total of 6522 babies were born with a finger defect, either syndactyly (fused fingers), polydactyly (extra fingers), or adactyly (fewer than five fingers). Researchers examined 5171 of these babies with finger defects in further detail. Of these babies, 4366 had mothers that did not smoke while pregnant, and the rest had mothers that did smoke while pregnant. In a sample of 10,342 babies from the population with normal fingers, 9062 of their mothers did not smoke while pregnant, while mothers of the remaining 1280 did smoke while pregnant. Answer the following questions to establish the magnitude of the effect of smoking on the occurrence of finger defects. a. Why is this considered to be an observational study rather than experimental? What type of observational study is it? b. Use a graph to show the association between smoking during pregnancy and finger defects. c. What is the 95% confidence interval for the proportion of babies born in the United States with one of these finger defects? Is the defect common or rare? d. What is the odds ratio of these birth defects, comparing smoking to nonsmoking mothers? Provide a 95% confidence interval for the odds ratio. Based on your results, which group of mothers has the higher odds of having a child with a finger defect? e. Is it justified to consider your estimate of odds ratio in part (d) also to be an estimate of the relative risk? Explain.

3. The study of the spatial distribution of vegetation often makes use of random samples of “quadrats,” rectangular plots of fixed size placed at random over the sampling region (e.g., a field or forest). The number of plants of each type occurring within each quadrat is then counted. In one such study, an investigator counted the number of white pine seedlings growing in eighty 10 m × 10 m quadrats to test whether the distribution of pine seedlings in the forest was random, clumped, or dispersed. She obtained the following counts:

Number of seedlings    0    1    2    3    4    5    6    ≥7    Total
Number of quadrats    47    6    5    8    6    6    2     0       80

a. If the null hypothesis of a random distribution of pine seedlings across the forest is correct, to what theoretical probability distribution should the observed frequencies of quadrats containing a given number of seedlings conform? b. Carry out a formal test of the null hypothesis. c. If the null hypothesis is rejected in part (b), determine whether the spatial distribution of seedlings is clumped or dispersed.

4. Seeds fall on a landscape that contains 70% barren rock. The remainder of the landscape is suitable habitat for the seeds to germinate. Assume that the location where a seed falls in this landscape is random with respect to the site’s suitability for germination. a. What is the probability that a single seed lands on a suitable site for germination? b. If two seeds fall independently, what is the probability that both fall on suitable habitats? c. Suppose that three seeds fall randomly and independently onto this landscape. Use a probability tree to find the probability that exactly two of these three seeds land on suitable habitat.

5. Refer to Practice Problem 13 in Chapter 6. The researchers additionally examined the rate of increase in oxyhemoglobin flow with increasing neuronal activity in the somatosensory cortex of rat brains. They estimated a rate of increase of 1.48, with a 95% confidence interval for the population rate of 1.06 < rate < 1.91.
A rate of one is expected if the relationship between the two variables is linear, whereas a value different from one indicates a nonlinear relationship. The researchers tested the null hypothesis that the relationship between the variables is linear (H0: rate = 1). Did their test reject the null hypothesis or not? Explain. 6. A researcher wished to estimate the fraction of students in a large high school who have used illegal drugs on at least one occasion. He carried out his survey through interviews with students. To satisfy legal and ethical guidelines, it was necessary to interview students only when a parent was present. Only eight students showed up to participate. a. Is this study likely to yield a biased estimate of illegal drug use or an unbiased estimate? Why? b. What effect does such a small sample size have on sampling error of an estimate, compared with a larger sample? 7. When one generation reproduces to form the next, the frequencies of alleles in the population can change by chance from generation to generation in a process called random genetic drift. An experiment was carried out using a very large laboratory population of the common fruit fly, Drosophila melanogaster, in which two eye-color alleles (versions of a gene) were present, one red (the norm for these flies) and the other brown. The frequency of the red allele in this base population was 0.5. The researchers created a new group of flies containing 32 alleles by randomly sampling flies from this large population (Buri 1956). They created a second new group of 32 alleles by sampling again from the base population. They repeated this procedure until there were a large number of new populations, all containing 32 alleles. The frequency distribution of the proportions of red-eyed alleles in the new groups closely matched a binomial distribution with n = 32 and p = 0.5. a. On average, what do we expect the mean proportion of the red-eyed allele to be in the new groups? b. What should be the standard deviation among groups in the proportion of red-eyed alleles? c. If a randomly chosen new group has 60% red-eyed alleles, what proportion of its alleles are for brown eyes? d. What is the probability that a randomly chosen group will have exactly 50% red-eyed alleles? e. What is the probability that a randomly chosen group will have 30 or more red-eyed alleles? 8. In one of the many new groups from the Drosophila eye-color study described in Review Problem 7, there were 9 brown-eyed alleles out of 32 alleles. In another, there were 26 brown-eyed alleles out of 32. a. Using the sample of alleles in each of these two new groups, separately estimate the proportion of brown-eyed alleles in the overall population of eye-color alleles. Calculate the 95% confidence intervals for the overall proportion. b. Do these confidence intervals overlap? How can you reconcile this with the fact that both samples came from the same population? 9. Some people who are having a heart attack do not experience chest pain, although most do. A study of people admitted to emergency rooms with heart attacks compared the death rates of people who had chest pains with those of people who did not have chest pains (Brieger et al. 2004). Of the 1763 people who had heart attacks without chest pain, 229 died, while of the 19,118 people who had heart attacks with chest pain, 822 died. a. Test whether the presence or absence of chest pains during a heart attack is associated with the probability of death. b. 
Give a valid estimate of the magnitude of the association between chest pain and risk of death. Put bounds on your estimate.

10. The following data are scores reflecting the number of young cod that recruited (grew to the catchable size) to the North Sea population in different years (Beaugrand et al. 2003). Scores are adjusted magnitudes (without units) rather than actual fish numbers. The measurements are listed below and arranged from low to high rather than by year.

5.0, 5.1, 5.2, 5.2, 5.2, 5.2, 5.2, 5.2, 5.3, 5.3, 5.3, 5.3, 5.4, 5.4, 5.4, 5.4, 5.4, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.6, 5.6, 5.6, 5.7, 5.7, 5.7, 5.7, 5.7, 5.8, 5.8, 5.8, 5.9, 5.9, 6.0, 6.0

a. What was the mean score for number of recruits over this period? b. What was the standard deviation in the score of number of recruits? c. In what fraction of years did the score for the number of recruits fall within two standard deviations of the mean?

11. When the performances of individuals, or preferences for products, are judged in sequence by subjective criteria (think music audition or dance competition), does position in the sequence affect the opinion of the judges? An experiment to look for these order effects (Mantonakis et al. 2009) gave 33 volunteers four glasses of wine in sequence, one at a time. Participants were asked to say which of the four was the superior wine. Unknown to the participants, all four glasses were poured from the same bottle. Fifteen participants preferred the first glass, 5 preferred the second glass, 2 preferred the third glass, and 11 preferred the last glass. Is there evidence from these data that position in the sequence affected the preference of the volunteers?

12. Smoking is a major risk factor for a number of diseases, including strokes. Smoking is particularly dangerous for people who have already had a stroke. One way to help such people quit smoking is to provide smoking-cessation drugs to stroke patients free of charge. Papadakis et al. (2011) investigated the effectiveness of this idea. They randomly divided a sample of stroke patients who smoked into two groups. One group of 12 patients received the normal advice and prescription for anti-smoking drugs, whereas the other group of 13 patients got the same advice and prescription but were also provided the drugs cost-free. After 6 months, five members of the cost-free group had quit smoking, while only two members of the other (control) group had quit by that time. a. Calculate the relative risk of smoking for the cost-free program compared with the controls (prescription only). b. What statistical method could be used to test for a difference between these two groups in the efficacy of the cost-free anti-smoking program? c. The 95% confidence interval for relative risk for this study ranges from 0.45 to 1.22. In light of this result, what do you think is the greatest weakness of this study?

13. Birds of the Caribbean islands of the Lesser Antilles are descended from immigrants originating from larger islands and the nearby mainland. The data presented here are the approximate dates of immigration, in millions of years, of each of 37 bird species now present on the Lesser Antilles (Ricklefs and Bermingham 2001). The dates were calculated from the difference in mitochondrial DNA sequences between each of the species and its closest living relative on larger islands or the mainland.
0.00, 0.00, 0.04, 0.21, 0.29, 0.54, 0.63, 0.88, 0.96, 1.25, 1.67, 1.75, 1.84, 1.96, 2.01, 2.51, 2.72, 3.30, 3.51, 4.05, 4.85, 6.94, 8.73, 10.57, 11.11, 12.45, 14.00, 17.30, 17.92, 18.05, 18.43, 22.48, 22.48, 23.48, 26.32, 26.45, 28.87

a. Plot the data in a histogram and describe the shape of the frequency distribution. b. By viewing the graph alone, approximate the mean and median of the distribution. Which should be greater? Explain your reasoning. c. Calculate the mean and median. Was your intuition in part (b) correct? d. Calculate the first and third quartiles and the interquartile range. e. Draw a box plot for these data.

14. The MathWorld web page on hypothesis testing declares that “Hypothesis testing is the use of statistics to determine the probability that a given hypothesis is true” (Weisstein 2014). Is this statement true or false? Explain.

15. Spot the flaw. In a newer study of “high-rise syndrome” (see Chapter 1), Vnuk et al. (2004) reported injury rates of 119 fallen cats brought to a veterinary clinic in Zagreb, Croatia. The following graph indicates the sex of the cats brought to the clinic. a. Identify at least two of the four principles of good graph design that are violated in this diagram. Explain your answer. b. Pie charts are not regarded as the best way to illustrate a frequency distribution. What preferred graphical method could be used here instead?

16. For each of the following scenarios, state a good graphical method to display the data, and state the most appropriate statistical test to address the question. a. Do whales swim past a detector with equal probability over time and independently of each other? The data are measurements of the number of whales per one-hour blocks of time. b. Do men and women have the same probability of contracting basal carcinoma? Assume that the number of men and women in the study is very large. c. Do men and women have the same probability of contracting basal carcinoma? Assume that the number of men and women in the study is quite small. d. Leafcutter ants cut pieces of leaves and carry them back to their nest. Sometimes small ants called minims will ride on the leaves carried by other workers. Do minims occur on the leaves independently and with equal probability? The data are the number of minims on a sample of leaves. e. Is the proportion of patients developing a skin rash within two weeks of treatment the same between patients taking a new drug and patients taking a placebo? Assume that 20% of patients overall get a rash and that there are 100 patients in each treatment group. f. Each person in a sample is independently presented with two cookies that are prepared identically, except that one includes trans fats while the other is made with a healthier alternative. Is there a preference in the population of people for one or the other type of cookie? The data are the numbers of people preferring each type of cookie. g. Does the frequency of cases of leukemia in a small town differ from the known national average?

10 The normal distribution

Crab spider, Thomisus spectabilis

Measurements of continuous numerical variables abound in biological data. We measure the length and weight of babies, the velocities of swallows, the times between infection and the onset of symptoms, the numbers of cones on pine trees, etc. Typically we take these measurements on a sample of individuals, but we want to be able to make inferences about the continuous variables in the population.
For example, “What is a 95% confidence interval for the mean birth weight of American babies?” Or, “Which is faster, on average, an African or a European swallow?” To answer these kinds of questions, we need to understand something about the probability distributions that these data are taken from. The normal distribution, which we introduced in Section 1.4, is the queen of all probability distributions used in the analysis of biological data. It is the ubiquitous bell-shaped curve that can be used to approximate the frequency distribution of so many biological variables. The normal distribution arguably describes more about nature than any other mathematical function; thus, it takes a preeminent role in biological statistics. Even more important than its ability to approximate frequency distributions of data, the normal distribution can be used to approximate the sampling distribution of estimates, especially of sample means. Many statistical techniques have been developed for dealing with variables that have a normal sampling distribution. In the rest of this book, we focus on these techniques. This chapter describes the normal distribution and explains some of the reasons it is so important.

Bell-shaped curves and the normal distribution

Many numerical variables have frequency distributions that are bell shaped. For example, Figure 10.1-1 is a histogram of the birth weights of the 4,017,264 singleton births recorded by birth certificate in the United States in 1991.

FIGURE 10.1-1 Frequency distribution of the birth weight of babies born in the United States in 1991 (Clemons and Pagano 1999).

Examine the shape of the baby weight distribution. Notice in Figure 10.1-1 that the peak (i.e., the mode) is at the center of the distribution. If we averaged all of these 4,017,264 data points, we would find the mean to be 3339 grams, which is also at the mode. The distribution is clearly shaped like a symmetrical bell. There are so many data points, and the interval widths in the histogram are so narrow, that the distribution looks almost like a smooth curve. If we collected more and more measurements and used even narrower intervals, then the graph would become even smoother. The theoretical probability distribution describing many bell curves is called the normal distribution.

The normal distribution is a continuous probability distribution, which means that it describes the probability distribution of a continuous numerical variable. It is symmetric around its mean. The further values are from the mean, the lower the probability density of observations. The normal distribution has two parameters to describe its location and spread: the mean and the standard deviation. For example, Figure 10.1-2 shows the normal distribution having the same mean and standard deviation as the baby birth weights. It strongly resembles the frequency distribution of the real data.

FIGURE 10.1-2 The normal distribution for a variable Y with mean and variance equal to that in the baby birth weight data.

The scale on the vertical axis of Figure 10.1-2 is different from that in Figure 10.1-1, because the normal distribution shows the probability density, whereas the data are expressed as counts. To find the expected relative frequency of a particular bin of the histogram, we would integrate the normal probability density from the lower bound of the bin to the upper bound.
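This integration is one line of code in practice. The sketch below is an editorial supplement (assuming Python with scipy); the mean comes from the text, but the standard deviation is a made-up value used only to illustrate the calculation.

    from scipy.stats import norm

    mu = 3339.0    # mean birth weight in grams, from the text
    sigma = 530.0  # hypothetical standard deviation, for illustration only

    # Expected relative frequency of the histogram bin from 3000 g to 3100 g
    rel_freq = norm.cdf(3100, loc=mu, scale=sigma) - norm.cdf(3000, loc=mu, scale=sigma)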
The integral of the normal distribution from minus infinity to infinity equals one, because the probabilities of all possible outcomes are accounted for. Figure 10.1-3 includes some other examples of data whose frequency distributions resemble the normal curve: the body temperatures of adult humans, the brain sizes of undergraduate students, and the number of bristles on a fly’s abdomen. In each case, we have superimposed the normal distribution with the same mean and standard deviation as the data. Bell-shaped frequency distributions appear in nature all the time, and the normal distribution is an excellent approximation to these real distributions. The number of fly bristles is actually a discrete variable; but, with a large number of possible values for this variable, it is still well approximated by a normal distribution.

FIGURE 10.1-3 The normal distribution approximates frequency distributions in nature. (a) Human body temperature, in degrees Fahrenheit (Shoemaker 1996). (b) University undergraduate brain size (measured in number of megapixels on an MRI scan) (Willerman et al. 1991). (c) The number of bristles on the fourth and fifth segments of the abdomens of fruit flies (Falconer and MacKay 1995). The black lines show normal distributions with the same mean and standard deviation as measured in the data.

Biostatistics makes great use of the normal distribution. The statistical methods in most common use assume that the data come from a normal distribution of measurements. Moreover, as we explain in Section 10.6, the normal distribution can describe some properties of samples for variables that aren’t themselves normally distributed.

The normal distribution is a continuous probability distribution describing a bell-shaped curve. It is a good approximation to the frequency distributions of many biological variables.

The formula for the normal distribution

You may rarely have reason to use the formula for the probability density of the normal distribution, but you will often use calculations derived from it. The formula is

f(Y) = (1 / sqrt(2πσ^2)) e^(−(Y − µ)^2 / (2σ^2)).

This gives the probability density f(Y) for a value Y. The value Y can be any real number, ranging from negative infinity to positive infinity; µ is the mean of the distribution; and σ is the standard deviation. The formula also includes the irrational constants π = 3.1415... and e = 2.7182... (the base of the natural logarithm). The mean can take any real value, and the standard deviation can take any positive value. Thus, the “normal distribution” is really an infinite number of distributions, each having its own mean and standard deviation.

Properties of the normal distribution

The normal distribution has the following features that are worth remembering:
■ It is a continuous distribution, so probability is measured by the area under the curve rather than the height of the curve.
■ It is symmetrical around its mean.
■ It has a single mode.
■ The probability density is highest exactly at the mean.

This final feature, and the symmetry of the normal distribution, together imply that the mean, median, and mode are all equal to each other for the normal distribution. There are some useful rules of thumb about areas under the normal curve. About two-thirds (68.3%, to be more exact) of the area under the normal curve lies within one standard deviation of the mean. In other words, the probability is 0.683 that the measurement of a randomly chosen observation drawn from a normal distribution falls between µ − σ and µ + σ, as shown in Figure 10.3-1.

FIGURE 10.3-1 The probability that a randomly drawn measurement from a normal distribution is within one standard deviation of the mean is approximately 2/3. (More precisely, 0.683 of the observations fall within one standard deviation of the mean.)

Ninety-five percent of the probability of a normal distribution lies within about two standard deviations of the mean (more precisely, within 1.96 standard deviations). In other words, the probability is 0.95 that the measurement of a randomly chosen observation drawn from a normal distribution falls between µ − 1.96σ and µ + 1.96σ, as shown in Figure 10.3-2.

FIGURE 10.3-2 The probability is 0.95 that a randomly drawn measurement from the normal distribution is within approximately two standard deviations of the mean (more precisely, exactly 95% of measurements lie within 1.96 standard deviations of the mean).

For a variable with a normal distribution, about two-thirds of individuals are within one standard deviation of the mean, and about 95% are within two standard deviations of the mean.
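Both rules of thumb can be verified numerically. As an editorial supplement (assuming Python with scipy), the following lines compute the probability mass within one and within 1.96 standard deviations of the mean of a standard normal distribution.

    from scipy.stats import norm

    within_1_sd = norm.cdf(1) - norm.cdf(-1)          # = 0.683
    within_196_sd = norm.cdf(1.96) - norm.cdf(-1.96)  # = 0.950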
In other words, the probability is 0.683 that the measurement of a randomly chosen observation drawn from a normal distribution falls between µ − σ and µ + σ, as shown in Figure 10.3-1.

FIGURE 10.3-1 The probability that a randomly drawn measurement from a normal distribution is within one standard deviation of the mean is approximately 2/3. (More precisely, 0.683 of the observations fall within one standard deviation of the mean.)

Ninety-five percent of the probability of a normal distribution lies within about two standard deviations of the mean (more precisely, within 1.96 standard deviations). In other words, the probability is 0.95 that the measurement of a randomly chosen observation drawn from a normal distribution falls between µ − 1.96σ and µ + 1.96σ, as shown in Figure 10.3-2.

FIGURE 10.3-2 The probability is 0.95 that a randomly drawn measurement from the normal distribution is within approximately two standard deviations of the mean (more precisely, exactly 95% of measurements lie within 1.96 standard deviations of the mean).

For a variable with a normal distribution, about two-thirds of individuals are within one standard deviation of the mean, and about 95% are within two standard deviations of the mean.

The standard normal distribution and statistical tables

A normal distribution with a mean of zero and standard deviation of one is called a standard normal distribution.2 Figure 10.4-1 is a plot of the standard normal. We use the symbol Z to indicate a variable having a standard normal distribution.

The standard normal distribution is a normal distribution with a mean of zero and a standard deviation of one.

FIGURE 10.4-1 The standard normal distribution.

Using the standard normal table

Unlike the Poisson and binomial distributions, the probability that a given event occurs when sampling from a normal distribution is difficult to compute by hand, because it requires integration of a complicated function. Instead, we use statistical tables or computers3 to obtain probabilities under the normal curve. Statistical Table B in the back of this book gives us the probability that a random draw from a standard normal distribution is above a given cutoff value. For example, if we drew a number at random from a standard normal distribution, the probability is 0.025 that it would be greater than 1.96. Table 10.4-1, an excerpt from Statistical Table B, shows how we obtained this number.

TABLE 10.4-1 Probabilities of Z > a.bc under the standard normal curve. The digits before and immediately after the decimal (i.e., a.b) are given down the first column, and the second digit after the decimal (i.e., c) is given across the top row. The answer highlighted in red shows the probability that Z > 1.96. Excerpted from Statistical Table B.

                          Second digit after decimal (c)
First two
digits (a.b)   0      1      2      3      4      5      6      7      8      9
1.5         0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559
1.6         0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455
1.7         0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
1.8         0.0359 0.0352 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
1.9         0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
2.0         0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183
2.1         0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143
2.2         0.0139 0.0136 0.0132 0.0129 0.0126 0.0122 0.0119 0.0116 0.0113 0.0110

Table 10.4-1 has an unusual layout.
It provides the probability that Z is greater than a given three-digit cutoff number “a.bc,” where “a,” “b,” and “c” refer to digits of the number. To find the probability, begin by finding the first two digits of the cutoff number (a.b) in the left-hand column of the table. Then find the second digit of the number after the decimal place (c) across the top row. The probability that Z is greater than a.bc is given in the corresponding cell for that row and column. To find the probability Pr[Z > 1.96], for example, we would look down the left-hand column for a.b = 1.9 and then go across to the column that corresponds to c = 6 to fill in the last decimal. We find that the probability of a random draw from a standard normal distribution greater than 1.96 is 0.025 (the number shown in red in Table 10.4-1). This probability is the area under the curve to the right of a standard normal Z of 1.96 (see Figure 10.4-2).

FIGURE 10.4-2 The area under the standard normal curve greater than Z = 1.96 is 0.025. The area in red is given in Statistical Table B, which is excerpted in Table 10.4-1.

Statistical Table B gives us the probability that Z is greater than a given positive number. Probabilities corresponding to negative Z values are not included. Recall, however, that all normal distributions are symmetric around their means, and the standard normal distribution is symmetric around µ = 0. This means that

Pr[Z < −number] = Pr[Z > number].

Thus, the probability that a random observation from the standard normal distribution is less than −1.96 is the same as the probability that an observation is greater than 1.96:

Pr[Z < −1.96] = Pr[Z > 1.96] = 0.025.

The probability that Z lies between a lower bound and an upper bound can be calculated in two steps, as shown in Figure 10.4-3. First, use Statistical Table B to find Pr[Z > lower bound] and Pr[Z > upper bound]. Then, calculate the difference between these two probabilities:

Pr[lower bound < Z < upper bound] = Pr[Z > lower bound] − Pr[Z > upper bound].

FIGURE 10.4-3 Calculating the area under the standard normal curve between a lower bound and an upper bound.

Using the standard normal to describe any normal distribution

There are an infinite number of normal distributions, but they are all similar in shape. This allows us to use a simple transformation to obtain probabilities under any normal distribution. To do so, we calculate how many standard deviations a particular value is away from the mean:

Z = (Y − µ)/σ.

This standardized Z value is called a standard normal deviate.4 The formula converts Y, which has a normal distribution with mean µ and standard deviation σ, to Z, which has a standard normal distribution. Look at this formula for a second. The numerator, Y − µ, tells us how far away Y is from its mean, measured in the original units. If this value is negative, then Y is less than the mean; and if it is positive, then Y is greater than the mean. If we now divide this value by σ, then we can calculate how far Y is from its mean as measured by the number of standard deviations.

A standard normal deviate, or Z, tells us how many standard deviations a particular value is from the mean.

The probability of obtaining a measurement that is Z standard deviations from the mean is the same for all normal distributions, including the standard normal distribution. This means that we can use the table of probabilities for the standard normal distribution (Statistical Table B) to find probabilities for any normal distribution. Example 10.4 shows how the Z-standardization can be used.
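These table lookups are easy to reproduce by computer. The short Python sketch below (assuming the scipy library; the cutoffs 1.0 and 2.0 are our own illustrative choices) reproduces the right-tail probabilities that Statistical Table B tabulates, along with the symmetry and two-step rules just described.

from scipy.stats import norm

# Right-tail probability, as tabulated in Statistical Table B
print(norm.sf(1.96))                 # Pr[Z > 1.96] = 0.025

# Symmetry rule: Pr[Z < -1.96] equals Pr[Z > 1.96]
print(norm.cdf(-1.96))               # also 0.025

# Two-step rule for the probability between a lower and an upper bound
print(norm.sf(1.0) - norm.sf(2.0))   # Pr[1.0 < Z < 2.0], approx 0.136

# Z-standardization: standard deviations between a value y and the mean
def z_score(y, mu, sigma):
    return (y - mu) / sigma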
EXAMPLE 10.4 One small step for man?

NASA excludes anyone under 62 inches in height and anyone over 75 inches from being an astronaut pilot (NASA 2004). In metric units,5 these values for the lower and upper height restrictions are 157.5 cm and 190.5 cm, respectively. What fraction of the young adult American population is excluded from being an astronaut pilot by these height constraints?

The distribution of adult heights within a sex and age group is reasonably well approximated by a normal distribution, with the mean and standard deviation for 20- to 29-year-old males in America given by 177.6 cm and 9.7 cm, respectively (McDowell et al. 2008). For 20- to 29-year-old American females, the mean height is 163.2 cm with a standard deviation of 10.1 cm. We can use the standard normal distribution to calculate the proportion of individuals who are made ineligible for astronaut pilot training because of their height alone. We will start with the calculation for males.

Let’s be very specific about what we are trying to do: we want to know the probability that a 20- to 29-year-old male individual has a height that is either less than 157.5 cm or greater than 190.5 cm:

Pr[Height < 157.5 or Height > 190.5].

It becomes much easier to answer the question if we make a sketch of what we are looking for (Figure 10.4-4).

FIGURE 10.4-4 A sketch can help you keep track of the areas under the curve that you are trying to find. This rough sketch does not show the exact areas, but it properly orients the mean and the values we care about. The maximum height (190.5 cm) is above the mean (177.6 cm), and the minimum height (157.5 cm) is below the mean.

Based on the drawing in Figure 10.4-4, the outcomes “Height < 157.5” and “Height > 190.5” are mutually exclusive. By the addition rule, therefore, we can determine the fraction of American males whose height excludes them from being an astronaut pilot by summing the two parts:

Pr[Height < 157.5 or Height > 190.5] = Pr[Height < 157.5] + Pr[Height > 190.5].

Let’s start with the second part first: what’s the probability that an American male is too tall for NASA’s restrictions? In other words, what is the probability that an American adult male is taller than 190.5 cm? The first step is to convert Height = 190.5 to a standard normal deviate, using the mean (177.6 cm) and standard deviation (9.7 cm) of American male height:

Z = (190.5 cm − 177.6 cm) / (9.7 cm) = 1.33.

That is, a value of 190.5 cm occurs at 1.33 standard deviations above the mean male height. What fraction lies above this point? In other words, what is Pr[Z > 1.33]? Using Statistical Table B at the end of this book, we can see that 0.09176 of the area under the standard normal distribution lies more than 1.33 standard deviations above the mean. So, this is our answer to this part of the problem: the fraction 0.09176 of American adult males are taller than 190.5 cm.

Finding the probability of males being too short for NASA’s restrictions follows a similar procedure, but with one additional step. Again we convert Height = 157.5 (the minimum cutoff) to a standard normal deviate:

Z = (157.5 cm − 177.6 cm) / (9.7 cm) = −2.07.

In other words, 157.5 cm occurs at 2.07 standard deviations below the mean male height. What fraction of American adult males lies below this point? What is Pr[Z < −2.07]? Statistical Table B gives us the probability of obtaining a value greater than a given number. To find the probability of getting a value less than a particular number, remember that the normal distribution is symmetric around its mean.
Thus, the probability of being 2.07 standard deviations or more below the mean is the same as the probability of being 2.07 standard deviations or more above the mean: Pr[Z < −2.07] = Pr[Z > 2.07]. When we look up the probability of being greater than 2.07 in Statistical Table B, we find that Pr[Z > 2.07] = 0.01923. Thus, the fraction 0.01923 of American adult males are shorter than 157.5 cm.

Now, use the addition rule to determine the proportion of American adult men that are excluded from the astronaut pilot program by the height restrictions:

Pr[Height < 157.5 or Height > 190.5] = 0.01923 + 0.09176 = 0.11099.

The fraction 0.11099 (or 11.1%) of all 20- to 29-year-old American adult males are excluded by height. In other words, the NASA height restriction excludes a modest percentage of men.

We can make similar calculations for American adult women. Women’s heights are approximated by a normal distribution with mean 163.2 cm and standard deviation 10.1 cm. These values correspond to height restrictions at 0.56 standard deviations below the mean and 2.7 standard deviations above the mean. (Check these numbers for yourself.) Using Statistical Table B, the fraction 0.28774 of the women are too short to meet NASA’s guidelines and the fraction 0.00347 are too tall. In total, the height restrictions exclude 29.1% of young American women from being astronaut pilots.

The normal distribution of sample means

One of the most important facts about the normal distribution is that it can be used to describe the sampling distribution of many estimates, including the sample mean. The sampling distribution of an estimate lists all the values that we might obtain when we sample a population and describes their probabilities of occurrence (Section 4.1). If a variable Y has a normal distribution in a population, then the distribution of sample means Y¯ is also normal. For example, Figure 10.1-2 shows the normal distribution that best matches the distribution of human birth weights in the United States. This distribution is normal with mean µ = 3339 g and standard deviation σ = 573 g. Suppose we took a single random sample of 10 babies from this distribution and obtained Y¯ = 3084 g. This is not equal to the true mean (3339 g) because of chance, or sampling error. The estimate based on any particular sample is influenced by who happened to get sampled and who did not, which is a random process. Each single sample will have a mean that by chance is different from the true mean; if we took a different sample, it would differ from the truth in a different way. The different values for Y¯ that we might have obtained, and their associated probabilities, make up the sampling distribution for Y¯.

The fact that sample means are normally distributed, whenever the population itself is normal, is a huge time-saver. In Chapter 4, we obtained the sampling distribution for Y¯ by going back to the population and taking a vast number of random samples, each of the same size n, calculating Y¯ for each sample, and plotting the results. However, we are spared the trouble if Y has a normal distribution in the population, because in this case the distribution of sample means is also normally distributed, as shown in Figure 10.5-1 (we also plot the distribution of individual data points for comparison).

FIGURE 10.5-1 The normal distribution of sample means (Y¯, in red) for samples of size n = 10 from the normal distribution describing the population of baby birth weights in Figure 10.1-2 (shown here in black for comparison).
The distribution of sample means has the same mean as the individual values, but with a smaller standard deviation. Figure 10.5-1 shows that the mean of the sampling distribution of Y¯ is also 3339 g, the same as µ, the mean of Y itself. The mean of the sampling distribution of Y¯ always equals the mean of the original distribution (µ). In other words, the sample mean based on a random sample from a normal distribution gives an unbiased estimate of µ.

The standard deviation of the sampling distribution for Y¯ is known as the standard error of the mean (Section 4.2). It is symbolized as σY¯ and is equal to the standard deviation of Y divided by the square root of the sample size (n):

σY¯ = σ/√n.

This equation is correct even when Y does not have a normal distribution. The standard error describes the typical amount of sampling error when estimating the mean, so it measures the precision of the estimate. Increasing sample size reduces sampling error and hence increases the precision of an estimate, as we saw in Chapter 4. By averaging data from more babies, the sampling distribution based on samples of 100 babies is less “noisy” (has a smaller sampling error) than the sampling distribution generated from only 10 babies at a time.6 The sampling distribution of Y¯ would be even narrower if we used a larger sample size, as shown in Figure 10.5-2.

FIGURE 10.5-2 The distributions of sample means based on sample sizes of n = 10 (red), n = 100 (blue), and n = 1000 (black). (Note the change in scale from the previous figures.)

In later chapters, we will use the fact that the sampling distribution of Y¯ is normal when Y is normal to calculate exact confidence intervals for population means and exact P-values when testing hypotheses about means.

Calculating probabilities of sample means

The standard normal distribution can be used to calculate the probability of drawing a sample with a mean in a given range, assuming we know the true values of the mean and the standard deviation of the population from which the sample was taken. Given that the distribution of sample means is normal with mean µ and standard deviation σY¯ = σ/√n, then

Z = (Y¯ − µ)/σY¯

has a standard normal distribution, just like any normally distributed variable. The quantity σY¯ is the standard deviation of the sampling distribution of Y¯, also known as the standard error of the mean. This Z tells us that the sample mean Y¯ is Z standard errors above the true mean µ.

For example, let’s take a random sample of n = 80 babies from the population of babies introduced in Section 10.1. In this population, weights are normally distributed with mean µ = 3339 g and standard deviation σ = 573 g. What is the probability that the sample mean is at least 3370 g? In other words, what is Pr[Y¯ > 3370]? For this example, σY¯ is

σY¯ = σ/√n = 573/√80 = 64.1 g.

From this we calculate the Z-score:

Z = (Y¯ − µ)/σY¯ = (3370 − 3339)/64.1 = 0.48.

In other words, Pr[Y¯ > 3370] is the same as Pr[Z > 0.48]. Using Statistical Table B, we find that Pr[Z > 0.48] = 0.316. About 31.6% of samples of size n = 80 from this population will have a sample mean of 3370 g or larger.

Central limit theorem

We have seen that the means of samples drawn from a normal distribution are themselves normally distributed. Another reason for the importance of the normal distribution in data analysis is that the sampling distribution of sample means Y¯ is approximately normal even when the distribution of individual data points is not normal, provided the sample size is large enough.
This astonishing fact is known as the central limit theorem. How large a sample is large enough depends on the shape of the distribution of observations in the population: the more similar the original distribution is to the normal, the smaller the sample size required to yield a distribution of sample means that is well approximated by the normal distribution.

According to the central limit theorem, the sum or mean of a large number of measurements randomly sampled from a non-normal population is approximately normally distributed.

Example 10.6 illustrates the effect that increasing the sample size has on the sampling distribution for the mean, even when the population is highly non-normal.

EXAMPLE 10.6 Young adults and the Spanish flu

Between 1918 and 1920, a strain of influenza misnamed the Spanish flu swept across the world, killing millions. This epidemic, which overlapped with World War I, caused more than twice as many deaths as the war itself. One unusual feature of the Spanish flu was that it mainly killed young adults, whereas most influenzas endanger mainly the very young and the very old. This caused an atypical pattern in the frequency distribution of age at death, as shown in Figure 10.6-1. The data in the figure are from 1918 Switzerland, which was not involved in the war, and so represent deaths from Spanish flu in addition to all the usual causes. A total of 75,034 people are represented in this graph. This distribution is unusual because it has three peaks. The large spike at age 0 corresponds to infant mortality. Another wide peak, at approximately 75 years of age, shows elevated mortality of the elderly. In a typical year around that time, these would have been the only two peaks. During 1918–1920, the Spanish flu caused a third peak of mortality for people in their twenties and thirties.

FIGURE 10.6-1 The frequency distribution of age at death in Switzerland in 1918 during the Spanish flu epidemic.

The frequency distribution in Figure 10.6-1 is extremely non-normal. It has three peaks rather than just the one we would expect with the normal distribution, and the distribution of the data is highly asymmetric. Let’s use this distribution to illustrate the central limit theorem. We will look at the sampling distribution of the sample mean for a range of samples of increasing size. The distribution of sample means is displayed in Figure 10.6-2 for the range of sample sizes n = 1, 2, 4, and 64.

FIGURE 10.6-2 The frequency distribution of sample means for samples of increasing size. Each histogram displays the means of a large number of repeated samples of size n drawn from the distribution of age at death in Figure 10.6-1. The scale and range of the axes change from graph to graph.

The top left graph in Figure 10.6-2 plots the mean of random samples of size n = 1, which simply re-creates the original distribution. That is the distribution of individuals in this population. Let’s now look at the distribution of sample means when we take a sample of two or more individuals from the same population. The top right histogram plots the sample means of many samples of size n = 2. The numbers shown were generated by taking a random sample of two individuals from the population and computing their average. This process was repeated many times, and the distribution of the computed averages is shown. Notice how different this frequency distribution is from that of the data.
The standard deviation is lower than in the data, and even with such a tiny sample size, the graph already looks a lot more bell shaped. As n increases (in other words, as we take larger samples), the distribution of sample means starts to resemble a normal distribution. For example, when n = 4, the overall shape of the distribution is similar to a normal distribution, but there are differences in the details. The fit gets better and better as sample size increases. At some point, the sample size is large enough that the sampling distribution becomes almost identical to a normal distribution. For the Swiss 1918 age at death data, the distribution of sample averages is nearly indistinguishable from a normal distribution when n = 64 (the bottom right histogram in Figure 10.6-2). It is difficult to specify the sample size needed to reach this point for any particular case. For other data whose frequency distribution is even more different from the normal distribution than the Swiss mortality data, larger samples are required before the sample means converge on the normal distribution.

The most powerful statistical methods available for analyzing biological data assume that the distribution of sample means (and the distributions of some other estimates) follows a normal distribution. The beauty of the central limit theorem is that if the sample size is large enough, then it is possible to use these powerful methods even when our data are sampled from a population that is not normally distributed.

Normal approximation to the binomial distribution

One frequent application of the central limit theorem is the normal approximation to the binomial distribution. Recall from Section 7.1 that the binomial distribution is a discrete probability distribution. It describes the number of “successes” in n independent trials, where p is the probability of success in any one trial. According to the central limit theorem, a binomial distribution with large n is approximated by a normal distribution. This is because the number of successes is a kind of sum: if each success is labeled as “1” and each failure is labeled as “0,” then the count of the number of successes is the sum over all individuals of the ones and zeros.

The normal approximation to the binomial distribution is helpful when you are using a calculator to obtain probabilities for a binomial distribution with a large n—exactly the case when calculating exact binomial probabilities is most time-consuming. This situation comes up, for example, when carrying out a binomial test on a large sample. With a binomial test, the probability of success p0 stated in the null hypothesis is often 0.5, in which case n needn’t be too large (say 30 or more) in order for the normal approximation to be appropriate.7 As a rule of thumb, the products np and n(1 − p) should both be greater than five in order to use the normal approximation if all you want to know is whether the P-value is less than 0.05. The approximation requires higher np if the goal is to provide exact P-values or to obtain critical values corresponding to significance levels smaller than 0.05 (such as when using α = 0.01 instead). We use the normal distribution that has the same mean and standard deviation as the binomial distribution—in other words, a mean of np and a standard deviation of √(np(1 − p)).

When the number of trials n is large, the binomial probability distribution for the number of successes is approximated by a normal distribution having mean np and standard deviation √(np(1 − p)).
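As a quick check of how good this approximation is, one can compare an exact binomial tail probability with its normal counterpart. The Python sketch below (assuming numpy and scipy; the values of n, p, and the cutoff are our own illustrative choices) uses the continuity correction described in Example 10.7.

import numpy as np
from scipy.stats import binom, norm

n, p = 100, 0.3                   # illustrative: np = 30 and n(1 - p) = 70, both > 5
mu = n * p                        # mean of the approximating normal distribution
sigma = np.sqrt(n * p * (1 - p))  # its standard deviation

exact = binom.sf(39, n, p)             # exact binomial Pr[X >= 40]
approx = norm.sf((39.5 - mu) / sigma)  # normal approximation with the
                                       # continuity correction of Example 10.7
print(exact, approx)              # both are close to 0.02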
Example 10.7 applies the normal approximation to a problem requiring the binomial test.

EXAMPLE 10.7 The only good bug is a dead bug

The brown recluse spider (Loxosceles reclusa) often lives in human houses8 throughout central North America. This spider is a moderate health threat, as its bite causes nasty, slow-healing wounds. Bites are rarely fatal, but the resulting wounds are so disgusting that you will be very glad we chose to show the spider here rather than the injuries it causes. Information on the spider’s diet is useful for developing effective pest management strategies. A diet-preference study (Sandidge 2003) gave each of 41 brown recluse spiders a choice between two crickets, one live and one dead. Thirty-one of the 41 spiders chose the dead cricket over the live one. Does this represent evidence for a diet preference?

A 95% confidence interval using the Agresti-Coull method (see Chapter 7) reveals that brown recluse spiders choose dead prey between about 60 and 86% of the time: 0.60 < p < 0.86. Let’s carry out a hypothesis test to determine the weight of evidence against the null hypothesis of no preference. The appropriate statistical method for analyzing these data is the binomial test with the following null and alternative hypotheses:

H0: Brown recluse spiders show no preference for live or dead prey (p = 0.5).
HA: Brown recluse spiders prefer one type of prey over the other (p ≠ 0.5).

This is a two-sided test. The outcome of the test will be decided according to the P-value (Section 6.2), the probability of obtaining a result as extreme as or more extreme than the observed result (i.e., 31 out of 41 spiders ate the dead prey item), assuming that the null hypothesis were true. This P-value would ordinarily be calculated from the binomial distribution with n = 41 and p = 0.5:

P = 2 Pr[X ≥ observed] = 2 (Pr[X = 31] + Pr[X = 32] + Pr[X = 33] + . . . + Pr[X = 41]),

where

Pr[X = i] = C(41, i) (0.5)^i (0.5)^(41 − i),

and C(41, i) is the binomial coefficient, the number of ways to choose i of the 41 trials to be successes. There are 11 different probabilities to calculate in this example, which would be time-consuming to do by hand or by calculator. It is much easier to approximate the answer by using the normal distribution. The normal approximation is appropriate here because both np and n(1 − p) are not too small (20.5 for both in this case). Under the null hypothesis, the mean of the best-fitting normal distribution is

µ = np = (41)(0.5) = 20.5,

and the standard deviation is

σ = √(np(1 − p)) = √(41(0.5)(0.5)) = 3.20.

Now we can use the normal approximation to calculate the probability of getting the observed value 31 or more by chance under this distribution, and then multiply by two to yield the P-value. We will use the normal distribution having the same mean and standard deviation as our binomial distribution. In principle, we could do this by calculating

Pr[X ≥ Observed] ≈ Pr[Z > (Observed − np) / √(np(1 − p))],

where “Observed” is the particular value of X that we observed in the data and the ≈ sign means “approximately equal to.” However, we can improve the accuracy of this approximation with a correction for continuity, as follows. Remember that the binomial distribution is a discrete distribution and the value Observed is an integer. To make the conversion to a continuous variable more seamless, we represent the probability of the discrete value Observed as the area under the continuous probability curve between Observed − ½ and Observed + ½.
Therefore, if we want to use the normal distribution to approximate the probability of obtaining a value of X greater than or equal to an Observed value, we use the corrected formula

Pr[X ≥ Observed] = Pr[Z > (Observed − ½ − np) / √(np(1 − p))].

When we want to approximate the probability of obtaining a value of X less than or equal to an Observed value, we use

Pr[X ≤ Observed] = Pr[Z < (Observed + ½ − np) / √(np(1 − p))],

adding a half, rather than subtracting it. In both cases we are including ½ above and ½ below the specific value of Observed to account for the difference between the discrete and continuous distributions. This is why it is called a “continuity correction.”

To calculate a two-sided P-value for our spider example, we need to calculate P = 2 Pr[X ≥ 31]. We can convert this to a Z-score by using the normal approximation with the continuity correction, which effectively calculates the probability of observing a value from the normal distribution greater than 30.5:

P = 2 Pr[Z > (31 − ½ − 41(0.5)) / √(41(0.5)(0.5))] = 2 Pr[Z > 3.12].

Looking to Statistical Table B, we find the probability that Z > 3.12 is approximately 0.0009. Multiplying this value by two to get the two-tailed P-value, we obtain P = 0.0018. This P-value is less than α = 0.05, so we reject the null hypothesis. The spiders indeed show a diet preference for dead prey.

This computation is much simpler than the exact calculation based on the binomial distribution, but how good is the approximation? The exact P-value, calculated using the binomial distribution, is 0.0015 rather than 0.0018. The normal approximation is not exact. It performs best when n is large and when the population proportion p is close to 0.5.

The normal approximation can make some otherwise time-consuming calculations manageable.

Summary

■ The normal distribution is a bell-shaped curve, a continuous probability distribution approximating the distribution of many variables in nature.
■ If a variable has a normal distribution, its mean, median, and mode are all the same. The normal distribution is symmetric around its mean.
■ To describe a normal distribution, we need to know the mean µ and the standard deviation σ.
■ For a normal distribution, about two-thirds (68.3%) of individuals are within one standard deviation of the mean. About 95% are within two standard deviations of the mean.
■ A standard normal distribution is a normal distribution with mean of zero and standard deviation equal to one.
■ All normal distributions can be converted to the standard normal distribution by computing, for each measurement, the number of standard deviations from the mean: Z = (Y − µ)/σ, where Z is called the “standard normal deviate.”
■ Means of random samples drawn from a normal distribution with mean µ and standard deviation σ are also normally distributed, with the same mean µ and with standard deviation σY¯ = σ/√n, where σY¯ is the standard error of the sample mean.
■ According to the central limit theorem, the sum (or average) of a large number of observations randomly sampled from a non-normal distribution is approximately normally distributed.
■ A binomial distribution with large n can be approximated by a normal distribution. As a rule of thumb, np and n(1 − p) should both be greater than five to use the normal approximation.

Quick Formula Summary

Z-standardization
What is it for? Converts values from any normal distribution with known mean µ and known variance σ² into standard normal deviates.
What does it assume? The original distribution is normal with known parameters.
Formula: Z = (Y − µ)/σ.
Normal approximation to the binomial distribution
What is it for? Approximates the binomial distribution using the normal distribution when n is large.
What does it assume? That np and n(1 − p) are both five or greater.
Formula: Pr[X ≥ Observed] = Pr[Z > (Observed − ½ − np) / √(np(1 − p))] and Pr[X ≤ Observed] = Pr[Z < (Observed + ½ − np) / √(np(1 − p))], where n is the number of trials, p is the probability of success for each trial, and Observed is the number of successes in the data.

PRACTICE PROBLEMS

1. Calculation practice: Finding the probability of a range of values using the normal distribution. The natural log of growth (change in radius per year in mm) of Engelmann spruce is approximately normally distributed with mean of 0.037 log units and standard deviation 0.385. Following these steps, determine the probability that a tree has a bad year, defined as having growth less than −0.050 log units in a year.
a. Make a sketch of the normal distribution with mean 0.037, and mark the values that we are trying to determine (i.e., those values less than −0.05).
b. Calculate the standard normal deviate (Z) associated with the value we are interested in here, −0.05.
c. We are interested in the probability of getting a value less than this Z value. Is this probability directly shown in Statistical Table B? If not, what quantity can we use to find what we need?
d. What is the probability that a random draw from a standard normal distribution will be greater than 0.226?
e. What is the probability that a random draw from a standard normal distribution will be less than −0.226?
f. What is the probability that a tree has a bad growth year, that is, growth less than −0.050 in log units?

2. Calculation practice: Normal approximation to a binomial test. From 1995 to 2008 in the United States, 531 of the 648 people who were struck by lightning were men. Test whether this proportion is different from the 50% that might be expected by the proportion of men in the population as a whole (Avon 2009). Use the binomial test with a normal approximation.
a. State the null hypothesis for this binomial test.
b. Calculate the mean of the null distribution for the number of men struck by lightning under this null hypothesis.
c. What is the standard deviation of the distribution for the number of men hit by lightning under the null hypothesis?
d. Is your target value (531) above or below the value given by the null hypothesis? If above, subtract one-half from that target for the continuity correction. If below, add a half for the continuity correction. (Photo: scars from a lightning strike.)
e. What is the standard normal deviate (Z) for the continuity-corrected observed number of men, using the mean and standard deviation calculated in parts (b) and (c)?
f. What is the probability under the normal distribution of getting a result of 531 or greater (including the continuity correction)?
g. What is the two-tailed P-value for this binomial test?
h. State the conclusion from your test.

3. Assume that Z is a number randomly chosen from a standard normal distribution. Use the standard normal table to calculate each of the following probabilities:
a. Pr[Z > 1.34]
b. Pr[Z < 1.34]
c. Pr[Z > 2.15]
d. Pr[Z < 1.2]
e. Pr[0.52 < Z < 2.34]
f. Pr[−2.34 < Z < −0.52]
g. Pr[Z < −0.93]
h. Pr[−1.57 < Z < −0.32]

4. It was rumored that Britain’s domestic intelligence service MI5 has an upper limit on the height of its spies, on the assumption that tall people stand out (although MI5 denies it).
The rumor said that, to apply to be a spy, you can be no taller than 5 feet 11 inches (180.3 cm) if you are a man, and no taller than 5 feet 8 inches (172.7 cm) if you are a woman (supposedly to allow the spies to blend in with a crowd).
a. If the mean height of British men is 177.0 cm, with standard deviation 7.1 cm, what proportion of British men would be precluded from being spies by this hypothetical height restriction? Assume that height follows a normal distribution.
b. The mean height of women in Britain is 163.3 cm, with standard deviation 6.4 cm. Assuming a normal distribution of female height, what fraction of women meet the height standard for application to MI5?
c. Sean Connery, the original James Bond in the movies, is about 183.4 cm tall. By how many standard deviations does he exceed the height limit for spies?

5. Use the three distributions labeled i, ii, and iii to answer the following questions.
a. Which of these distributions is most like the normal distribution? On what basis would you exclude the other two?
b. Which distribution would generate an approximately normal distribution of sample means, calculated from large random samples? Why?

6. The babies born in singleton births in the United States have birth weights that are approximately normally distributed with mean 3.296 kg and standard deviation 0.560 kg (Martin et al. 2011).
a. What is the probability of a baby being born weighing more than 5 kg?
b. What is the probability of a baby being born weighing between 3 kg and 4 kg?
c. What fraction of babies is more than 1.5 standard deviations from the mean in either direction?
d. What fraction of babies is more than 1.5 kg from the mean in either direction?
e. If you took a random sample of 10 babies, what is the probability that their mean weight Y¯ would be greater than 3.5 kg?

7. In the accompanying pairs of graphs of normal distributions, which distribution has the highest mean? Which has the highest standard deviation? Pay careful attention to the changes in scale of the x-axes.

8. In the accompanying graph, the red region accounts for 0.67 of the probability density. Estimate the standard deviation of this normal distribution.

9. Suppose that a variable has a normal distribution, that the mean is 35 mm, and that 20% of the population is larger than 50 mm.
a. What is the mode of this distribution?
b. What is the median of this distribution?
c. Complete the following sentence: Twenty percent of the distribution is smaller than _________.

10. Use the two distributions labeled i and ii below to answer the following questions.
a. If we drew repeated random samples of individuals from each distribution and calculated the mean for each sample, which distribution would yield a distribution of sample means that most closely followed a normal distribution?
b. Imagine that we drew the distribution of the sum of 100 random draws from distribution ii. What approximate shape would this distribution have?

11. A survey of European mitochondrial DNA variation has found that the most common haplotype (genotype), known as “H”, occurs in 40% of people (Roostalu et al. 2007). If we were to sample 400 individuals from the European population, what is the probability that
a. at least 180 are haplotype H?
b. at least 130 are haplotype H?
c. between 115 and 170 (inclusive) are haplotype H?
12. Ninety-one out of 220 people working as cast and crew of the movie The Conqueror, which was filmed in 1955 downwind from nuclear bomb tests, ultimately contracted cancer (see Practice Problem 4 in Chapter 7). Based on age alone, though, only a 14% cancer rate was expected. Test the null hypothesis that the incidence of cancer in this group of people was no different than that expected, using the normal approximation to the binomial distribution.

13. The following table lists the means and standard deviations of several different normal distributions. For each, a sample of 10 individuals was taken, as well as a sample of 30 individuals. For each sample, calculate the probability that the mean of the sample was greater than the given value of Y.

Mean   Standard deviation   Y      n = 10: Pr[Y¯ > Y]   n = 30: Pr[Y¯ > Y]
14     5                    15
15     3                    15.5
−23    4                    −22
72     50                   45

ASSIGNMENT PROBLEMS

14. Assume that Z is a number randomly chosen from a standard normal distribution. Use the standard normal table to calculate each of the following probabilities:
a. Pr[Z > 0.24]
b. Pr[Z < 0.24]
c. Pr[Z > 2.01]
d. Pr[Z < 1.02]
e. Pr[0.60 < Z < 1.4]
f. Pr[−2 < Z < 2]
g. Pr[Z < 0.45]
h. Pr[−0.2 < Z < 0.37]

15. The highest recorded temperature during the month of July for a given year in Death Valley, in California, has an approximately normal distribution with a mean of 123.8°F (!) and standard deviation of 3.1°F (Weather Source 2009).
a. What is the probability for a given year that the temperature never exceeds 120°F in a given July in Death Valley?
b. What is the probability that the temperature in Death Valley goes above 128°F during July in a randomly chosen year?

16. Draw a distribution that is approximately normal, with mean equal to 10 cm and variance equal to 360. On the same graph, draw the distribution of the means of samples taken from this distribution, if each sample was a random sample of 10 individuals.

17. The following data are 40 measurements of diameter growth rate of the tropical tree Dipteryx panamensis from a long-term study at La Selva, Costa Rica (Clark and Clark 2012). The data are log-transformed, with the original units in millimeters.
0.0, 0.1, 0.1, 0.3, 0.4, 0.4, 0.4, 0.5, 0.5, 0.5, 0.6, 0.6, 0.7, 0.7, 0.7, 0.7, 0.8, 0.8, 0.8, 1.2, 1.2, 1.3, 1.4, 1.6, 1.9, 2.0, 2.1, 2.1, 2.2, 2.2, 2.3, 2.4, 2.5, 2.5, 2.7, 2.7, 2.7, 2.7, 2.8, 3.1
a. Make an appropriate graph of the data.
b. Examine the graph. Do the data appear to be sampled from a population having a normal distribution? Why or why not? Identify all the features on which you based your conclusion.

18. In Europe, 53% of flowers of the rewardless orchid, Dactylorhiza sambucina, are yellow, whereas the remaining flowers are purple (Gigord et al. 2001). For this problem, you may use the normal approximation only if it is appropriate to do so.
a. If we took a random sample of a single individual from this population, what is the probability that it would be purple?
b. If we took a random sample of five individuals, what is the probability that at least three are yellow?
c. If we took many samples of n = 5 individuals, what is the expected standard deviation of the sampling distribution for the proportion of yellow flowers?
d. If we took a random sample of 263 individuals, what is the probability that no more than 150 are yellow?

19. The amount of money spent on health care per person varies enormously among countries (The World Bank 2013). In 2010, this expense ranged from 11.9 U.S. dollars per person (in Eritrea) to $8361 (in the United States).
The distribution of this per capita health care expenditure is very skewed, with a long tail corresponding to countries that spend a lot on health care per capita (see the top histogram in the accompanying graph). However, if we look at the log (base 10) of each country’s per capita health expenditure, it has a distribution that can be approximated by a normal distribution (see bottom histogram). On the log scale, the mean of the log expenditure is 2.47, with standard deviation equal to 0.72.
a. Assuming that log health expenditure is normal, calculate the proportion of countries that spend less than $100 per capita on health care. (This corresponds to a log expenditure of 2.)
b. Assuming that log health expenditure is normally distributed, calculate the proportion of countries that spend more than $1000 per capita on health care. (This corresponds to a log expenditure equal to 3.)
c. The true proportions of countries with per capita health expenditure less than $100 or more than $1000 are 0.30 and 0.21, respectively. Comment on why your answers from parts (a) and (b) above do not exactly match these values.

20. Recall from Example 10.4 that more women (29.1%) than men (11.1%) are excluded from the astronaut pilot program by the minimum and maximum height restrictions. What value of minimum height for women would exclude the same total proportion of women as men, given that 0.3% of women are too tall?

21. Draw a probability distribution that isn’t normal. Describe the features of your distribution that identify it as non-normal.

22. In the accompanying graph of a normal distribution, each of the two red areas represents one-sixth of the area under the curve. Estimate the following quantities from this graph:
a. The mean
b. The mode
c. The median
d. The standard deviation
e. The variance

23. The proportion of traffic fatalities resulting from drivers with high alcohol blood levels in 1982 was approximately normally distributed among U.S. states, with mean 0.569 and standard deviation 0.068 (U.S. Department of Transportation Traffic Safety Facts 1999).
a. What proportion of states would you expect to have more than 65% of their traffic fatalities from drunk driving?
b. What proportion of deaths due to drunk driving would you expect to be at the 25th percentile of this distribution?

24. The table at the bottom of the page lists the means and standard deviations of several different normal distributions. For each distribution, calculate the probability of drawing a single Y value greater than the given threshold and the probability of drawing a value less than that threshold.

TABLE FOR PROBLEM 24
Mean     Standard deviation   Threshold   Pr[Y > threshold]   Pr[Y < threshold]
14       5                    15
−23      3                    −16
14,000   5000                 9000
4        9                    18.5

25. In European earwigs, the males sometimes have long pincers protruding from the end of their abdomens, as shown in the accompanying photo. In graphs (a)−(c) we have plotted three frequency distributions of sample means of pincer lengths (in millimeters) based on random samples from an earwig population. One of the distributions shows means of samples based on n = 1, one shows means of samples based on n = 2, and one shows means of samples based on n = 8. Identify which frequency distribution corresponds to each sample size. Explain the basis for your decisions.

26. The crab spider, Thomisus spectabilis, sits on flowers and preys upon visiting honeybees, as shown in the photo at the beginning of the chapter. (Remember this the next time you sniff a wild flower.)
Do honeybees distinguish between flowers that have crab spiders and flowers that do not? To test this, Heiling et al. (2003) gave 33 bees a choice between two flowers: one had a crab spider and the other did not. In 24 of the 33 trials, the bees picked the flower that had the spider. In the remaining nine trials, the bees chose the spiderless flower. With these data, carry out the appropriate hypothesis test, using an appropriate approximation to calculate P.

27. The following table lists the mean and standard deviation of several different normal distributions. In each case, a sample of 20 individuals was taken, as well as a sample of 50 individuals. For each sample, calculate the probability that the mean of the sample was less than the given value.

TABLE FOR PROBLEM 27
Mean   Standard deviation   Value   n = 20: Pr[Y¯ < value]   n = 50: Pr[Y¯ < value]
−5     30                   −5.2
10     20                   8.0
−55    12                   −61.0
5      3                    12.5

INTERLEAF 6 Controls in medical studies

In 1994, researchers reported on the results of a study describing the effects of a new “wonder” drug (Lanza et al. 1994). The drug, lansoprazole, was intended to treat ulcers, and the study showed that 88% of the people who were treated with this drug got better within four weeks. Surely this was a remarkable achievement! Another group of people were followed in the study, however. At the beginning of the study, participants who had ulcer problems were randomly sorted into two groups. One group received the new medication as planned, but the other group received a chemically inert “sugar pill” instead. The group who received no pharmacologically active medication also improved over the course of the study—in fact, over 40% of the ulcer sufferers in this control group improved over the same four-week period.

I’m addicted to placebos. I could quit but it wouldn’t matter.
—Steven Wright

In this particular study, the patients treated with the new drug did indeed get better faster than those who were not treated. But when compared with the control group, the drug was not nearly as effective as it appeared to be when looked at by itself. Of the 88% who improved after taking the drug, over 40% would have improved even without the medication. How could this be? There are several reasons. First, for most medical conditions, patients tend to get better over time anyway.1 Most diseases are neither lethal nor permanent. We tend to go to the doctor when we are feeling at our worst, and therefore the odds are that we would soon start to improve after our worst days anyway, even without a new wonder drug. Second, humans (or at least many of them) like to please others, so there is a tendency to tell doctors that the treatment is more effective than it actually is. In both groups of the study, participants may have described an improvement in their condition, even if this was an exaggeration. Finally, being treated by a physician has benefits that go beyond the specifics of a particular drug. A doctor may suggest a new diet or advocate increased rest, for example, and participants in drug trials see doctors more often than they otherwise would and may therefore improve. One particularly interesting form of benefit from treatment is psychological. In some cases, simply the knowledge of being treated may be sufficient to improve a person’s condition.
This is the so-called placebo effect, an improvement in a medical condition that results from the psychological effects of medical treatment.2 A sugar pill is a placebo, designed to mimic all the conditions of the medical treatment except for the pharmacological effects of the new drug itself. Placebo effects are well documented for pain relief, but they are more questionable for other kinds of diseases (Hróbjartsson and Gøtzsche 2001). Placebo effects are typically larger for illnesses in which the response variable is subjectively reported by the patient. Placebo effects are smaller or nonexistent in studies that report on more objectively measured variables. The fact that the improvements are subjective does not mean that they are not real; pain, for example, is subjectively experienced but is nonetheless a real condition. In fact, MRI studies have shown that the placebo effect for pain has a neurological basis (Wager et al. 2004).

Even conditions requiring surgical interventions can show improvement without treatment. Studies of human surgery that include a “sham surgery” treatment—making surgical incisions without the specific treatment—are rare, for ethical reasons. In many cases, though, sham surgeries have been associated with real improvements in medical conditions. For example, ligation of the mammary artery to treat anginal pain was common in the middle part of the 20th century, with improvement rates of about two-thirds after this surgery. Subsequently, however, two studies showed that improvement rates in patients with this surgery were nearly identical to those for patients who received only a sham skin incision (Shapiro and Shapiro 1997).

What we have seen is that people can improve for a wide variety of reasons, even without specific treatment. It is therefore crucial that medical studies include control treatments, in which a randomized selection of participants is treated in every way identically to those receiving a new treatment, except for the treatment itself. These controls can take the form of sugar pills, but, by preference, control patients can receive the currently most effective treatment. In this way, all patients receive care, and if the new drug or treatment is better than the old one, we have evidence in support of making it the preferred treatment. When we report the results of a study, we want to be able to say that there is an effect, “all else being equal.” The goal of careful experimental design is to make all else equal.

11 Inference for a normal population

We learned in Chapter 10 that many variables in nature are approximately normally distributed. Most of the methods in this book are geared toward making inferences and testing hypotheses about variables that have a normal distribution in the population. In this chapter, we begin the analysis of normally distributed measurements, a theme that will continue for much of the rest of the book. We start by discussing estimation of the mean, including calculations for exact confidence intervals. Next we describe the simplest hypothesis test on a normal variable, the “one-sample t-test,” which lets us ask whether the measurements of a sample of data are consistent with a hypothesized value for the population mean. We finish the chapter by discussing how to make statistical statements about the standard deviation.
The t-distribution for sample means

The sampling distribution of a statistic is the probability distribution of all the values for a statistic that we might obtain when we sample the population. In this section, we describe the sampling distribution of a statistic called t. This number will allow us to use data to calculate confidence limits and carry out hypothesis tests about the means of populations.

Student’s t-distribution

Recall from Section 10.5 that the sampling distribution for the sample mean Y¯ is a normal distribution if the variable Y is itself normally distributed. As a result, we could use the Z-standardization to calculate probabilities for Y¯ under the normal curve:

Z = (Y¯ − µ)/σY¯,

where σY¯ is the standard error of the sample mean, the standard deviation of the sampling distribution of Y¯. In general, however, we can’t just apply the Z-standardization to Y¯-values calculated from real data. This is because to calculate Z we need to know σY¯, yet this is almost never possible. However, we can use the estimate of the standard error:

SEY¯ = s/√n,

which was first discussed in Section 4.2. The numerator is the sample standard deviation (s), which is our best estimate of the true standard deviation (σ). We will usually call SEY¯ simply “the standard error of the mean,” even though it is really only an estimate of the true standard error. Substituting SEY¯ for σY¯ in the formula for Z leads to a related quantity called Student’s t:

t = (Y¯ − µ)/SEY¯.

This is called “Student’s t” after its inventor,1 but we will usually refer to it as simply t. While the formula for t resembles that for Z, the important difference is that the sampling distribution for this statistic is not the normal distribution. Instead, t has a t-distribution. SEY¯ is not a constant like σY¯ but is a variable, varying by chance from sample to sample (because s itself changes from sample to sample). Therefore, the distribution of t is not the same as that of Z. Substituting SEY¯ for σY¯ adds sampling error to the quantity t. As a result, the sampling distribution of t is wider than the standard normal distribution. As the sample size increases, t becomes more like Z.

The difference between the sample mean and the true mean (Y¯ − µ), divided by the estimated standard error (SEY¯), has a Student’s t-distribution with n − 1 degrees of freedom.

The sample size determines the number of degrees of freedom (df) of the t-distribution. The degrees of freedom specify which particular version of the t-distribution we need. With a t-distribution, we must estimate a parameter from the data—the standard deviation—before calculating t. As a result, the number of degrees of freedom for t, when applied to inference about one sample, is one less than the number of independent data points: df = n − 1.

Consider the t-distribution for a sample size of five. With five individuals in a random sample, there are four degrees of freedom. Figure 11.1-1 plots the t-distribution with four degrees of freedom in blue; for comparison, the standard normal distribution is shown in red. In most respects, the t-distribution is similar to the standard normal distribution. It is symmetric around a mean of zero, is roughly bell shaped, and it has tails that fall off toward plus infinity and minus infinity.

FIGURE 11.1-1 Student’s t-distribution with four degrees of freedom (in blue), compared with the standard normal distribution (in red). The two distributions are similar, though not identical. The tails of the t-distribution have more probability than the normal distribution.
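The difference shows up most clearly in the tails. As a rough sketch in Python (assuming the scipy library; the cutoff of 2.0 is our own illustrative choice), we can compare the right-tail probability of the t-distribution with four degrees of freedom against that of the standard normal:

from scipy.stats import norm, t

cutoff = 2.0                  # an illustrative cutoff
print(norm.sf(cutoff))        # Pr[Z > 2.0], approx 0.023
print(t.sf(cutoff, df=4))     # Pr[t > 2.0] with df = 4, approx 0.058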
The t-distribution, however, is fatter in the tails than the standard normal distribution. The difference in the tails is crucial, because it is the tails that matter most when calculating confidence intervals and testing hypotheses. For example, 95% of the area under the curve of the t-distribution with 4 df is between −2.78 and 2.78; the remaining 5% lies under the tails outside this range (we explain in the next subsection how we obtained this number). In other words, in 95% of repeated random samples of size n = 5 measurements from a normal population, the resulting Y¯ will fall within 2.78 estimated standard errors of the true population mean (µ). With the standard normal (Z) distribution, on the other hand, 95% of the area under the curve lies between −1.96 and 1.96; the remaining 5% lies beyond these extreme values. The larger range of values of t, compared to Z, results from the uncertainty about the true value of the standard error.

Finding critical values of the t-distribution

Where did this value of 2.78 come from? The value 2.78 is the “5% critical value” of the t-distribution having df = 5 − 1 = 4 degrees of freedom. The 5% refers to the percentage of the area in the tails of the t-distribution (Figure 11.1-2). Every t-distribution has its own critical 5% t-value, depending on the number of degrees of freedom. Once we know the number of degrees of freedom, we can find the critical value by using a computer program or by using Statistical Table C in the back of this book.

FIGURE 11.1-2 The critical value of the t-distribution that confines a total of 5% of the area under the curve to its two tails, 2.5% to each side. With four degrees of freedom, this critical value is t = 2.78.

We use the symbol t0.05(2), df to indicate the 5% critical t-value of a t-distribution having “df” degrees of freedom. In this notation, the 0.05 stands for the fraction of the area under the curve lying in the tails of the distribution. The “(2)” indicates that the 5% area is divided between the two tails of the t-distribution—that is, 2.5% of the area under the curve lies above t0.05(2), df and 2.5% lies below −t0.05(2), df.

We can use an excerpt from Statistical Table C, depicted in Table 11.1-1, to show how to find the 5% critical value for a t-distribution with four degrees of freedom. First find the row in the table corresponding to four degrees of freedom. Then find the column that corresponds to α(2) = 0.05. The corresponding cell contains the number 2.78, which is the critical value we are looking for.

TABLE 11.1-1 Critical values of the t-distribution. Excerpted from Statistical Table C.

df     α(2) = 0.20   α(2) = 0.10   α(2) = 0.05   α(2) = 0.02   α(2) = 0.01
       α(1) = 0.10   α(1) = 0.05   α(1) = 0.025  α(1) = 0.01   α(1) = 0.005
1      3.08          6.31          12.71         31.82         63.66
2      1.89          2.92          4.30          6.96          9.92
3      1.64          2.35          3.18          4.54          5.84
4      1.53          2.13          2.78          3.75          4.60
5      1.48          2.02          2.57          3.36          4.03
...    ...           ...           ...           ...           ...

The other columns in Table 11.1-1 indicate the critical values corresponding to tail probabilities of α(2) = 0.20, 0.10, 0.02, and 0.01 under the curve for the t-distribution. Notice, though, that the column headings in Table 11.1-1 contain α(1) designations, too. These values, such as α(1) = 0.025 in the middle column of the table, indicate areas under the curve at only one tail of the t-distribution (Figure 11.1-3), and are needed when carrying out one-sided hypothesis tests.
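Statistical software can supply these critical values directly instead of Statistical Table C. A minimal sketch in Python (assuming scipy), reproducing the df = 4 entries of Table 11.1-1:

from scipy.stats import t

df = 4
# Two-tailed 5% critical value t_{0.05(2), 4}: 2.5% in each tail
print(t.ppf(1 - 0.05 / 2, df))   # approx 2.78
# One-tailed 5% critical value t_{0.05(1), 4}: all 5% in one tail
print(t.ppf(1 - 0.05, df))       # approx 2.13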
FIGURE 11.1-3 The critical value corresponding to 2.5% of the area under the curve at only one tail of the t-distribution.

In the remainder of this chapter, we use the t-distribution to compute exact confidence intervals for the population mean and to test hypotheses about the population mean.

The confidence interval for the mean of a normal distribution

As we discussed in Section 4.3, a confidence interval is a very useful way to express the precision of an estimate of a parameter. In that section, we gave an approximate confidence interval for the mean using the two standard errors (2SE) rule of thumb, but we can now do even better. We can use the t-distribution to calculate a more accurate confidence interval for the mean of a population having a normal distribution. Example 11.2 illustrates the appropriate method.

EXAMPLE 11.2 Eye to eye

The stalk-eyed fly, Cyrtodiopsis dalmanni, is a bizarre-looking insect from the jungles of Malaysia. Its eyes are at the ends of long stalks that emerge from its head, making the fly look like something from the cantina scene in Star Wars. These eye stalks are present in both sexes, but they are particularly impressive in males. The span of the eye stalk in males enhances their attractiveness to females as well as their success in battles against other males. The span, in millimeters, from one eye to the other was measured in a random sample of nine male stalk-eyed flies.2 The data are as follows:

8.69 8.15 9.25 9.45 8.96 8.65 8.43 8.79 8.63.

We can use these measurements to estimate the mean eye span in the fly population, and to quantify the uncertainty of our estimate using a 95% confidence interval. Assume that eye span has a normal distribution in the population.

The 95% confidence interval for the mean

The mean and standard deviation of the eye-span sample are Ȳ = 8.778 and s = 0.398. (Review Section 3.1 if you don't remember how to do these calculations.) How precise is this estimate of the population mean? Let's describe the precision by calculating a 95% confidence interval for the mean. In Section 4.2, we used the 2SE rule of thumb to calculate this interval, but here in Section 11.2 we will use the t-distribution to give a more accurate formula. To do so, though, we must use the fact, learned in Section 11.1, that the standardized difference (Ȳ − µ)/SE_Ȳ has a t-distribution with df = n − 1, assuming that Y is normally distributed. This means that in 95% of random samples from a normal distribution, the standardized difference will lie between −t_{0.05(2), df} and t_{0.05(2), df}:

-t_{0.05(2),df} < \frac{\bar{Y} - \mu}{SE_{\bar{Y}}} < t_{0.05(2),df}.

Rearranging this inequality shows that, in 95% of random samples from a normal distribution, Ȳ ± t_{0.05(2), df} SE_Ȳ will bracket the population mean:

\bar{Y} - t_{0.05(2),df}\,SE_{\bar{Y}} < \mu < \bar{Y} + t_{0.05(2),df}\,SE_{\bar{Y}}.

This is the exact 95% confidence interval for the mean. It is similar to the 2SE rule of thumb described in Section 4.2, except that we use the critical value from the t-distribution in the formula rather than "2."

In 95% of random samples from a normal distribution, the interval from Ȳ − t_{0.05(2), df} SE_Ȳ to Ȳ + t_{0.05(2), df} SE_Ȳ will bracket the population mean. This interval is the 95% confidence interval of the mean.

Let's calculate the 95% confidence interval for mean eye span in male stalk-eyed flies. To begin, we'll need the standard error of the mean:

SE_{\bar{Y}} = \frac{s}{\sqrt{n}} = \frac{0.398}{\sqrt{9}} = 0.133.

We also need the degrees of freedom to get the correct t-statistic. The sample size is n = 9, so the corresponding t has eight degrees of freedom.
From Statistical Table C, t_{0.05(2), 8} = 2.31. Notice that this t = 2.31 is greater than the 1.96 we would have gotten from the normal distribution, reflecting the fatter tails of the t-distribution. Putting all of the numbers from the eye-stalk data into the confidence interval equation, we get

8.778 - (2.31 \times 0.133) < \mu < 8.778 + (2.31 \times 0.133),

which yields

8.47 mm < µ < 9.08 mm.

Thus, the 95% confidence interval for the mean of the eye span in this species, calculated from this particular random sample, is from 8.47 mm to 9.08 mm. We do not know for certain whether the population mean eye span lies between 8.47 and 9.08. All we can say is that the 95% confidence interval for the mean will capture the population mean in 95% of random samples.

The 99% confidence interval for the mean

There's nothing special about 95%; it has just come to be adopted as the conventional level for a confidence interval. We can calculate a confidence interval corresponding to any value of α. The more general formula for a 100(1 − α)% confidence interval for the mean is

\bar{Y} - t_{\alpha(2),df}\,SE_{\bar{Y}} < \mu < \bar{Y} + t_{\alpha(2),df}\,SE_{\bar{Y}}.

After 95%, the 99% confidence interval is the next most popular. Its principal advantage is that it provides better coverage than the 95% interval: it covers the population mean in 99% of random samples. Let's calculate a 99% confidence interval of the mean for the stalk-eyed fly data. All of the numbers are the same as for the 95% interval, except that now we need t_{0.01(2), 8}, because α is now 1 − 0.99 = 0.01. To use Statistical Table C, we go to the row for df = 8 and then over to the column for α(2) = 0.01. Confirm for yourself that this yields t_{0.01(2), 8} = 3.36. Thus, the formula for the 99% confidence interval is

\bar{Y} - t_{0.01(2),df}\,SE_{\bar{Y}} < \mu < \bar{Y} + t_{0.01(2),df}\,SE_{\bar{Y}}.

When we apply this formula to the stalk-eyed fly data, we get

8.778 - (3.36 \times 0.133) < \mu < 8.778 + (3.36 \times 0.133),

which yields

8.33 mm < µ < 9.22 mm.

The 99% confidence interval is broader than the 95% interval, because we have to include more possibilities to achieve the higher probability of covering the true mean.

The one-sample t-test

Many methods have been developed to test hypotheses about the means of populations. Because these tests are often based on the t-distribution, many are called t-tests. In this section we learn about the one-sample t-test, which is designed to compare the mean from a sample of individuals with a value for the population mean proposed in the null hypothesis. The null hypothesis for a one-sample t-test is that the true mean is equal to a specific value, µ0. The null and alternative hypothesis statements are as follows.

H0: The true mean equals µ0.
HA: The true mean does not equal µ0.

Suppose, for example, we have a null hypothesis that the mean of a variable in a population is µ0 = 0. We can then use the one-sample t-test to determine whether the Ȳ that we calculate from a sample is sufficiently different from zero to warrant rejection of the null hypothesis.

The one-sample t-test compares the mean of a random sample from a normal population with the population mean proposed in a null hypothesis.

The test statistic for the one-sample t-test is t (no surprise), and it is calculated as

t = \frac{\bar{Y} - \mu_0}{SE_{\bar{Y}}},

where µ0 is the population mean proposed by the null hypothesis, Ȳ is the sample mean, and SE_Ȳ is the sample standard error of the mean. The sampling distribution of this test statistic under H0 is the t-distribution having n − 1 degrees of freedom.
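Before we turn to P-values, note that the interval calculations above are easy to reproduce with software. A minimal sketch in Python (the array name span is ours) recomputes both the 95% and 99% intervals for the stalk-eyed fly data:

import numpy as np
from scipy import stats

span = np.array([8.69, 8.15, 9.25, 9.45, 8.96, 8.65, 8.43, 8.79, 8.63])
n = len(span)
ybar = span.mean()                      # 8.778
se = span.std(ddof=1) / np.sqrt(n)      # 0.133
for conf in (0.95, 0.99):
    tcrit = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)   # 2.31 or 3.36
    print(conf, ybar - tcrit * se, ybar + tcrit * se)
# prints roughly (8.47, 9.08) and (8.33, 9.22)

The only ingredient that changes between the two intervals is the critical value of t, exactly as in the hand calculation.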
The P-value for the test can be computed by comparing the observed t with the Student's t-distribution. Example 11.3 shows you how.

EXAMPLE 11.3 Human body temperature

Normal human body temperature, as kids are taught in North America, is 98.6°F. But how well is this supported by data? Researchers obtained body-temperature measurements on randomly chosen healthy people (Shoemaker 1996). The data for 25 of those people are as follows:

98.4 99.0 98.0 99.1 97.5 98.6 98.2 99.2 98.4 98.8 97.8 98.8 99.5 97.6 98.6 98.8 98.8 99.4 97.4 100.0 97.9 99.0 98.4 97.5 98.4

The body temperatures are not all identical to 98.6°F, but are the measurements consistent with a population mean of 98.6°F? Body temperature is approximately normally distributed, as shown in Figure 11.3-1, so a one-sample t-test is appropriate. Our null and alternative hypotheses are as follows.

H0: The mean human body temperature is 98.6°F.
HA: The mean human body temperature is different from 98.6°F.

FIGURE 11.3-1 The frequency distribution of body temperatures in a sample of 25 individuals.

The null hypothesis is not arbitrary, because we are testing the common wisdom that the mean is µ0 = 98.6°F. The test is two-sided. A sample mean much higher than 98.6°F or a sample mean much lower than 98.6°F would both lead to rejection of the null hypothesis. The sample mean body temperature is Ȳ = 98.524, and the standard deviation is s = 0.678. The standard error is

SE_{\bar{Y}} = \frac{0.678}{\sqrt{25}} = 0.136.

We can now use the test statistic t to measure how different the observed mean Ȳ is from 98.6. If the sample mean perfectly matched the hypothesized value, then t would equal zero. Some difference is expected just by chance, but if the null hypothesis is true, then t should have a t-distribution with n − 1 df. If the difference between Ȳ and 98.6 is too large, then the observed t will lie out in one of the tails of the t-distribution. Hence, by comparing this observed t to the t-distribution, we assess whether 98.6 is a reasonable fit to the data. If the null hypothesis is not a good fit, then we reject H0 and conclude that the population mean µ does not equal 98.6. The t-statistic is

t = \frac{\bar{Y} - \mu_0}{SE_{\bar{Y}}} = \frac{98.524 - 98.6}{0.136} = -0.56.

Under the null hypothesis, the sampling distribution of t is the t-distribution with n − 1 degrees of freedom. There are 25 individuals in our sample, so df = 25 − 1 = 24. A perfect match between the null hypothesis and the data would mean that Ȳ = µ0, and t would equal zero. Our question becomes: Is t = −0.56 sufficiently different from zero that we should reject the null hypothesis?

The P-value is the probability of obtaining a result as extreme as, or more extreme than, t = −0.56, assuming that the null hypothesis is true. This probability is the area under the t-distribution shown in Figure 11.3-2. Both tails of the t-distribution are included in the probability, because the test is two-tailed. The P-value is

P = Pr[t < −0.56] + Pr[t > 0.56].

Because the t-distribution is symmetric around zero, this is the same as

P = 2 Pr[t > 0.56].

Using a computer, we find that P = 0.58. Since P is greater than 0.05, we do not reject the null hypothesis. Our results would not surprise someone who believed that mean body temperature really is 98.6°F.

FIGURE 11.3-2 The t-distribution with 24 degrees of freedom. Shaded areas include all values less than −0.56 and greater than 0.56. These are the values as extreme as, or more extreme than, the t-statistic calculated from the data.

The same conclusion is obtained if we use Statistical Table C for the t-distribution.
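For reference, the "using a computer" step can be carried out in a few lines. A sketch in Python (the array name temp is ours):

import numpy as np
from scipy import stats

temp = np.array([98.4, 99.0, 98.0, 99.1, 97.5, 98.6, 98.2, 99.2, 98.4,
                 98.8, 97.8, 98.8, 99.5, 97.6, 98.6, 98.8, 98.8, 99.4,
                 97.4, 100.0, 97.9, 99.0, 98.4, 97.5, 98.4])
# One-sample t-test against the null value 98.6; two-sided by default.
result = stats.ttest_1samp(temp, popmean=98.6)
print(result.statistic, result.pvalue)   # about -0.56 and 0.58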
The probability that we need, Pr[t > 0.56 or t < −0.56], is not given directly in Statistical Table C, but we can find the critical values of t for different α values. Remember that the critical value of a test statistic marks off the point or points in the distribution that have a certain probability α in the tails of the distribution. Values of the test statistic that are further in the tails have P-values lower than α. In other words, if the value of t is further from zero than the critical value, then we can reject the null hypothesis. If the calculated value of t is closer to zero than the critical value, then we cannot reject the null hypothesis.

For the present example, we find the row in Statistical Table C corresponding to 24 degrees of freedom and move to the right to find the critical t-value corresponding to the column α(2) = 0.05. Confirm for yourself that this value is t_{0.05(2), 24} = 2.06. This critical value defines 5% in the tails of the t-distribution, with 2.5% in each tail for a two-tailed test (Figure 11.3-3). In other words, 5% of the area under the t-distribution with df = 24 falls outside either −2.06 or 2.06. The t-value of −0.56 that we calculated from the data falls inside this range. The observed t-statistic does not fall within one of the tails. Therefore, P > 0.05, and we fail to reject the null hypothesis.

FIGURE 11.3-3 Five percent of the area under the curve of the t-distribution with 24 degrees of freedom lies in the two tails beyond −2.06 and 2.06. The t-value calculated from the body temperature data (t = −0.56) lies closer to the value expected by the null hypothesis. Therefore, the null hypothesis is not rejected with these data.

At a significance level of α = 0.05, we would not reject the null hypothesis based on this sample of body temperatures. In other words, these data would be reasonably likely if the null hypothesis were true. With this sample, we cannot reject the view that mean body temperature in the sampled human population is 98.6°F.

Does this result imply that the common wisdom about human body temperature is correct? Well, not necessarily: the null hypothesis might still be false, but we may have lacked sufficient power to detect the difference. What range of values for mean body temperature is most plausible given the data? To answer this question, we calculated the 95% confidence interval for the mean using the methods described in Section 11.2. We obtained

98.24°F < µ < 98.80°F.

Check the calculations for yourself. This 95% confidence interval for the mean puts fairly narrow bounds on the mean temperature in the population. The value 98.6 is within the 95% confidence interval of the mean for these data, but slightly different values for the mean body temperature, between about 98.2 and 98.8, are also consistent with the data.

The effects of larger sample size: body temperature revisited

In the body-temperature study in Example 11.3, we used a sample with only 25 data points to do our calculations. A larger sample is available that includes 130 individuals taken from the same population. Using this larger sample, the estimated mean was 98.25°F with s = 0.733°F. A t-test using this data set finds t = −5.44. (Check this value for yourself, for practice.) This t corresponds to P = 0.000016, so a sample mean this far from 98.6°F would be very unlikely if 98.6°F were the true population mean. This value is clearly rejected as the mean body temperature. The confidence interval for the mean, using the larger data set, is

98.12 < µ < 98.38.
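When only summary statistics are available, as with this larger sample, the same test can be computed directly from the formula for t. A sketch in Python (values taken from the paragraph above):

from math import sqrt
from scipy import stats

n, ybar, s, mu0 = 130, 98.25, 0.733, 98.6
t = (ybar - mu0) / (s / sqrt(n))        # about -5.44
p = 2 * stats.t.sf(abs(t), df=n - 1)    # two-tailed P-value
print(t, p)                             # P is far below 0.05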
Why should the larger sample get a different answer than the subset? With a larger sample size, sampling error in the estimate of the mean tends to decrease. Even though the smaller sample by chance had an estimated mean not very different from the larger sample, the larger sample led to an estimate with considerably narrower bounds. Thus, a larger sample is more likely to reject a false null hypothesis.3 Assuming that the result from the analysis of the larger sample is true, our smaller sample gave us a Type II error (i.e., a failure to detect a false null hypothesis).

Assumptions of the one-sample t-test

Methods described for calculating confidence intervals for the mean, and for testing a population mean using the one-sample t-test, make only two assumptions:

■ The data are a random sample from the population. (This assumption is shared by every method of statistical inference in this book.)
■ The variable is normally distributed in the population.

An excellent way to assess the assumption of normality is to examine the frequency distribution of the data using a histogram or other graphical method and look for skew, bimodality, or other departures from the normal distribution. Chapter 13 discusses methods to investigate this assumption in more detail. Few variables in biology show an exact match to the normal distribution, but an exact match is not essential. Under certain conditions (discussed in Chapter 13), the t-test and the confidence interval calculations are robust to violations of the assumption of normality. A method is robust if the answer it gives is not sensitive to modest departures from the assumptions. Chapter 13 gives a more thorough account of this issue. Until then, we will consider only data that meet the assumption of normality reasonably well.

Estimating the standard deviation and variance of a normal population

Up to now, all of our attention has been focused on estimating and testing the population mean. But the standard deviation (or variance) is also interesting in many studies. For example, male stalk-eyed flies (see Example 11.2) often battle each other by pairing head to head and staring at each other for an extended period. The male with the longer eye stalks usually wins these staring matches, thus giving him greater access to females. When paired males have similar eye spans, the outcomes of the staring matches are difficult to predict. It is therefore interesting to know the standard deviation of eye span in the population, because it predicts the typical difference in eye span of any two males.

We must use the sample standard deviation s to estimate the population standard deviation σ. Can we say how precise our estimate s is? We can, provided that the data are from a population having a normal distribution.

Confidence limits for the variance

It is easiest first to work with the variance, the square of the standard deviation. The population parameter is σ² and the sample estimate is s². For the stalk-eyed fly data, the estimate of σ² is s² = 0.1587 mm². To calculate a confidence interval for the sample variance, all we need to know is its sampling distribution. Theory tells us that if a variable Y has a normal distribution, then the sampling distribution of the quantity

\chi^2 = \frac{(n-1)s^2}{\sigma^2}

is the χ² distribution with n − 1 degrees of freedom. We encountered the χ² distribution in Chapters 8 and 9, where we used it to approximate the null sampling distribution of a goodness-of-fit test statistic.
Here the same theoretical distribution is useful for calculating the precision of estimates of population variability. A typical χ² distribution is shown in Figure 11.5-1. Notice that χ² is always zero or positive, and it extends all the way to positive infinity. This fits with the properties of a sample variance, which cannot be less than zero but can be indefinitely large. Notice also that, unlike the normal distribution, the χ² distribution is not symmetric around its mean.

FIGURE 11.5-1 The χ² distribution with critical values for a 95% confidence interval indicated. Ninety-five percent of the area under the χ² curve falls between the two critical values, and 5% lies beyond them.

These features make it possible to determine confidence intervals for the variance. The 1 − α confidence interval is

\frac{df\,s^2}{\chi^2_{\alpha/2,\,df}} < \sigma^2 < \frac{df\,s^2}{\chi^2_{1-\alpha/2,\,df}},

where χ²_{α/2, df} and χ²_{1−α/2, df} represent the critical values of the χ² distribution corresponding to the upper and lower tails in Figure 11.5-1. The area under the χ² curve to the right of χ²_{α/2, df} is α/2. The other α/2 is to the left of χ²_{1−α/2, df}. Because the χ² distribution is not symmetric, we are forced to calculate the left and right tails separately. The critical values are available in Statistical Table A at the back of this book.

Let's find the 95% confidence interval of the variance for eye span in male stalk-eyed flies. We have already calculated s² to be 0.1587. The number of degrees of freedom is just one less than the number of data points, so df = 9 − 1 = 8, as before. All that's left is finding the values of χ²_{α/2, df} and χ²_{1−α/2, df}, which are χ²_{0.025, 8} and χ²_{0.975, 8}, respectively, for the stalk-eyed fly study. Looking in Statistical Table A under eight degrees of freedom, we find the χ² value that has 2.5% of the area to the right of it is χ²_{0.025, 8} = 17.53, and the χ² value that has 2.5% probability to the left (i.e., 97.5% of the area to the right) of it is χ²_{0.975, 8} = 2.18. Thus, the 95% confidence interval for the variance of this distribution is given by

\frac{df\,s^2}{\chi^2_{\alpha/2,\,df}} < \sigma^2 < \frac{df\,s^2}{\chi^2_{1-\alpha/2,\,df}}
\frac{8(0.1587)}{17.53} < \sigma^2 < \frac{8(0.1587)}{2.18}
0.072 < \sigma^2 < 0.582.

Any value of the population variance between 0.072 mm² and 0.582 mm² is reasonably plausible, based on these data. The true variance would be within the calculated interval in 95% of samples. The estimate of the variance (0.1587) is closer to the lower bound (0.0724) than the upper bound (0.5824) because the χ² distribution is asymmetrical. However, a 95% confidence interval for the variance based on a random sample has an equal chance of falling either below or above the true variance.

Confidence limits for the standard deviation

An approximate 95% confidence interval for the standard deviation can be obtained by taking the square roots of the upper and lower limits of the 95% confidence interval for the variance. Therefore, the 95% confidence interval for the standard deviation of eye span is given by

\sqrt{0.0724} < \sigma < \sqrt{0.5824},

or

0.27 < σ < 0.76.

This interval surrounds the sample standard deviation of s = 0.40.

Assumptions

The assumptions of the method for calculating confidence intervals for the variance and standard deviation are the same as for confidence intervals for the mean; namely, the sample must be a random sample from the population, and the variable must have a normal distribution in the population. Unfortunately, the formulas for the confidence intervals for variance and standard deviation are very sensitive to the assumption of normality. The method is not robust to departures from this assumption.
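The critical values and the interval itself can be checked with software. A sketch in Python (note the quantile convention: scipy's chi2.ppf takes the area to the left, so the text's χ²_{0.025, 8}, which has 2.5% of the area to its right, is chi2.ppf(0.975, 8)):

import numpy as np
from scipy import stats

span = np.array([8.69, 8.15, 9.25, 9.45, 8.96, 8.65, 8.43, 8.79, 8.63])
df = len(span) - 1
s2 = span.var(ddof=1)                          # about 0.1587 mm^2
lower = df * s2 / stats.chi2.ppf(0.975, df)    # divide by 17.53
upper = df * s2 / stats.chi2.ppf(0.025, df)    # divide by 2.18
print(lower, upper)                            # about 0.072 and 0.582
print(np.sqrt(lower), np.sqrt(upper))          # approximate 95% CI for sigma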
11.6 Summary

■ If a variable Y is normally distributed in the population with mean µ and we have random samples of n individuals, then the sample means Ȳ are also normally distributed, with mean equal to µ (the same as the true mean of Y) and standard error σ_Ȳ = σ/√n, where σ is the true standard deviation in the population.
■ The estimated standard error of the distribution of sample means is SE_Ȳ = s/√n.
■ If the population is normally distributed, then the standardized sample mean

t = \frac{\bar{Y} - \mu}{SE_{\bar{Y}}}

has a Student's t-distribution with n − 1 degrees of freedom.
■ The t-distribution can be used to calculate a confidence interval for the mean.
■ A one-sample t-test compares the sample mean with µ0, a specific value for the population mean proposed in a null hypothesis. Under the null hypothesis that the population mean is equal to µ0, the sampling distribution of the test statistic

t = \frac{\bar{Y} - \mu_0}{SE_{\bar{Y}}}

is a t-distribution with n − 1 degrees of freedom.
■ The confidence interval for the variance is based on the χ² distribution. Take the square root of the limits of the confidence interval for the variance to yield an approximate confidence interval for the standard deviation.
■ The confidence intervals for mean and variance, as well as the one-sample t-test, assume that the data are randomly sampled from a population with a normal distribution.

Quick Formula Summary

Confidence interval for a mean
What does it assume? Individuals are chosen as a random sample from a population that is normally distributed.
Estimate: Ȳ
Parameter: µ
Degrees of freedom: n − 1
Formula: \bar{Y} - t_{\alpha(2),df}\,SE_{\bar{Y}} < \mu < \bar{Y} + t_{\alpha(2),df}\,SE_{\bar{Y}}, where SE_Ȳ = s/√n. This formula gives the 1 − α confidence interval.

One-sample t-test
What is it for? Compares the sample mean of a numerical variable to a hypothesized value, µ0.
What does it assume? Individuals are randomly sampled from a population that is normally distributed.
Test statistic: t
Distribution under H0: The t-distribution with n − 1 degrees of freedom.
Formula: t = \frac{\bar{Y} - \mu_0}{s/\sqrt{n}}.

Confidence interval for variance
What does it assume? Individuals are chosen as a random sample from a normally distributed population.
Estimate: s²
Parameter: σ²
Degrees of freedom: df = n − 1
Formula: \frac{df\,s^2}{\chi^2_{\alpha/2,\,df}} < \sigma^2 < \frac{df\,s^2}{\chi^2_{1-\alpha/2,\,df}}.

PRACTICE PROBLEMS

1. Calculation practice: Confidence interval for a mean and one-sample t-test. As the world warms, the geographic ranges of species might shift toward cooler areas. Chen et al. (2011) studied recent changes in the highest elevation at which species occur. Typically, higher elevations are cooler than lower elevations. Below are the changes in highest elevation for 31 taxa, in meters, over the late 1900s and early 2000s. (Many taxa were surveyed, including plants, vertebrates, and arthropods.) Positive numbers indicate upward shifts in elevation, and negative numbers indicate shifts to lower elevations. The values are displayed in the accompanying figure.

58.9, 7.8, 108.6, 44.8, 11.1, 19.2, 61.9, 30.5
12.7, 35.8, 7.4, 39.3, 24.0, 62.1, 24.3, 55.3
32.7, 65.3, −19.3, 7.6, −5.2, −2.1, 31.0, 69.0
88.6, 39.5, 20.7, 89.0, 69.0, 64.9, 64.8

a. What is the sample size n?
b. What is the mean of these data points? Remember to give the units.
c. What is the standard deviation of elevational range shift? (Give the units as well.)
d. What is the standard error of the mean for these data?
e. How many degrees of freedom will be associated with a confidence interval and a one-sample t-test for the mean elevation shift?
f. What value of α is needed for a 95% confidence interval?
g. What is the critical value of t for this α and number of degrees of freedom?
h. What assumptions are necessary to use the confidence interval calculations in this chapter?
i. Calculate the 95% confidence interval for the mean using these data.
j. For the one-sample t-test, write the appropriate null and alternative hypotheses.
k. Calculate the test statistic t for this test.
l. What assumptions are necessary to do a one-sample t-test?
m. Describe the P-value for this test as accurately as you can.
n. Did species change their highest elevation on average?

2. Calculation practice: Confidence interval for variance. Refer to Practice Problem 1. Using the data, calculate the 95% confidence interval for the variance with the following steps.

a. What does the confidence interval refer to: the sample variance or the population variance?
b. What assumptions are necessary to apply the formula discussed in this chapter for the confidence interval for variance?
c. What is the sample variance of these data?
d. How many degrees of freedom are associated with these data for estimates of variance?
e. What is α for this analysis?
f. What is the critical value of χ² for α/2?
g. What is the critical value of χ² for 1 − α/2?
h. Calculate the 95% confidence interval for the variance of the elevational range shift.

3. For each of the following, the mean, standard deviation, sample size, and desired confidence interval are given. Find the critical values of t required for a confidence interval of the mean.

a. Ȳ = 14, s = 32, n = 12, 95% confidence interval
b. Ȳ = −23, s = 12, n = 32, 95% confidence interval
c. Ȳ = 144, s = 2.1, n = 101, 99% confidence interval
d. Ȳ = 3.21, s = 38, n = 23, 95% confidence interval
e. Ȳ = −152, s = 38, n = 8, 99% confidence interval

4. Refer to Practice Problem 3. For each of the parts (a)-(e), find the two χ² critical values necessary to calculate a confidence interval of the variance.

5. Measurements of the distance between the canine tooth and last molar for 35 wolf upper jaws were made by a researcher. He found the 95% confidence interval for the mean to be 10.17 cm < µ < 10.47 cm and the 99% confidence interval to be 10.21 cm < µ < 10.44 cm. Without seeing the data, explain why he must have made a mistake.

6. Here are the data on wolf upper jaws from Practice Problem 5, in centimeters. There are 35 individuals in this data set. Assume that this variable is normally distributed.

10.2, 10.4, 9.9, 10.7, 10.3, 9.7, 10.3, 10.7, 10.1, 10.6, 10.3, 10.0, 10.2, 10.1, 10.3, 9.9, 9.7, 10.6, 10.4, 10.1, 10.6, 10.3, 10.3, 10.5, 10.2, 10.2, 10.5, 10.1, 11.2, 10.5, 10.3, 10.0, 10.3, 10.7, 11.1

a. Draw a graph to confirm that the frequency distribution of the data is roughly bell shaped.
b. What are the sample mean and the standard error of the mean for this data set? Provide an interpretation of this standard error.
c. Find the 95% confidence interval for the mean.
d. Find the 99% confidence interval for the variance.
e. Find the 99% confidence interval for the standard deviation.

7. In the data set on wolf upper jaws (Practice Problem 6), each measurement was actually the average of two measurements made on the left and right sides of the jaw of an individual wolf. Thus, a total of 70 measurements were made. Could we use n = 70 when calculating confidence intervals for the mean and variance? Why or why not?

8. Polyandry is the name given to a mating system in which females mate with more than one male in a breeding season.
This mating system leads to competition for fertilization between sperm of different males. The prediction has been made that males in polyandrous populations should evolve larger testes than males in monogamous populations (where females mate with only one male), because larger testes produce more sperm. To test this prediction, researchers carried out an experiment on eight separate lines of yellow dung flies (Hosken and Ward 2001). In four of these lines, each female was mated with three males before laying eggs (the polyandrous populations). In the other four lines, the females mated only once. After 10 generations, the testes size (in cross-sectional area) was measured in each of these lines. The four monogamous lines had testes with areas of 0.83, 0.85, 0.82, and 0.89 mm². The polyandrous lines had testes areas of 0.96, 0.94, 0.99, and 0.91 mm².

a. Draw a graph to compare the testes areas of males in the two experimental treatments. What association is suggested?
b. Estimate the mean and standard deviation of the testes areas for both monogamous and polyandrous lines.
c. What is the standard error of the mean for each group?
d. What is the 95% confidence interval for the mean testes area in polyandrous lines?
e. What is the 99% confidence interval for the standard deviation of testes area among monogamous lines?

9. Community ecologists draw "food webs" to describe the predator and prey relationships between all organisms living in an area. A theoretical model predicts that a measure of the structure of food webs called "diet discontinuity" should be zero. Diet discontinuity is a measure of the relative numbers of predators whose prey are not ordered contiguously. Researchers have measured discontinuity scores for seven different food webs in nature. The values are given below (Cattin et al. 2004).

0.35, 0.08, 0.004, 0.08, 0.32, 0.28, 0.17

Assume that discontinuity in natural food webs has a normal distribution. Are the results consistent with the model's prediction of a zero mean discontinuity score? Carry out an appropriate hypothesis test.

10. As part of a larger study into the role of the hippocampus in memory, Fortin et al. (2004) devised a test that required rats to choose between two odors, one of which had previously been presented to them as the first in a series of odors. To validate their procedure, the researchers tested whether normal rats are able to remember and choose the odor previously presented. By chance, the rats should score only 50% on the test. But if they are able to remember, they should score better than 50%. Seven normal rats were taken through the protocol, and their scores on the memory test were on average 68.4%, with a standard deviation of 7.1%. Do the rats show the ability to perform this task at levels better than that expected by chance? State all necessary assumptions.

11. A four-year review at the Provincial Hospital in Alotau, Papua New Guinea (Barss 1984), found that about 1/40 of their hospital admissions were injuries due to falling coconuts. If coconuts weigh on average 3.2 kg, and the upper bound of the 95% confidence interval is 3.5 kg, what is the lower bound of this confidence interval? Assume a normal distribution of coconut weights.

12. The following tables give the confidence intervals of either the standard deviation or variance from different samples. Provide the confidence interval of the other measure of spread (i.e., either standard deviation or variance).
Standard deviation        Variance
2.22 < σ < 4.78           ?
?                         425.4 < σ² < 678.8
36.4 < σ < 59.6           ?
?                         185.8 < σ² < 279.0

13. Can a human swim faster in water or in syrup? It is unknown whether the increase in the friction of the body moving through the syrup (slowing the swimmer down) is compensated by the increased power of each stroke (speeding the swimmer up). Finally, an experiment was done4 in which researchers filled one pool with water mixed with syrup (guar gum) and another pool with normal water (Gettelfinger and Cussler 2004). They had 18 swimmers swim laps in both pools in random order. The data are presented as the relative speed of each swimmer in syrup (speed in the syrup pool divided by his or her speed in the water pool). If the syrup has no effect on swimming speed, then relative swim speed should have a mean of 1. The data, which are approximately normally distributed, are as follows:

1.08, 0.94, 1.07, 1.04, 1.02, 1.01, 0.95, 1.02, 1.08
1.02, 1.01, 0.96, 1.04, 1.02, 1.02, 0.96, 0.98, 0.99

ΣY = 18.21, ΣY² = 18.4529.

a. Draw a graph showing the frequency distribution. Why is this a good idea? What trend is suggested?
b. Test the hypothesis that relative swim speed in syrup has a mean of 1.
c. How uncertain are we about true relative swimming speed in syrup? Use the 99% confidence interval to find out.

ASSIGNMENT PROBLEMS

14. Astronauts increased in height by an average of approximately 40 mm (about an inch and a half) during the Apollo-Soyuz missions, due to the absence of gravity compressing their spines during their time in space. Does something similar happen here on Earth? An experiment supported by NASA measured the heights of six men immediately before going to bed, and then again after three days of bed rest (Styf et al. 1997). On average, they increased in height by 14 mm, with standard deviation 0.66 mm. Find the 95% confidence interval for the change in height after three days of bed rest.

15. Two different researchers measured the weight of two separate samples of ruby-throated hummingbirds from the same population. Each calculated a 95% confidence interval for the mean weight of these birds. Researcher 1 found the 95% confidence interval to be 3.12 g < µ < 3.48 g, while Researcher 2 found the 95% confidence interval to be 3.05 g < µ < 3.62 g.

a. Why would the two researchers get different answers?
b. Which researcher most likely had the larger sample?
c. Can you be certain about your answer in part (b)? Why or why not?

16. In the Northern Hemisphere, dolphins swim predominantly in a counterclockwise direction while sleeping. A group of researchers wanted to know whether the same was true for dolphins in the Southern Hemisphere (Stafne and Manger 2004). They watched eight sleeping dolphins and recorded the percentage of time that the dolphins swam clockwise. Assume that this is a random sample and that this variable has a normal distribution in the population. These data are as follows:

77.7, 84.8, 79.4, 84.0, 99.6, 93.6, 89.4, 97.2

a. What is the mean percentage of clockwise swimming for Southern Hemisphere dolphins?
b. What is the 95% confidence interval for the mean time swimming clockwise in the Southern Hemisphere dolphins?
c. What is the 99% confidence interval for the mean time swimming clockwise in the Southern Hemisphere dolphins?
d. What is the best estimate of the standard deviation of the percentage of clockwise swimming?
e. What is the median of the percentage of clockwise swimming?

17. Male koalas bellow5 during the breeding season, but do females pay attention?
Charlton et al. (2012) measured responses of estrous female koalas to playbacks of bellows that had been modified on the computer to simulate male callers of different body size. Females were placed one at a time into an enclosure while loudspeakers played bellows simulating a larger male on one side (randomly chosen) and a smaller male on the other side. Male bellows were played repeatedly, alternating between sides, over 10 minutes. Females often turned to look in the direction of a loudspeaker (invisible to her) during a trial. The following data measure the preference of each of 15 females for the simulated sound of the "larger" male. Preference was calculated as the number of looks toward the larger-male side minus the number of looks to the smaller-male side. Preference is positive if a female looked most often toward the larger male, and it is negative if she looked most often in the direction of the smaller male.

−2, 2, 6, 9, 13, 2, 5, 7, 2, −6, 4, 3, 2, 6, −6

a. Draw a graph to visualize the frequency distribution. What is the trend in female preference?
b. Do females pay attention to body size cues in simulated male sounds? Carry out a test, making all necessary assumptions.

18. The mating habits of threespine stickleback (a fish) have been studied intensively. One experiment examined whether fish are more likely to mate with a member of the opposite sex that was similar in body size to themselves, rather than with fish that were different in size (McKinnon et al. 2004). The mating preferences were measured in nine different fish populations, and the preference was measured by an index that is zero if the population shows no preference for mating by size, positive if the population contains fish that prefer to mate with fish of a different size, and negative if the fish mate preferentially with individuals of the same size. Notice that the independent data points here are the indices for each fish population. The nine indices are as follows:

−32.0, −29.8, −40.6, −90.8, −29.2, −28.8, −78.4, −59.2, −74.3

Assuming normality, test the hypothesis that, on average, sticklebacks do not prefer to mate differently by size. What can you conclude from these data?

19. Pit vipers (including rattlesnakes) have a pit organ located halfway between their eyes and nostrils. These organs detect the body heat of unlucky warm-blooded prey (mammals), but the snakes also use the pit organ to detect cooler spots in the environment to help in their thermoregulation. Researchers determined that western diamondback rattlesnakes had on average a 73% chance of moving into the right half of a Y-maze that was cooled to 30°C, instead of the left half, which was at 40°C (Krochmal et al. 2004). Was this pattern a preference for the right side of the maze (which just happened to be cooled), or was it a direct response to the difference in heat? To test this, five snakes were put into the same maze several times individually, and the percentage of trials in which they turned right was recorded when both sides were heated to 40°C. The average of the percentages for the five snakes was 47%, and the standard deviation was 13%.

a. Is there a preference for the right side of the maze when the temperature is equalized? Test the null hypothesis that the mean percentage of right turns is 50% in this population. Assume that the distribution of scores was normal.
b. Can you think of a way to design the experiment to test effects of temperature so that side-preferences, if present, do not affect the outcome?

20.
In Seychelles warblers, young adult females known as subordinates sometimes hang around the territories of older birds. Sometimes these subordinates help feed the offspring of the older birds, and sometimes not. In one study, subordinate birds that did not help were genetically assessed to discover whether they were related to the offspring of the older birds (Richardson et al. 2003). A "relatedness" score was assigned to each subordinate, with a value of zero meaning no relationship to the offspring. Out of five subordinates examined, the mean relatedness was −0.05 with standard deviation 0.45 (relatedness can be negative if by chance sampled individuals are less related than the population as a whole). What is the 95% confidence interval for relatedness of unhelpful subordinates to the offspring?

21. Hurricanes Katrina and Rita caused the flooding of large parts of New Orleans, leaving behind large amounts of new sediment. Before the hurricanes, the soils in New Orleans were known to have high concentrations of lead, which is a dangerous environmental toxin. Forty-six sites had been monitored before the hurricanes for soil lead content, as measured in mg/kg, and the soil from each of these sites was measured again after the hurricanes (Zahran et al. 2010). The data given below show the log of the ratio of the soil lead content after the hurricanes and the soil lead content before the hurricanes; we'll call this variable the "change in soil lead." (Therefore, numbers less than zero show a reduction in soil lead content after the hurricanes, and numbers greater than zero show increases.) This log ratio has an approximately normal distribution.

−0.83, −0.18, 0.14, −1.46, −0.48, −1.04, 0.25, −0.34, −0.81, −0.83, −0.60, 0.34, −0.75, 0.37, 0.26, 0.46, −0.03, −0.32, −0.53, −1.55, −0.90, −0.95, −0.13, −0.75, 0.59, −0.06, 0.39, −0.40, −0.84, −0.56, 0.44, 0.18, 0.28, −0.41, −0.26, 0.64, −0.51, −0.36, 0.49, 0.21, 0.17, 0.13, −0.63, −1.24, 0.57, −0.78

a. Draw a graph of the data, following recommended principles of good graph design (Chapter 2). What trend is suggested?
b. Determine the most-plausible range of values for the mean change in soil lead. Describe in words what the nature of that change is. Is an increase in soil lead consistent with the data? Is a decrease in soil lead consistent?
c. Test whether mean soil lead changed after the hurricanes.

22. Refer to Assignment Problem 21. In the same study, Zahran et al. (2010) also measured the lead concentration of blood (in µg/dl) of children living in the 46 areas both before and after the hurricanes. The ratio of blood lead concentration after the hurricanes to that before the hurricanes is given below. This ratio has an approximately normal distribution. A ratio of 1.0 indicates no change.

0.67, 0.57, 0.88, 0.95, 0.60, 0.98, 0.64, 0.48, 0.26, 0.77, 0.87, 0.34, 0.56, 0.88, 0.77, 0.67, 0.59, 0.56, 0.54, 0.64, 0.90, 0.00, 0.53, 0.63, 0.57, 0.70, 0.41, 0.42, 0.69, 0.45, 0.67, 0.50, 0.60, 1.16, 0.33, 0.63, 0.63, 1.00, 1.05, 0.67, 1.00, 1.00, 0.75, 1.00, 1.00, 0.63

a. Determine the most-plausible range of values for the mean change in blood lead ratio. Describe in words the nature of that change. Is a ratio greater than one consistent with the data? Is a decrease in blood lead consistent?
b. Is this ratio significantly different from one? Show the appropriate hypothesis test.

23. The evolution of blind cave forms of the fish Astyanax mexicanus is associated with large reductions in the amount of time spent sleeping.
The eyed, surface forms sleep about 800 minutes per 24-hour day (about 13 hours). The accompanying graph shows the frequency distribution of sleep times per 24 hours for 23 blind individuals from a single cave population (Duboué et al. 2011). The sample mean is 129.4 minutes and the standard deviation is 147.2 minutes. Assume that the sample is a random sample.

a. Using the formula in Section 11.7, calculate a 95% confidence interval for mean sleep time in the cave population.
b. Use the graph to evaluate whether the assumptions of the confidence interval are likely to be met. Explain your answer.
c. In light of your answer in part (b), can we trust the confidence interval you calculated in part (a)?

24. Without external cues such as the sun, people attempting to walk in a straight line tend to walk in circles. One idea is that most individuals have a tendency to turn in one direction because of internal physiological asymmetries, or because of differences between legs in length or strength. Souman et al. (2009) tested for a directional tendency by blindfolding 15 participants in a large field and asking them to walk in a straight line. The numbers below are the median change in direction (turning angle) of each of the 15 participants, measured in degrees per second. A negative angle refers to a left turn, whereas a positive number indicates a right turn.

−5.19, −1.20, −0.50, −0.33, −0.15, −0.15, −0.15, −0.07, 0.02, 0.02, 0.28, 0.37, 0.45, 1.76, 2.80

a. Draw a graph showing the frequency distribution of the data. Is a trend in the mean angle suggested?
b. Do people tend to turn in one direction (e.g., left) more on average than the other direction (e.g., right)? Test whether the mean angle differs from zero.
c. Based on your results in part (b), is the following statement justified? "People do not have a tendency to turn more in one direction, on average, than the other direction." Explain.

25. Functionally important traits in animals tend to vary little from one individual to the next within populations, possibly because individuals that deviate too much from the mean die sooner or leave fewer offspring in the long run. If so, does variance in a trait rise after it becomes less functionally important? Billet et al. (2012) investigated this question with the semicircular canals (SC) of the inner ear of the three-toed sloth (Bradypus variegatus). Sloths move very slowly and infrequently, and the authors suggested that this behavior reduces the functional demands on the SC, which usually provide information on angular head movement to the brain. Indeed, the motion signal from the SC to the brain may be very weak in sloths as compared to faster-moving animals. The following numbers are measurements of the ratio of the length to the width of the anterior semicircular canals in seven sloths. Assume that this represents a random sample.

1.53, 1.06, 0.93, 1.38, 1.47, 1.20, 1.16

a. In related, faster-moving animals, the standard deviation of the ratio of the length to the width of the anterior semicircular canals is known to be 0.09. What is the estimate of standard deviation of this measurement in three-toed sloths?
b. Based on these data, what is the most-plausible range of values for the population standard deviation in the three-toed sloth? Does this range include the known value of the standard deviation in related, faster-moving species?
c. What additional assumption is required for your answer in (b)? What do you know about how sensitive the confidence interval calculation is when the assumption is not met?
12 Comparing two means

Galápagos marine iguana

Biological data are often gathered to compare different group or treatment means. Do female hyenas differ from male hyenas in body size? Do patients treated with a new drug live significantly longer than those treated with the old drug? Do students perform better on tests if they stay up late studying or get a good night's rest?

In Chapter 8, we presented methods to compare proportions of a categorical variable between different groups. In this chapter, we develop procedures for comparing means of a numerical variable between two treatments1 or groups. We also include methods to compare two variances. All of the methods in the current chapter assume that the measurements are normally distributed in the populations.

We show analyses for two different study designs. In the paired design, both treatments have been applied to every sampled unit, such as a subject or a plot, at different times or on different sides of the body or of the plot. In the two-sample design, each group constitutes an independent random sample of individuals. In both cases, we make use of the t-distribution to calculate confidence intervals for the mean difference and test hypotheses about means. However, the two different study designs require alternative methods for analysis. We elaborate on the reasons in Section 12.1.

Paired sample versus two independent samples

There are two study designs to measure and test differences between the means of two treatments. To describe them, let's use an example: does clear-cutting a forest affect the number of salamanders present? Here we have two treatments (clear-cutting/no clear-cutting), and we want to know if the mean of a numerical variable (the number of salamanders) differs between them. "Clear-cut" is the treatment of interest, and "no clear-cut" is the control. This is the same as asking whether these two variables, treatment (a categorical variable) and salamander number (a numerical variable), are associated.

We can design this kind of a study in either of two ways: a two-sample design (see the left panel in Figure 12.1-1) or a paired design (see the right panel in Figure 12.1-1). In the two-sample design, we take a random sample of forest plots from the population and then randomly assign either the clear-cut treatment or the no-clear-cut treatment to each plot. In this case, we end up with two independent samples, one from each treatment. The difference in the mean number of salamanders between the clear-cut and no-clear-cut areas estimates the effect of clear-cutting on salamander number.2

FIGURE 12.1-1 Alternative designs to compare two treatments. A two-sample design is on the left; the paired design is on the right. Freestanding blocks represent sampling units, such as plots. The red and gold areas represent different treatments (e.g., clear-cut and not clear-cut). In the paired design (right), both treatments are applied to every unit. In the two-sample design (left), different treatments are applied to separate units.

In the paired design, we take a random sample of forest plots and clear-cut a randomly chosen half of each plot, leaving the other half untouched. Afterward, we count the number of salamanders in each half. The mean difference between the two sides estimates the effect of clear-cutting.

Each of these two experimental designs has benefits, and both address the same question. However, they differ in an important way that affects how the data are analyzed.
In the paired design, measurements on adjacent plot-halves are not independent. This is because they are likely to be similar in soil, water, sunlight, and other conditions that affect the number of salamanders. Because of these similarities, we can control for some of the noise between plots and see more clearly the effects of the treatment. As a result, we must analyze paired data differently than when every plot is independent of all the others, as in the case of the two-sample design.

In the paired design, both treatments are applied to every sampled unit. In the two-sample design, each treatment group is composed of an independent, random sample of units.

Paired designs are usually more powerful than unpaired designs, because they control for a lot of the extraneous variation between plots or sampling units that sometimes obscures the effects we are looking for. It is easier to see a real difference between two treatments if nearly everything else is similar between sides of the same plot or sampling unit. Very often, though, a paired design is just not possible.

Paired comparison of means

The main advantage of the paired design is that it reduces the effects of variation among sampling units that has nothing to do with the treatment itself. This feature increases the power and the precision of estimates. For example, forest or agricultural plots likely differ greatly in their local environmental features, and this variation can make it difficult to detect a difference between the effects of two treatments applied to separate plots. The paired design reduces the impact of this variation by applying both treatments to different sides of all plots, making it easier to detect a real difference between treatments. Here are some other examples of paired study designs:

■ Comparing patient weight before and after hospitalization
■ Comparing fish species diversity in lakes before and after heavy metal contamination
■ Testing effects of sunscreen applied to one arm of each subject compared with a placebo applied to the other arm
■ Testing effects of smoking in a sample of smokers, each of whom is compared with a nonsmoker closely matched by age, weight, and ethnic background
■ Testing effects of socioeconomic condition on dietary preferences by comparing identical twins raised in separate adoptive families that differ in their socioeconomic conditions

The last two examples (the effects of smoking and socioeconomic condition) show that even two different individuals can constitute a pair if they are similar due to shared physical, environmental, or genetic characteristics.

In a paired study design, the sampling unit is the pair. We must therefore reduce the two measurements made on each pair down to a single number: the difference between the two measurements made on each sampling unit (e.g., patient, lake, subject, matched pair, or twins). This step correctly yields only as many data points as there are randomly sampled units. Thus, if 20 individuals are grouped into 10 pairs, there are 10 measurements of the difference between the two treatments. Ten would be the sample size. We can estimate and test the effect of treatment using the mean of the differences.

Paired measurements are converted to a single measurement by taking the difference between them.

Estimating mean difference from paired data

We now describe how to estimate mean differences and calculate confidence intervals for those estimates.
This method assumes that we have a random sample of pairs and that the differences between members of each pair have a normal distribution. We'll use Example 12.2 to show the concepts and the calculation.

EXAMPLE 12.2 So macho it makes you sick?

In many species, males are more likely to attract females if the males have high testosterone levels. Are males with high testosterone paying a cost for this extra mating success in other ways? One hypothesis is that males with high testosterone might be less able to fight off disease; that is, their high levels of testosterone might reduce their immunocompetence. To test this idea, Hasselquist et al. (1999) experimentally increased the testosterone levels of 13 male red-winged blackbirds by surgically implanting a small permeable tube filled with testosterone. They measured immunocompetence as the rate of antibody production in response to a nonpathogenic antigen in each bird's blood serum both before and after the implant. The antibody production rates were measured optically, in units of log 10⁻³ optical density per minute (ln[mOD/min]).

The graph in Figure 12.2-1 shows that there is considerable variation among birds in their natural antibody production rates and that antibody production went up after the implant in some birds but went down in others. What is the mean difference between the two treatments? We can address this question by constructing a confidence interval for the mean change in antibody production.

FIGURE 12.2-1 Immunocompetence of 13 red-winged blackbirds before and after testosterone implants. Immunocompetence was measured as the log of the rate of antibody production in response to an antigen (original units in mOD/min). The two measurements from each bird are connected by a line segment.

The first step is to calculate the difference in antibody production for each male. The data, listed in Table 12.2-1, consist of a pair of measurements for each male: antibody production before the testosterone implant and antibody production after the implant.

TABLE 12.2-1 Antibody production rate in blackbirds before and after testosterone implants. Each bird is represented by a single row and has a pair of antibody measurements; d is the difference ("after" minus "before") between the pair of measurements.

Male            Before implant:       After implant:
identification  Antibody production   Antibody production      d
number          (ln[mOD/min])         (ln[mOD/min])
 1                   4.65                  4.44              −0.21
 4                   3.91                  4.30               0.39
 5                   4.91                  4.98               0.07
 6                   4.50                  4.45              −0.05
 9                   4.80                  5.00               0.20
10                   4.88                  5.00               0.12
15                   4.88                  5.01               0.13
16                   4.78                  4.96               0.18
17                   4.98                  5.02               0.04
19                   4.87                  4.73              −0.14
20                   4.75                  4.77               0.02
23                   4.70                  4.60              −0.10
24                   4.93                  5.01               0.08

We calculated the difference between measurements within each male i as

d_i = (antibody production of male i after) − (antibody production of male i before).

For bird 1, for example, d₁ = 4.44 − 4.65 = −0.21. It doesn't much matter whether we subtract the "after" measurement from the "before" measurement (as in Table 12.2-1) or vice versa, but it is critical that we calculate the differences the same way for each individual. A histogram of the resulting differences is shown in Figure 12.2-2.

FIGURE 12.2-2 A histogram of the differences in antibody production in male blackbirds before and after testosterone implants.

We then find the sample mean difference (call it d̄), the sample standard deviation of the differences (s_d), and the sample size (n):

d̄ = 0.056, s_d = 0.159, and n = 13.
The trend was for immunocompetence (measured by the rate of antibody production) to go up after the testosterone implant, not down as predicted by the hypothesis.

The confidence interval for the mean of a paired difference is generated in the same way as the confidence interval for any other mean (see Section 11.2). The confidence interval for the mean difference (µd) is

\bar{d} - t_{\alpha(2),df}\,SE_{\bar{d}} < \mu_d < \bar{d} + t_{\alpha(2),df}\,SE_{\bar{d}},

where SE_d̄ = s_d/√n is the standard error of the mean difference. Note that n is the number of pairs, not the total number of measurements, because pairs are the independent sampling unit. For this reason, we are carrying out the analysis on the differences. For the blackbird data, we calculate

SE_{\bar{d}} = \frac{0.159}{\sqrt{13}} = 0.044.

With n = 13, we have df = 12, so we look in Statistical Table C to find t_{0.05(2), 12} = 2.18. Thus, the 95% confidence interval for the mean difference between antibody production before and after testosterone implants is

\bar{d} - t_{\alpha(2),df}\,SE_{\bar{d}} < \mu_d < \bar{d} + t_{\alpha(2),df}\,SE_{\bar{d}}
0.056 - 2.18(0.044) < \mu_d < 0.056 + 2.18(0.044)
-0.040 < \mu_d < 0.152.

In other words, the most-plausible range for the true mean difference is between −0.040 and 0.152 ln[mOD/min]. While this span includes zero, it is also consistent with the possibility of a modest drop in immunocompetence following testosterone implant. A larger sample of individuals would be needed to narrow the interval further.

Paired t-test

The paired t-test is used to test a null hypothesis that the mean difference of paired measurements equals a specified value. The method tests differences when both treatments have been applied to every sampling unit and the data are therefore paired. The paired t-test is straightforward once we reduce the two measurements made on each pair down to a single number: the difference between the two measurements. This difference is then analyzed in the same way as a regular one-sample t-test, as described in Chapter 11.

We'll continue to analyze the blackbird testosterone data from Example 12.2 to ask whether the antibody production in blackbirds changed significantly after testosterone implants. Because the data are paired, a paired t-test is appropriate, with before testosterone and after testosterone representing the two "treatments." The null and alternative hypotheses are as follows.

H0: The mean change in antibody production after testosterone implants was zero.
HA: The mean change in antibody production after testosterone implants was not zero.

The alternative hypothesis is two-tailed, because either greater or lesser antibody production after the testosterone implants would reject the null hypothesis. These hypothesis statements could also be written as

H0: µd = 0 and HA: µd ≠ 0,

where µd is the population mean difference between treatments. Again, the first step is to calculate the difference in antibody production before and after the implants, which we have already done in Table 12.2-1. We then need the sample mean (d̄ = 0.056) and standard error (SE_d̄ = 0.044) of the differences, which we calculated in the previous subsection. From here on, the paired t-test is identical to a one-sample t-test on the differences. We can calculate the t-statistic as

t = \frac{\bar{d} - \mu_{d0}}{SE_{\bar{d}}},

where µd0 is the population mean of d proposed by the null hypothesis (0 in this example), and SE_d̄ is the sample standard error of d. Under the null hypothesis, this t-statistic has a t-distribution with df = n − 1 degrees of freedom. When this formula is applied to the blackbird testosterone data,

t = \frac{0.056 - 0}{0.044} = 1.27.

This test statistic has df = 13 − 1 = 12.
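Software treats the paired t-test exactly this way: as a one-sample test on the differences. A sketch in Python using the data in Table 12.2-1 (array names before and after are ours):

import numpy as np
from scipy import stats

before = np.array([4.65, 3.91, 4.91, 4.50, 4.80, 4.88, 4.88,
                   4.78, 4.98, 4.87, 4.75, 4.70, 4.93])
after  = np.array([4.44, 4.30, 4.98, 4.45, 5.00, 5.00, 5.01,
                   4.96, 5.02, 4.73, 4.77, 4.60, 5.01])
# A one-sample t-test on the differences...
print(stats.ttest_1samp(after - before, popmean=0))
# ...gives the same result as the built-in paired t-test:
print(stats.ttest_rel(after, before))   # t about 1.27, P about 0.23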
The P-value for the test is

\[ P = \Pr[t_{12} < -1.27] + \Pr[t_{12} > 1.27] = 2\,\Pr[t_{12} > 1.27]. \]

Using a computer, we calculated this probability under a t-distribution with 12 df to be P = 0.23. P is greater than 0.05, so we do not reject the null hypothesis that µd = 0 with these data.

If we did this test without a computer handy, we could use Statistical Table C instead. In that case, we would find that the critical value for a two-tailed test with α = 0.05 and 12 degrees of freedom is t0.05(2), 12 = 2.18. Because t = 1.27 does not fall outside the critical limits of −2.18 and 2.18, we know that P > 0.05, and we do not reject the null hypothesis.

The mean difference that we saw (0.056 ln[mOD/min]) is in the direction of higher immune system function after testosterone implants, but we cannot reject the null hypothesis that the testosterone has no effect on immunocompetence. The confidence interval that we calculated in the previous subsection indicates that a broad range of values is consistent with these data. On the basis of this data set, we do not reject the null hypothesis, but we might want to study the problem further to resolve the issue more precisely.

Assumptions

The paired t-test and the method for calculating a confidence interval for a paired difference make the same assumptions as the one-sample methods described in Chapter 11:

■ The sampling units are randomly sampled from the population.
■ The paired differences have a normal distribution in the population.

The analysis makes no assumptions about the distribution of either of the two measurements made on each sampling unit. These measurements can have any distribution, as long as the difference between the two measurements is approximately normally distributed.

Two-sample comparison of means

We now present methods to analyze the difference between the means of two treatments or groups in the case of a two-sample design. In a two-sample design, the two treatments are applied to separate, independent samples from two populations. We illustrate the process using Example 12.3.

EXAMPLE 12.3 Spike or be spiked

The horned lizard Phrynosoma mcallii has many unusual features, including the ability to squirt blood from its eyes. The species is named for the fringe of spikes surrounding the head. Herpetologists recently tested the idea that long spikes help protect horned lizards from being eaten, by taking advantage of the gruesome but convenient behavior of one of their main predators, the loggerhead shrike (Lanius ludovicianus). The loggerhead shrike is a small predatory bird that skewers its victims on thorns or barbed wire, to save for later eating. The researchers identified the remains of 30 horned lizards that had been killed by shrikes and measured the lengths of their horns (Young et al. 2004). As a comparison group, they measured the same trait on 154 horned lizards that were still alive and well. They compared the mean horn lengths of the dead lizards with those of the living lizards. Histograms of the horn lengths of the two groups are shown in Figure 12.3-1. Summary statistics are listed in Table 12.3-1.

FIGURE 12.3-1 The frequency distributions of horn lengths for living and killed horned lizards. There are n1 = 154 live lizards and n2 = 30 killed lizards.

TABLE 12.3-1 Summary statistics for lizard horn lengths.

Lizard group | Sample mean Ȳ (mm) | Sample standard deviation s (mm) | Sample size n
Living | 24.28 | 2.63 | 154
Killed | 21.99 | 2.71 | 30

The lizards that were killed by shrikes are different individuals from the living lizards.
They are not paired in any way; instead, each treatment (living or dead) is represented by a separate sample of lizards, belonging to different groups. Therefore, we must analyze the differences by using two-sample methods. Figure 12.3-1 and Table 12.3-1 suggest that the dead lizards have shorter horns than the living lizards on average, as might be expected if shrikes avoid the longest horns. Next, we calculate a confidence interval for the difference, and we test whether the difference is real.

Confidence interval for the difference between two means

How much longer on average are the horns of the surviving lizards? The best estimate of the difference between two population means is the difference between the sample means, Ȳ1 − Ȳ2. Here, we will use Ȳ1 to refer to the sample mean of the live lizards, and Ȳ2 to refer to the sample mean of the dead lizards. Which group we call 1 and which we call 2 is arbitrary, but we have to be consistent with the labels throughout. We use subscripts to indicate the population or sample that a value comes from.

The method for confidence intervals makes use of the fact that, if the variable is normally distributed in both populations, then the sampling distribution for the difference between the sample means is also normal. Thus, the Student's t-distribution will be very helpful in describing the sampling properties of the standardized difference. First, we will need the standard error of Ȳ1 − Ȳ2, which is given by

\[ \mathrm{SE}_{\bar{Y}_1-\bar{Y}_2} = \sqrt{ s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right) }, \quad \text{where} \quad s_p^2 = \frac{df_1 s_1^2 + df_2 s_2^2}{df_1 + df_2}. \]

The quantity s_p² is called the pooled sample variance. It is a weighted average of the sample variances s1² and s2² (the squared standard deviations) of the two groups. The confidence interval formula assumes that the standard deviations (and variances) of the two populations are the same. The pooled variance s_p² uses the information from both samples to get the best estimate of this common population variance. The df1 and df2 terms refer to the degrees of freedom for the variances of the two samples:

df1 = n1 − 1 and df2 = n2 − 1,

where n1 and n2 are the sample sizes from the two populations.

The pooled sample variance s_p² is the average of the variances of the samples weighted by their degrees of freedom.

Because the sampling distribution of Ȳ1 − Ȳ2 is normal, the sampling distribution of the following standardized difference has a Student's t-distribution:

\[ t = \frac{(\bar{Y}_1 - \bar{Y}_2) - (\mu_1 - \mu_2)}{\mathrm{SE}_{\bar{Y}_1-\bar{Y}_2}}, \]

with total degrees of freedom equal to df = df1 + df2 = n1 + n2 − 2. From these two formulas, we can calculate the confidence interval for the difference between two population means:

\[ (\bar{Y}_1 - \bar{Y}_2) - t_{\alpha(2),\,df}\,\mathrm{SE}_{\bar{Y}_1-\bar{Y}_2} \;<\; \mu_1 - \mu_2 \;<\; (\bar{Y}_1 - \bar{Y}_2) + t_{\alpha(2),\,df}\,\mathrm{SE}_{\bar{Y}_1-\bar{Y}_2}. \]

Let's use the horned-lizard data to calculate the 95% confidence interval for the difference in horn length between the two groups of lizards. The place to start is with the summary statistics listed in Table 12.3-1. The difference in the means is Ȳ1 − Ȳ2 = 24.28 − 21.99 = 2.29 mm. After that, we start from the bottom: we'll need the pooled variance to calculate the standard error of Ȳ1 − Ȳ2, and we'll need the standard error to get the confidence interval for µ1 − µ2. The pooled sample variance, which depends on the number of degrees of freedom (df1 = 153 and df2 = 29), is

\[ s_p^2 = \frac{df_1 s_1^2 + df_2 s_2^2}{df_1 + df_2} = \frac{153(2.63^2) + 29(2.71^2)}{153 + 29} = 6.98. \]

The standard error of the difference between the two means is then

\[ \mathrm{SE}_{\bar{Y}_1-\bar{Y}_2} = \sqrt{ s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right) } = \sqrt{ 6.98 \left( \frac{1}{154} + \frac{1}{30} \right) } = 0.527. \]
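As a check on the arithmetic, here is a minimal sketch of the same pooled-variance and standard-error calculation in Python, starting from the summary statistics in Table 12.3-1 (the variable names are ours):

```python
import math

# Summary statistics from Table 12.3-1 (living = group 1, killed = group 2)
n1, s1 = 154, 2.63
n2, s2 = 30, 2.71

df1, df2 = n1 - 1, n2 - 1

# Pooled sample variance: the two variances weighted by degrees of freedom
sp2 = (df1 * s1**2 + df2 * s2**2) / (df1 + df2)

# Standard error of the difference between means; note that the sample
# sizes (n1, n2), not the degrees of freedom, appear in the denominators
se_diff = math.sqrt(sp2 * (1/n1 + 1/n2))

print(round(sp2, 2), round(se_diff, 3))   # 6.98 0.527
```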
One common mistake is to forget that the standard error equation uses the sample sizes, n1 and n2, not the numbers of degrees of freedom, in the denominators.

There are 154 + 30 − 2 = 182 degrees of freedom in total, so we look up the critical value of t for df = 182: t0.05(2), 182 = 1.97. (There is no row in Statistical Table C for df = 182, but t0.05(2), df = 1.97 for both df = 180 and df = 200, and so to this order of precision, t0.05(2), 182 = 1.97 as well.) By plugging these quantities into the formula, we find that the 95% confidence interval for the difference in mean horn length between the living and dead lizards is

\[ 2.29 - 1.97(0.527) \;<\; \mu_1 - \mu_2 \;<\; 2.29 + 1.97(0.527) \]
\[ 1.25 \;<\; \mu_1 - \mu_2 \;<\; 3.33. \]

Thus, the 95% confidence interval for µ1 − µ2 is from 1.25 to 3.33 mm. We can be reasonably confident that surviving lizards have longer horns than lizards killed by shrikes, by an amount somewhere between 1.25 and 3.33 millimeters.

Two-sample t-test

The two-sample t-test is the simplest method to compare the means of a numerical variable between two independent groups. Its most common use is to test the null hypothesis that the means of two populations are equal (or, equivalently, that the difference between the means is zero):

H0: µ1 = µ2
HA: µ1 ≠ µ2,

where µ1 is the population mean for the first of the two populations and µ2 is the mean for the second population. It doesn't matter which population we designate as population 1 and which as population 2, as long as we are consistent throughout the analysis.

The two-sample t-test uses the following test statistic based on the observed difference between the sample means:

\[ t = \frac{\bar{Y}_1 - \bar{Y}_2}{\mathrm{SE}_{\bar{Y}_1-\bar{Y}_2}}. \]

Provided that the assumptions are met, this t-statistic has a t-distribution under the null hypothesis with n1 + n2 − 2 degrees of freedom. The denominator of this formula, SE_{Ȳ1−Ȳ2}, is the standard error of the difference between the two sample means. This standard error is the same as that shown previously when discussing confidence intervals (p. 336):

\[ \mathrm{SE}_{\bar{Y}_1-\bar{Y}_2} = \sqrt{ s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right) }. \]

The pooled sample variance, s_p², was defined in the previous subsection. The null hypothesis is tested by comparing the observed t-value to the theoretical t-distribution with df = df1 + df2 = n1 + n2 − 2 degrees of freedom.

To apply the two-sample t-test to the horned lizard study, begin by writing the null and alternative hypotheses.

H0: Lizards killed by shrikes and living lizards do not differ in mean horn length (i.e., µ1 = µ2).
HA: Lizards killed by shrikes and living lizards differ in mean horn length (i.e., µ1 ≠ µ2).

The alternative hypothesis is two-sided.

Let's apply these equations to the lizard data to perform the two-sample t-test. Once again, the difference between the sample means of the two groups of lizards is Ȳ1 − Ȳ2 = 2.29 mm. The standard error of the difference, computed in the previous subsection, is SE_{Ȳ1−Ȳ2} = 0.527. Now we can calculate the test statistic, t. According to the null hypothesis, the two means are equal; that is, µ1 − µ2 = 0. With this information, we can find

\[ t = \frac{\bar{Y}_1 - \bar{Y}_2}{\mathrm{SE}_{\bar{Y}_1-\bar{Y}_2}} = \frac{2.29}{0.527} = 4.35. \]

We need to compare this test statistic with the distribution of possible values for t if the null hypothesis were true. The appropriate null distribution is the t-distribution, and it will have df1 + df2 = 153 + 29 = 182 degrees of freedom. The P-value is then

\[ P = 2\,\Pr[t > 4.35]. \]

Using a computer, we find that this t-value corresponds to P = 0.000023. Since P < 0.05, we reject the null hypothesis.
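The "computer" step can be reproduced with scipy, which can run the pooled two-sample t-test directly from the summary statistics. A minimal sketch (our own, for illustration):

```python
from scipy import stats

# Pooled (equal-variance) two-sample t-test from the summary statistics
# in Table 12.3-1: living lizards (group 1) vs. killed lizards (group 2)
result = stats.ttest_ind_from_stats(mean1=24.28, std1=2.63, nobs1=154,
                                    mean2=21.99, std2=2.71, nobs2=30,
                                    equal_var=True)
# t ~= 4.34, P ~= 0.00002; matches the hand calculation up to rounding
print(result.statistic, result.pvalue)
```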
We reach the same conclusion using Statistical Table C. The critical value for α = 0.05 for a t-distribution with 182 degrees of freedom is t0.05(2), 182 = 1.97. The t = 4.35 calculated from these data is much further into the tail of the distribution than this critical value, so we reject the null hypothesis. In fact, the t calculated for these data is further into the tail of the distribution than all values given in Statistical Table C, including those for α = 0.0001. From this information we may conclude that P < 0.0001.

Based on this study, we conclude that mean horn length differs between lizards killed by shrikes and living lizards. It is possible that shrikes avoid or are unable to capture the lizards with the longest horns. We can't be certain that this explains the difference in horn length, because this is an observational study. To infer causation, we would need to carry out a controlled experiment that manipulates lizard horn lengths.

Assumptions

The two-sample t-test and the confidence interval for a difference in means are based on the following assumptions:

■ Each of the two samples is a random sample from its population.
■ The numerical variable is normally distributed in each population.
■ The standard deviation (and variance) of the numerical variable is the same in both populations.

We've heard the first two assumptions before (they are required for the one-sample t-test), but the third assumption is new. The two-sample methods we have just discussed are fairly robust to violations of this assumption. With moderate sample sizes (i.e., n1 and n2 both greater than 30), the methods work well even if the standard deviations in the two groups differ by as much as threefold, as long as the sample sizes of the two groups are approximately equal. If there is more than a threefold difference in standard deviations, or if the sample sizes of the two groups are very different or less than 30 with some difference in standard deviations, then the two-sample t-test should not be used.

The two-sample t-test is also robust to minor deviations from the normal distribution, especially if the two distributions being compared are similar in shape. For example, in Example 12.3, the distribution of horn size is unlikely to be perfectly normal in either group, but in both cases the measurements do not differ greatly from a normal distribution. The t-test is robust to these kinds of minor deviations from normality, and its robustness improves as the sample sizes get larger. In Chapter 13, we discuss at greater length the robustness of the t-test to deviations from normality.

A two-sample t-test when standard deviations are unequal

An important assumption of the two-sample t-test is that the standard deviations of the two populations are the same. If this assumption cannot be met, then Welch's approximate t-test should be used instead of the two-sample t-test. Welch's t-test is similar to the two-sample t-test except that the standard error and degrees of freedom are computed differently. Similarly, it is possible to calculate a confidence interval for the difference in the means of the two groups using a formula that does not assume equal variance in the two groups. (See the Quick Formula Summary in Section 12.9 for the equations.) An example using Welch's t-test and Welch's confidence interval is discussed in Section 12.4.

Welch's t-test compares the means of two groups and can be used even when the variances of the two groups are not equal.
In Chapter 13, we describe additional ways to rescue situations in which the assumptions of normality and equal standard deviations are not met.

Using the correct sampling units

An assumption of the t-test, as with all other tests in this book, is that the samples being analyzed are random samples. Quite often, repeated measurements have been taken on each sampling unit. Because measurements made on the same sampling unit are not independent, they require special handling. One solution we've already mentioned is to summarize the data for each sampling unit by a single measurement. One of the biggest challenges when comparing groups is to identify correctly the independent units of replication. This decision determines not only what method is used, but it might also affect the conclusion reached, as Example 12.4 illustrates.

EXAMPLE 12.4 So long; thanks to all the fish

One of the greatest threats to biodiversity is the introduction of alien species from outside their natural range. These introduced species often have fewer predators or parasites in the new area, so they can increase in numbers and outcompete native species. Sometimes these species are introduced accidentally, but often they are introduced intentionally by humans. The brook trout, for example, is a species native to eastern North America that has been introduced into streams in the West for sport fishing. Biologists followed the survivorship of a native species, chinook salmon, in a series of 12 streams that either had brook trout introduced or did not (Levin et al. 2002). Their goal was to determine whether the presence of brook trout affected the survivorship of the salmon. In each stream, they released a number of tagged juvenile chinook and then recorded whether or not each chinook survived over one year. Table 12.4-1 summarizes the data.

TABLE 12.4-1 The numbers and proportion of chinook released and surviving in streams with and without brook trout. The study included 12 streams in total.

Brook trout | Number of salmon released | Number of salmon surviving | Proportion surviving
Present | 820 | 166 | 0.202
Present | 960 | 136 | 0.142
Present | 700 | 153 | 0.219
Present | 545 | 103 | 0.189
Present | 769 | 173 | 0.225
Present | 1001 | 188 | 0.188
Absent | 467 | 180 | 0.385
Absent | 959 | 178 | 0.186
Absent | 1029 | 326 | 0.317
Absent | 27 | 7 | 0.259
Absent | 998 | 120 | 0.120
Absent | 936 | 135 | 0.144
Total | 9211 | 1865 |

In all, 9211 salmon were released, of which 1865 survived and 7346 did not. A quick tally of the fish numbers by treatment yields the 2 × 2 table shown in Table 12.4-2.

TABLE 12.4-2 Number of salmon surviving and not surviving in each trout treatment.

 | Survived | Did not survive
Trout absent | 946 | 3470
Trout present | 919 | 3876

We would like to test whether the proportion of salmon surviving differed between trout treatments. What method shall we use? It is tempting to carry out a χ2 contingency test of association between treatment and survival. The problem with using the contingency test approach is that individual salmon are not a random sample. Rather, salmon are grouped by the streams in which they were released. If there is any inherent difference between the streams in the probability of survival, over and above any effects of brook trout, then two salmon from the same stream are likely to be more similar in their survival than two salmon picked at random. In this case, salmon from the same stream are not independent. To lump all the salmon together and analyze them with a contingency test is to commit the sin of pseudoreplication (see Interleaf 2).
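To see what the tempting analysis would report, here is a minimal sketch of the contingency test on Table 12.4-2 in Python; we show it only to illustrate the problem, since the surrounding text explains why its P-value cannot be trusted here.

```python
from scipy.stats import chi2_contingency

# The (inappropriate) pooled 2 x 2 table from Table 12.4-2,
# treating each salmon as if it were an independent data point
table = [[946, 3470],    # trout absent:  survived, did not survive
         [919, 3876]]    # trout present: survived, did not survive

chi2, p, df, expected = chi2_contingency(table)
print(chi2, p)   # small P-value: "rejects" H0, but the salmon are not independent
```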
The key to solving the problem lies in recognizing that the stream is the independently sampled unit, not the salmon, and there are only 12 streams, six per treatment. As a result, the fates of all the salmon within a stream must be summarized by a single measurement for analysis: namely, the proportion of salmon surviving (given in the last column of Table 12.4-1). This changes the data type, because we are no longer comparing frequencies in categories, but rather differences in the means of a numerical variable. A two-sample test of the difference between the means is therefore required.

Let's label the streams with trout present as group 1 and the streams without trout as group 2. The null and alternative hypotheses are as follows:

H0: The mean proportion of chinook surviving is the same in streams with and without brook trout (µ1 = µ2).
HA: The mean proportion of chinook surviving is different between streams with and without brook trout (µ1 ≠ µ2).

Sample means and standard deviations are listed in Table 12.4-3. The 95% confidence intervals are shown as "error bars" next to the data in Figure 12.4-1.

TABLE 12.4-3 Sample statistics for the proportion of salmon surviving in streams (Example 12.4), using the stream as the sampling unit.

Group | Sample mean | Sample standard deviation, si | Sample size, ni
Brook trout present | 0.194 | 0.0297 | 6
Brook trout absent | 0.235 | 0.1036 | 6

FIGURE 12.4-1 Proportion of chinook salmon surviving in streams with and without brook trout. Each open circle represents the measurement from a single stream. There are six streams of each type. Means and 95% confidence intervals are indicated with error bars to the right of the data.

Figure 12.4-1 and the summary statistics in Table 12.4-3 show that streams without introduced trout have a sample standard deviation more than three times that of streams with trout. This is a case where Welch's approximate t-test is appropriate (see Section 12.3); a sketch of the calculation follows below. Using Welch's t-test and a computer, we find that t = 0.93, df = 5, and the P-value for the test is P = 0.39. Hence, P > 0.05 and we cannot reject the null hypothesis. In other words, the data do not support the claim that the brook trout lower the survivorship of chinook salmon. Welch's 95% confidence interval for the difference in means between these two groups ranges from −0.07 to 0.15.
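Here is a minimal sketch of that computer calculation in Python, applying Welch's test to the 12 per-stream survival proportions from Table 12.4-1 (scipy uses the exact fractional degrees of freedom, where the hand method rounds down to df = 5):

```python
from scipy import stats

# Proportion of salmon surviving in each stream (Table 12.4-1)
trout_present = [0.202, 0.142, 0.219, 0.189, 0.225, 0.188]
trout_absent  = [0.385, 0.186, 0.317, 0.259, 0.120, 0.144]

# Welch's approximate t-test: equal_var=False drops the
# equal-standard-deviation assumption of the ordinary two-sample t-test
result = stats.ttest_ind(trout_absent, trout_present, equal_var=False)
print(round(result.statistic, 2), round(result.pvalue, 2))   # 0.93 0.39
```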
The appropriate analysis, in which salmon data within streams are reduced to a single measurement per stream, might seem like a waste of hard-earned data. We started with survival data on 9211 salmon but used only six measurements per treatment. The contingency analysis rejected H0, but the Welch's two-sample test did not! Have we thrown away data and lost power as a result? There are two answers to this question. First, if the raw data are not randomly sampled, then it is not legitimate to analyze them as though they were a random sample. You can't lose power that you never had. But the kinder, gentler answer is that the data are not wasted. By pooling several related data points into a single summary measure, such as a proportion, you obtain an increasingly reliable estimate of the true value for each sampling unit, such as a stream. As a result, little or no information is "lost."

The fallacy of indirect comparison

A common error when comparing two groups is to test each group mean separately against the same null hypothesized value, rather than directly comparing the two means with each other. The error might go something like this: "Since group 1 is significantly different from zero, but group 2 is not, group 1 and group 2 must be different from each other." We call this error the "fallacy of indirect comparison." Example 12.5 demonstrates that even papers published in scientific journals with the highest profile can make this mistake.

EXAMPLE 12.5 Mommy's baby, Daddy's maybe

Do babies look more like their fathers or their mothers? The answer matters because, in most cultures, fathers are more likely to contribute to child rearing if they are convinced that a child is their biological offspring. In this case, babies who resemble their dads more closely have an evolutionary advantage, because the resemblance provides stronger assurance of paternity, which leads to greater paternal care (mothers do not face the same uncertainty of maternity). Christenfeld and Hill (1995) tested this by obtaining pictures of a series of babies and their mothers and fathers. A photograph of each baby, along with photos of three possible mothers and of three possible fathers, was shown to a large number of volunteers. Each volunteer was asked to pick which woman and which man were the parents of the baby based on facial resemblance. The percentage of volunteers who correctly guessed a parent was used as the measure of a given baby's resemblance to that parent. If there were no facial resemblance of babies to parents, then the mean resemblance should be 33.3%, the percentage of correct guesses expected by chance. If babies did resemble a parent, then the mean resemblance should be greater than 33.3%. Figure 12.5-1 shows the means for each parent and the corresponding 95% confidence intervals.

FIGURE 12.5-1 Resemblance of babies to their biological mothers and fathers, as measured by the percentage of volunteers who correctly guessed the mother and father of each baby from facial photographs. Dots are means, and vertical lines (error bars) are the 95% confidence intervals. The null expectation is 33.3% (shown with the red horizontal line). n = 24 babies.

The null hypothesis of no resemblance (i.e., one-third correct guesses) was soundly rejected for fathers, and the null hypothesis of no resemblance was not rejected for mothers. So far, so good. However, the researchers concluded from these tests that babies therefore resembled their fathers more than they resembled their mothers. This is an indirect comparison. That is, both groups were tested against the same null expectation (i.e., 33.3%), but mothers and fathers were not directly compared with each other. If they had been, no significant difference would have been found.

The problem with this sort of indirect comparison can be understood by considering a more extreme hypothetical case. In the example shown in Figure 12.5-2, the mean of group 1 is significantly different from the null expectation (indicated by the dashed line), but the mean of group 2 is not. It is false to conclude, therefore, that group 1 has a larger mean than group 2. In this example, it is group 2 that has the larger sample mean!

FIGURE 12.5-2 The 95% confidence intervals for the means of two hypothetical groups. The dashed line represents the null hypothesized value for the means of both groups.

How, then, should mothers and fathers be compared? They should be compared directly, by testing whether the mean resemblance of babies to mothers differs from the mean resemblance to fathers.
If the null hypothesis of no difference is rejected, then, and only then, can we conclude that babies resemble one parent more than the other. Indirectly comparing two groups by comparing them separately to the same null hypothesized value will often lead you astray. This error is extremely common in the published scientific literature (Nieuwenhuis et al. 2011). Groups should always be compared directly to each other.

Comparisons between two groups should always be made directly, not indirectly by comparing both to the same null hypothesized value.

Interpreting overlap of confidence intervals

Scientific papers often report the means of two or more groups, along with their confidence intervals, but they might not test the difference between the means of the two groups with a two-sample t-test. How much information about whether group means differ significantly is contained in the amount of overlap between their separate confidence intervals? It turns out that reliable conclusions can be drawn only under the two scenarios depicted in panels (a) and (b) of Figure 12.6-1.

FIGURE 12.6-1 Each panel shows sample means of two groups with 95% confidence intervals. (a) The confidence intervals do not overlap; in this case, the null hypothesis of no difference between group means would be rejected. (b) The confidence interval for one group contains the sample mean of the other group; in this case, the null hypothesis of no difference would not be rejected. (c) The confidence intervals overlap, but neither includes the sample mean of the other group; in this case, we cannot be sure what the results of a two-sample t-test comparing the means would show.

If the 95% confidence intervals of two estimates do not overlap at all (as in Figure 12.6-1, panel [a]), then the null hypothesis of equal means would be rejected at the α = 0.05 level. If, on the other hand, the value of one of the sample means is included within the 95% confidence interval of the other mean (as in Figure 12.6-1, panel [b]), then the null hypothesis of equal means would not be rejected at α = 0.05. Between these two extremes, where the confidence intervals overlap but neither interval includes the sample mean of the other group (as in Figure 12.6-1, panel [c]), we cannot be sure what the outcome of a two-sample t-test would be from a simple inspection of the overlap in confidence intervals. A numerical illustration of this third scenario follows below.
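To make scenario (c) concrete, here is a small sketch using summary statistics we invented purely for illustration: the two 95% confidence intervals overlap, and neither contains the other group's sample mean, yet the direct two-sample t-test still rejects the null hypothesis.

```python
import math
from scipy import stats

# Hypothetical summary statistics (invented numbers, equal n and s)
n, s = 50, 10.0
mean1, mean2 = 50.0, 54.7

t_crit = stats.t.ppf(0.975, df=n - 1)     # critical t for each group's 95% CI
half_width = t_crit * s / math.sqrt(n)    # half-width of each group's CI

print(mean1 + half_width > mean2 - half_width)   # True: the CIs overlap
print(mean2 > mean1 + half_width)                # True: neither CI contains
                                                 # the other sample mean

# ...but the direct two-sample test still rejects H0 at alpha = 0.05:
res = stats.ttest_ind_from_stats(mean1, s, n, mean2, s, n, equal_var=True)
print(res.pvalue < 0.05)                         # True
```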
Comparing variances

Up to now we have focused on comparing group means, but often we want to test whether populations differ in the variability of measurements. Several techniques are available. Here we briefly describe two of these: the F-test and Levene's test. In both cases, we describe the tests without much detail. The Quick Formula Summary (Section 12.9) gives the formulas, and most statistical programs will perform these tests. Be warned, though: the F-test is highly sensitive to departures from the assumption that the measurements are normally distributed in the populations. It is therefore not recommended for most data, because real data often show some departure from normality. Nevertheless, we present it here because many researchers continue to use it, and you will encounter it in the literature and in statistics packages on the computer. Levene's test is a popular alternative that is more robust to departures from the assumption of normal populations, but it has lower power.

The F-test of equal variances

The F-test evaluates whether two population variances are equal. That is, it tests the null hypothesis

H0: σ1² = σ2²

against the alternative

HA: σ1² ≠ σ2²,

where σ1² is the variance (the squared standard deviation) of population 1 and σ2² is the variance of population 2. The test statistic is called F, and it is calculated from the ratio of the two sample variances:

\[ F = \frac{s_1^2}{s_2^2}. \]

If the null hypothesis were true, then F should be near one, deviating from it only by chance. Under the null hypothesis, the F-statistic has an F-distribution with the pair of degrees of freedom (n1 − 1, n2 − 1). The first number of the pair refers to the degrees of freedom of the top part (the numerator) of the F-ratio, and the second number is for the bottom part (the denominator) of the F-ratio. We present more details in the Quick Formula Summary (Section 12.9). The F-distribution is discussed more fully in Chapter 15.

The F-test to compare two variances assumes that the variable is normally distributed in both populations. Unfortunately, the test is highly sensitive (i.e., not robust) to this assumption. For example, the F-test will often falsely reject the null hypothesis of equal variance if the distribution in one of the populations is not normal. For this reason, the F-test is not recommended for general use.

Levene's test for homogeneity of variances

Several alternative methods also test the null hypothesis that the variances of two or more groups are equal. One of the best is Levene's test, which is available in many statistical packages on the computer. Levene's test assumes that the frequency distribution of measurements is roughly symmetrical within all groups, but it performs much better than the F-test when this assumption is violated. Levene's test has the further advantage that it can be applied to more than two groups; in fact, it can test the null hypothesis that multiple groups all have equal variances.

Levene's test works by first calculating the absolute value of the difference between each data point and the sample mean for its group. These quantities are called "absolute deviations." The method then tests for a difference between groups in the means of these absolute deviations. The test statistic is called W, and it too has an F-distribution under the null hypothesis of equal variances. The calculations are somewhat cumbersome, so we will not detail them here, but we give the formula in the Quick Formula Summary (Section 12.9). Most modern statistical programs will do a Levene's test, however, and we recommend that you use it when comparing the variances of two or more groups. The online Engineering Statistics Handbook is a good place to look for more information on Levene's test.
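Both tests are available in software. Here is a minimal sketch in Python with two invented samples (illustration only); scipy's levene with center='mean' uses absolute deviations from the group means, matching the description above.

```python
import statistics
from scipy import stats

# Two hypothetical samples (invented numbers, for illustration only)
group1 = [4.1, 5.3, 4.8, 5.9, 4.4, 5.1, 4.7]
group2 = [3.2, 6.8, 2.9, 7.4, 3.6, 6.1, 4.0]

# F-test: put the larger sample variance in the numerator and compare
# against the upper tail of the F-distribution (two-sided P = 2 x tail).
# With equal sample sizes, the order of the degrees of freedom is moot.
v1, v2 = statistics.variance(group1), statistics.variance(group2)
F = max(v1, v2) / min(v1, v2)
p_f = 2 * stats.f.sf(F, len(group1) - 1, len(group2) - 1)

# Levene's test, with absolute deviations taken from the group means
W, p_levene = stats.levene(group1, group2, center='mean')
print(round(F, 2), round(p_f, 3), round(W, 2), round(p_levene, 3))
```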
Summary

■ Two study designs are available to compare two treatments. In a paired design, both treatments are applied to every randomly sampled unit. In a two-sample design, treatments are applied to separate randomly sampled units.
■ Comparing two treatments in a paired design involves analyzing the mean of the differences between the two measurements of each pair. Comparing two treatments in a two-sample design involves analyzing the difference in means of two independent samples of measurements.
■ A test of the mean difference between two paired treatments uses the paired t-test.
■ Both the confidence interval for the mean difference and the paired t-test assume that the pairs are randomly chosen from the population and that the differences (di) have a normal distribution. These methods are robust to minor deviations from the assumption of normality.
■ The means of a numerical variable from two separate groups or populations can be compared with a two-sample t-test.
■ The two-sample t-test and the confidence intervals for the difference between the means assume that the variable is normally distributed in both populations and that the variance is the same in both populations. The methods are robust to minor deviations from these assumptions.
■ The pooled sample variance is the best estimate of the variance within groups, assuming that the groups have equal variance.
■ Welch's approximate t-test compares the means of two groups when the variances of the two groups are not equal.
■ Repeated measurements made on the same sampling unit are not independent and should be summarized for each sampling unit before further analysis.
■ Indirectly comparing two groups by comparing each of them separately to the same null hypothesized value will often lead you astray. Groups should always be compared directly to each other.
■ For variables that are normally distributed, variances of two groups can be compared with an F-test. The F-test, however, is highly sensitive to departures from the assumption of normal populations.
■ Levene's test compares the variances of two or more groups. It is more robust than the F-test to departures from the assumption of normality.

Quick Formula Summary

Confidence interval for the mean difference (paired data)

What does it assume? Pairs are a random sample. The difference between paired measurements is normally distributed.
Parameter: µd
Statistic: d̄
Degrees of freedom: n − 1
Formula: \( \bar{d} - t_{\alpha(2),\,df}\,\mathrm{SE}_{\bar{d}} < \mu_d < \bar{d} + t_{\alpha(2),\,df}\,\mathrm{SE}_{\bar{d}} \), where d̄ is the mean of the differences between members of each of the pairs, \( \mathrm{SE}_{\bar{d}} = s_d/\sqrt{n} \), s_d is the sample standard deviation of the differences, and n is the number of pairs.

Paired t-test

What is it for? To test whether the mean difference in a population equals a null hypothesized value, µd0.
What does it assume? Pairs are randomly sampled from a population. The differences are normally distributed.
Test statistic: t
Distribution under H0: The t-distribution with n − 1 degrees of freedom, where n is the number of pairs.
Formula: \( t = (\bar{d} - \mu_{d0})/\mathrm{SE}_{\bar{d}} \), where the terms are defined as for the confidence interval.

Standard error of the difference between two means

Formula: \( \mathrm{SE}_{\bar{Y}_1-\bar{Y}_2} = \sqrt{ s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right) } \), where s_p² is the pooled sample variance, \( s_p^2 = \frac{df_1 s_1^2 + df_2 s_2^2}{df_1 + df_2} \). The degrees of freedom are df1 = n1 − 1 and df2 = n2 − 1.

Confidence interval for the difference between two means (two samples)

What does it assume? Both samples are random samples. The numerical variable is normally distributed within both populations. The standard deviation of the distribution is the same in the two populations.
Degrees of freedom: n1 + n2 − 2
Statistic: Ȳ1 − Ȳ2
Parameter: µ1 − µ2
Formula: \( (\bar{Y}_1 - \bar{Y}_2) - t_{\alpha(2),\,df}\,\mathrm{SE}_{\bar{Y}_1-\bar{Y}_2} < \mu_1 - \mu_2 < (\bar{Y}_1 - \bar{Y}_2) + t_{\alpha(2),\,df}\,\mathrm{SE}_{\bar{Y}_1-\bar{Y}_2} \), where \( \mathrm{SE}_{\bar{Y}_1-\bar{Y}_2} \) is the standard error of the difference between means, as defined above.

Two-sample t-test

What is it for? Tests whether the difference between the means of two groups equals a null hypothesized value for the difference.
What does it assume? Both samples are random samples. The numerical variable is normally distributed within both populations. The standard deviation of the distribution is the same in the two populations.
Test statistic: t
Distribution under H0: The t-distribution with n1 + n2 − 2 degrees of freedom.
Formula: \( t = \frac{(\bar{Y}_1 - \bar{Y}_2) - (\mu_1 - \mu_2)_0}{\mathrm{SE}_{\bar{Y}_1-\bar{Y}_2}} \), where (µ1 − µ2)0 is the null hypothesized value for the difference between population means, and \( \mathrm{SE}_{\bar{Y}_1-\bar{Y}_2} \) is the standard error of the difference between means, as defined previously.

Welch's confidence interval for the difference between two means

What does it assume? Both samples are random samples. The numerical variable is normally distributed within both populations.
Statistic: Ȳ1 − Ȳ2
Parameter: µ1 − µ2
Formula: \( (\bar{Y}_1 - \bar{Y}_2) \pm t_{\alpha(2),\,df} \sqrt{ \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} } \), where

\[ df = \frac{ \left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^2 }{ \dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1} }, \]

rounded down to the nearest integer.

Welch's approximate t-test

What is it for? Tests whether the difference between the means of two groups equals a null hypothesized value when the standard deviations are unequal.
What does it assume? Both samples are random samples. The numerical variable is normally distributed within both populations.
Test statistic: t
Distribution under H0: The t-distribution. The number of degrees of freedom is fewer than in the case of the two-sample t-test; see the previous entry on Welch's confidence interval for the formula for df.
Formula: \( t = \frac{(\bar{Y}_1 - \bar{Y}_2) - (\mu_1 - \mu_2)_0}{ \sqrt{ \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} } } \).

F-test

What is it for? Tests whether the variances of two populations are equal.
What does it assume? Both samples are random samples. The numerical variable is normally distributed within both populations.
Test statistic: F
Distribution under H0: F-distribution with (n1 − 1, n2 − 1) degrees of freedom. H0 is rejected if F ≥ Fα(2), n1−1, n2−1, where Fα(2), n1−1, n2−1 is the critical value of the F-distribution corresponding to the pair of degrees of freedom (n1 − 1 and n2 − 1). Critical values of the F-distribution are provided in Statistical Table D. F is compared only to the upper critical value, because F, as computed below, always puts the larger sample variance in the top (the numerator) of the F-ratio.
Formula: \( F = s_1^2/s_2^2 \), where s1² is the larger sample variance and s2² is the smaller sample variance.

Levene's test

What is it for? Testing the difference between the variances of two or more populations.
What does it assume? All samples are random samples, and the distribution of the variable is roughly symmetrical within each population.
Test statistic: W
Distribution under H0: F-distribution with the pair of degrees of freedom k − 1 (for the numerator of W; see the formula below) and N − k (for the denominator of W), where k is the number of groups (two in the case of a two-sample test) and N is the total sample size (n1 + n2 in the case of two groups). H0 is rejected if W ≥ Fα(1), k−1, N−k, where Fα(1), k−1, N−k is the critical value of the F-distribution (see the preceding description of the F-test for an explanation of the critical value).
Formula:

\[ W = \frac{ (N - k) \sum_{i=1}^{k} n_i (\bar{Z}_i - \bar{Z})^2 }{ (k - 1) \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Z_{ij} - \bar{Z}_i)^2 }, \]

where \( Z_{ij} = |Y_{ij} - \bar{Y}_i| \) is the absolute value of the deviation between individual observation Y_ij (the jth data point in the ith group) and the sample mean for its group, Ȳ_i. Z̄_i is the mean of the Z_ij for the ith group, and Z̄ is the grand mean of all the Z_ij, calculated as the average of all the Z_ij regardless of group. n_i is the number of data points in the ith group, and k is the number of groups (two in the case of a two-sample test).

PRACTICE PROBLEMS

1. Calculation practice: Paired t-test.
Can the death rate be influenced by tax incentives? Kopczuk and Slemrod (2003) investigated this possibility using data on deaths in the United States in years in which the government announced it was changing (usually raising) the tax rate on inheritance (the estate tax). The authors calculated the death rate during the 14 days before, and the 14 days after, the changes in the estate tax rates took effect. The number of deaths per day for each of these periods is given in the following table. The data are illustrated in a strip chart (paired observations are connected by line segments).

Year | Death rate under higher estate tax rate | Death rate under lower estate tax rate
1917 | 22.21 | 24.93
1917 | 18.86 | 20.00
1919 | 28.21 | 29.93
1924 | 31.64 | 30.64
1926 | 18.43 | 20.86
1932 | 9.50 | 10.14
1934 | 24.29 | 28.00
1935 | 26.64 | 25.29
1940 | 35.07 | 35.00
1941 | 38.86 | 37.57
1942 | 28.50 | 34.79

Let's use a paired t-test to ask whether the death rate changed significantly after the estate tax rate change.
a. State the null and alternative hypotheses for this analysis.
b. Why might a paired t-test be an appropriate method to apply to this comparison?
c. For each change in the estate tax, calculate the difference in death rate between the higher and lower tax regimes.
d. What is the mean of this difference?
e. What is the standard deviation of the difference?
f. What is the sample size?
g. What is the standard error of the mean difference?
h. Calculate t for this test.
i. How many degrees of freedom will this paired t-statistic have?
j. What is the critical value for the test, corresponding to α = 0.05?
k. What is the P-value associated with the test statistic, and what is the conclusion from the test?
l. What scientific conclusion do you draw from these findings?

2. Calculation practice: Two-sample t-test. When your authors were growing up, it was thought that humans dreamed in black and white rather than color. A recent hypothesis is that North Americans did in fact dream in black and white in the era of black-and-white television and movies, but that we shifted back to dreaming in color after the introduction of color media. To test this hypothesis, Murzyn (2008) queried 30 older individuals who had grown up in the black-and-white era and 30 younger individuals who grew up with color media about their dreams. She recorded the percentage of color dreams for each individual. A mean of 68.4% of the younger peoples' dreams were in color (with a standard deviation of 31.8%). On average, 33.9% of the older individuals' dreams were in color, with a standard deviation of 36.9%. The scores were approximately normally distributed in each group. Is the difference between the two means statistically significant? We'll use a two-sample t-test for this comparison.
a. State the null and alternative hypotheses for this test.
b. What are the assumptions of a two-sample t-test? Do these data match these assumptions well enough?
c. What are the sample variances for each group?
d. How many degrees of freedom are associated with each group?
e. Calculate the pooled variance for these data.
f. What is the standard error of the difference between means?
g. Calculate t.
h. How many degrees of freedom does this t-statistic have?
i. What is the critical value of t with α = 0.05 for this test?
j. What is the most precise P-value that you can determine for this test?
k. What can you conclude about the difference between the older and younger people in how often they dream in color?
Interpret the conclusions of the test in terms of the original scientific question.

3. Calculation practice: Confidence interval for the difference between two means. Return to the data about the percentage of dreams in color as a function of age from Practice Problem 2. Determine a 95% confidence interval for the difference in mean percentage of color dreams between the older and younger groups.
a. If you have not done so already from the previous problem, calculate the standard error of the difference between means. (This should be the same as in part (f) of Practice Problem 2.)
b. How many degrees of freedom does the t-statistic for this analysis have?
c. For a 95% confidence interval, what value of α should we use?
d. Using Statistical Table C, find the critical value of t for this α with the correct number of degrees of freedom.
e. What is the observed difference between means (Ȳ1 − Ȳ2)?
f. Calculate the 95% confidence interval for the difference between population means.

4. For each of the following scenarios, the researchers are interested in comparing the mean of a numerical variable measured in two circumstances. For each, say whether a paired t-test or a two-sample t-test would be appropriate.
a. The weight of 14 patients before and after open-heart surgery.
b. The smoking rates of 14 men measured before and after a stroke.
c. The number of cigarettes smoked per day by 14 men who have had strokes compared with the number smoked by 14 men who have not had strokes.
d. The lead concentration upstream from five power plants compared with the levels downstream from the same plants.
e. The basal metabolic rate (BMR) of seven chimpanzees compared with the BMR of seven gorillas.
f. The photosynthetic rate of leaves in the crown of 10 Sitka spruce trees compared with the photosynthetic rate of leaves near the bottom of the same trees.
g. The photosynthetic rates of 10 randomly chosen Douglas-fir trees compared with 10 randomly chosen western red cedar trees.
h. The photosynthetic rate measured on 10 randomly chosen Sitka spruce trees compared with the rate measured on the western red cedar growing next to each of the Sitka spruce trees.

5. Dung beetles are one of the most common types of prey items for burrowing owls. The owls collect bits of large mammal dung and leave them around the entrance to their burrows, where they spend long hours waiting motionless for something tasty to be lured in. A research team wanted to know whether this dung actually attracted dung beetles or whether it had another use, such as to mask the odor of owl eggs from predators (Levey et al. 2004). They added dung to 10 owls' burrows, randomly chosen, and did not add dung to 10 other owl burrows. The researchers then counted the number of dung beetles consumed by the two types of owls over the next few days. The mean number of beetles consumed in the dung-addition group was 4.8, while the mean number was 0.51 in the control group. The standard deviations for the two groups were 3.26 and 0.89, respectively. What is an appropriate way to test for a difference in these two groups' beetle-capture rates?

6. Practice Problem 8 in Chapter 11 described an experiment that compared the testes sizes of four experimental populations of monogamous flies to four populations of polyandrous flies. The data are as follows:

Mating system | Testes area (mm²)
Monogamous | 0.83
Monogamous | 0.85
Monogamous | 0.82
Monogamous | 0.89
Polyandrous | 0.96
Polyandrous | 0.94
Polyandrous | 0.99
Polyandrous | 0.91
a. What is the difference in mean testes size for males from monogamous populations compared to males from polyandrous populations? What is the 95% confidence interval for this difference? Assume normality and equal variances.
b. Carry out a hypothesis test to compare the means of these two groups. What conclusions can you draw?

7. In garter snakes, some males emerging from overwintering dens mimic females by producing female pheromones. Males might mimic females to warm up: males tend to emerge sooner than females and warm up in the sun while they wait for the females; the females are surrounded by males soon after emergence, and soak up their warmth. A prediction based on this idea is that males that mimic females should be covered by more males than males that don't mimic females. Observations on newly emerging garter snakes in Manitoba found that, on average, 58% of a male's body was covered by other males if he emitted female pheromones (with standard deviation 28%, measured on 49 males). In comparison, 32 males that did not emit female pheromones had, on average, 25% of their bodies covered by other males, with standard deviation 24% (Shine et al. 2001).
a. On average, how much more covered by other males are female mimics compared with nonmimics? Give a 95% confidence interval for this parameter.
b. Test the hypothesis that female mimicry has no effect on the proportion of body coverage in these garter snakes. What assumptions are you making?

8. Spot the flaw. Bluegill sunfish, a species of freshwater fish, prefer to feed in the open water in summer, but, in the presence of predators, they tend to hide in the weeds near shore. A study compared the growth rate of bluegills that fed in the open water with the growth rate of bluegills that fed only in nearshore vegetation. "Open-water" and "nearshore" fish were both measured in eight lakes, and the mean growth rate of open-water fish was compared to the mean growth rate of nearshore fish using a two-sample t-test. What was done wrong in this study?

9. The astonishing diversity of cichlid fishes of Lake Victoria is maintained by the preferences of females for males of their own species. To understand how the species arose in the first place, it is important to know the genetic basis of this preference in females. Researchers crossed two species of cichlids, Pundamilia pundamilia and P. nyererei, and raised the F1 hybrids to adulthood. They measured the degree of preference by the female F1 fish for P. pundamilia males over P. nyererei males (Haesler and Seehausen 2005). They then crossed the F1 hybrids with each other to produce a second generation of hybrids (the F2), which they also raised to adulthood and measured with the same index of female preference. If a small number of genes are important in determining the preference, then the variance of the preference index will differ between these two generations (it will be highest in the F2 hybrids). The researchers measured preference on 20 F1 individuals and 33 F2 individuals. The results are given below. Assume preference has a normal distribution in both populations.

F2 hybrids: 0.380, 0.271, 0.211, 0.188, 0.157, 0.140, 0.131, 0.126, 0.126, 0.065, 0.048, 0.048, 0.024, 0.017, 0.017, 0.000, 0.000, −0.009, −0.009, −0.014, −0.032, −0.032, −0.044, −0.063, −0.068, −0.082, −0.082, −0.082, −0.143, −0.198, −0.300, −0.314, −0.348.

F1 hybrids: 0.114, 0.101, 0.080, 0.082, 0.080, 0.067, 0.054, 0.015, 0.015, 0.007, 0.007, −0.019, −0.019, −0.019, −0.024, −0.034, −0.049, −0.058, −0.099, −0.105.
a. Choose a type of graph and compare the frequency distributions of the female preference index in the F1 and F2 hybrids. What difference is suggested?
b. Calculate the variances of the female preference index in the two hybrid crosses. Do the numbers agree with your visual estimate in (a)?
c. Test whether the variance of the female preference index differs between the two crosses.

10. In most election years since 1960, a televised debate between the leading candidates for president has been influential in determining the outcome of the U.S. election. One analysis of the transcripts of these debates looked at the number of times that each candidate used the words "will," "shall," or "going to" as an indication of how many promises the candidate made. Also recorded was whether the candidate won the popular vote. (This is not always the same candidate who won the election. In 2000, for example, Al Gore won the popular vote, but George Bush attained the presidency.) The results are shown in the following table. Debates were not held in 1964, 1968, and 1972, and full transcripts were not available from 1984. Was the winner or loser significantly more likely to make promises, as measured by this index? Use an appropriate test.

Year | Candidate | Won (W) or lost (L) popular vote | Number of "will"s, "shall"s, and "going to"s
1960 | Kennedy | W | 163
1960 | Nixon | L | 122
1976 | Carter | W | 68
1976 | Ford | L | 32
1980 | Reagan | W | 19
1980 | Carter | L | 18
1988 | G. Bush | W | 111
1988 | Dukakis | L | 85
1992 | Clinton | W | 79
1992 | G. Bush | L | 75
1996 | Clinton | W | 56
1996 | Dole | L | 33
2000 | Gore | W | 68
2000 | G. W. Bush | L | 48
2004 | G. W. Bush | W | 176
2004 | Kerry | L | 149

11. Most bats are very poor at walking, but the vampire bat is an exception. It is not clear why bats are poor walkers from a mechanical perspective, although a leading hypothesis has been that their hind legs are too weak. A test of this hypothesis by Riskin et al. (2005) measured and compared the strength of the hind legs of an insectivorous bat that walks poorly (Pteronotus parnellii) and of the vampire bat (Desmodus rotundus). Six individual Pteronotus were measured, with an average hind-leg strength of 93.5 (in units of percent of body weight), with standard deviation 36.6. The mean hind-leg strength for six vampire bats was 69.3, with standard deviation 8.1.
a. Assuming that these measures of strength were normally distributed within groups, what is the appropriate test for comparing the mean hind-leg strength of these two species?
b. Look at the results closely. Without performing a hypothesis test, comment on the hypothesis that insufficient hind-leg strength is the reason that the insectivorous bat Pteronotus cannot walk well.

12. Ostriches live in hot environments, and they are normally exposed to the sun for long periods. Mammals in similar environments have special mechanisms for reducing the temperature of their brain relative to their body temperature. Fuller et al. (2003) tested whether ostriches could do the same. The body and brain temperatures of six ostriches were recorded under typical hot conditions. The results, in degrees Celsius, are as follows:

Ostrich | Body temperature | Brain temperature
1 | 38.51 | 39.32
2 | 38.45 | 39.21
3 | 38.27 | 39.20
4 | 38.52 | 38.68
5 | 38.62 | 39.09
6 | 38.18 | 38.94

a. Test for a mean difference in temperature between body and brain for these ostriches.
b. Compare the results to the prediction made from mammals in similar environments.

13. The following graphs are all based on random samples with more than 100 individuals. The red dots represent sample means.
a. Assume that the error bars extend two standard errors above and two standard errors below the sample means. For which graphs can we conclude that group 1 is significantly different from group 2?
b. Assume that the error bars mark 95% confidence intervals. For which graphs can we conclude that group 1 is significantly different from group 2?
c. Assume that the error bars extend one standard error above and one standard error below the sample means. For which graphs can we conclude that group 1 is significantly different from group 2?
d. Assume that the error bars extend two standard deviations above and two standard deviations below the sample means. For which graphs can we conclude that group 1 is significantly different from group 2?

14. Vertebrates are thought to be unidirectional in growth, with size either increasing or holding steady throughout life. Marine iguanas from the Galápagos (see the photo on the first page of this chapter) are unusual because they might actually shrink during low food periods caused by El Niño events (Wikelski and Thom 2000). During these low food periods, up to 90% of the iguana population can die from starvation. The following histogram plots the changes in body length of 64 surviving iguanas during the 1992–1993 El Niño event. The average change in length was −5.81 mm, with a standard deviation of 19.50 mm.
a. By how much did the iguanas shrink on average? Determine the most-plausible range of values for the change in mean length of marine iguanas during the El Niño event. What assumptions are you making?
b. Using your answer in part (a), what are some of the plausible values for the mean change in length over the El Niño event? Are the data consistent with a shrinking mean? Are they consistent with no change or even an increase in the mean?
c. How variable was the change in length among individual iguanas? Calculate the 95% confidence interval for the standard deviation of the change in length.
d. Test the hypothesis that length did not change on average during the El Niño event.

15. Red-winged blackbird males defend territories and attract females to mate and raise young there. A male protects nests and females on his territory both from other males and from predators. Males also frequently mate with females on adjacent territories. When males are successful in mating with females outside their territory, do they also attempt to protect these females and their nests from predators? An experiment measured the aggressiveness of males toward stuffed magpies placed on the territories adjacent to the males (Gray 1997). This aggressiveness was measured on a scale where larger scores indicate more aggression and lower scores less aggression. The aggressiveness scores were normally distributed. Later, the researchers used DNA techniques on chicks in nests to identify whether the male had mated with the female on the neighboring territory. They compared the aggressiveness scores of the males who had mated with the adjacent female to the scores of those who had not. The results are as follows:

Group | Mean aggressiveness score | Standard deviation of aggressiveness score | Sample size (n)
Mated with neighbor | 0.806 | 1.135 | 10
Did not mate with neighbor | −0.168 | 0.543 | 36

Test whether there are differences in the mean aggressiveness scores between the two groups of males. Are males aggressive to a different degree depending on whether they had mated with a neighboring female?
16. Mosquitoes find their victims in part by odor, so it makes sense to wonder whether what we eat and drink influences our attractiveness to mosquitoes. A study in West Africa (Lefèvre et al. 2010), working with the mosquito species that carry malaria, wondered whether drinking the local beer influenced attractiveness to mosquitoes. The researchers opened a container holding 50 mosquitoes next to each of 25 alcohol-free participants and measured the proportion of mosquitoes that left the container and flew toward the participants (they called this proportion the "activation"). They repeated this procedure 15 minutes after each of the same participants had consumed a liter of beer and measured the "change in activation" (after minus before). This procedure was also carried out on another 18 human participants who were given water instead of beer. The change in activation of mosquitoes is given for both the beer- and water-drinking groups:

Beer group: 0.36, 0.46, 0.06, 0.18, 0.25, 0.18, −0.06, −0.14, 0.12, 0.39, 0.17, −0.16, −0.05, 0.19, 0.25, 0.31, 0.17, −0.03, 0.23, −0.03, 0.26, 0.30, 0.11, 0.13, 0.21.

Water group: 0.04, 0, −0.08, −0.12, 0.201, −0.039, 0.10, 0.041, 0.02, 0.236, 0.05, 0.097, 0.122, −0.019, 0.021, −0.08, −0.165, −0.28.

a. Name three types of graphs that could be used to examine and compare the frequency distributions of the two samples. Choose one of these methods and construct a graph. What trend is suggested?
b. Test for a difference between the mean changes in mosquito activation between the beer-drinking and water-drinking groups.

17. Development of an effective vaccine against HIV has proven difficult. Even though the immune systems of infected individuals produce antibodies that neutralize circulating virus, the disease destroys the CD4 T cells producing those antibodies. Balazs et al. (2011) investigated a novel HIV treatment that bypasses the immune system. They used a special strain of "humanized" mice that carry human CD4 T cells, which are susceptible to HIV. Treatment mice received human antibody-producing genes, which were injected into leg muscle using a harmless virus. Control mice were injected with a reporter gene instead (luciferase). All mice were then injected with high doses of HIV. The data below record the percentage of healthy CD4 T cells remaining in the mice five weeks later. A high value indicates that many CD4 T cells remain, and hence that HIV has been neutralized, whereas a low value indicates that the mouse has succumbed to the disease.

Antibody treatment mice: 94, 96, 92, 88, 84, 81, 76, 54.
Control treatment mice: 20, 15, 11, 7, 3, 0.

a. Plot these data using a strip chart.
b. What are the most-plausible values for the means of each of the two treatments? Calculate 95% confidence intervals for the mean percentage in both treatments.
c. Add the confidence intervals from part (b) as error bars to your plot in part (a).
d. Examine the two confidence intervals in your graph. If the null hypothesis of no difference between means were tested with these data, would the null hypothesis be rejected? How do you know?

ASSIGNMENT PROBLEMS

18. The males of stalk-eyed flies (Cyrtodiopsis dalmanni) have long eye stalks. The females sometimes use the length of these eye stalks to choose mates. (See Example 11.2.) Is the male's eye-stalk length affected by the quality of its diet? An experiment was carried out in which two groups of male "stalkies" were reared on different foods (David et al. 2000).
ASSIGNMENT PROBLEMS

18. The males of stalk-eyed flies (Cyrtodiopsis dalmanni) have long eye stalks. The females sometimes use the length of these eye stalks to choose mates. (See Example 11.2.) Is the male's eye-stalk length affected by the quality of its diet? An experiment was carried out in which two groups of male "stalkies" were reared on different foods (David et al. 2000). One group was fed corn (considered a high-quality food), while the other was fed cotton wool (a food of substantially lower quality). Each male was raised singly and so represents an independent sampling unit. The eye spans (the distance between the eyes) were recorded in millimeters. The raw data, which are plotted as histograms at right, are as follows:

Corn diet: 2.15, 2.14, 2.13, 2.13, 2.12, 2.11, 2.10, 2.08, 2.08, 2.08, 2.04, 2.05, 2.03, 2.02, 2.01, 2.00, 1.99, 1.96, 1.95, 1.93, 1.89.
Cotton diet: 2.12, 2.07, 2.01, 1.93, 1.77, 1.68, 1.64, 1.61, 1.59, 1.58, 1.56, 1.55, 1.54, 1.49, 1.45, 1.43, 1.39, 1.34, 1.33, 1.29, 1.26, 1.24, 1.11, 1.05.

These data can be summarized as follows, where the corn-fed flies represent treatment group 1 and the cotton-fed flies represent treatment group 2.

                  Corn diet (group 1)   Cotton diet (group 2)
Mean (mm)         2.05                  1.54
Variance (mm²)    0.00558               0.0812
Sample size (n)   21                    24

a. What is the best test to use for comparing the means of the two groups? Why? b. Carry out the test identified in part (a), using α = 0.01.

19. Fruit flies, like almost all other living organisms, have built-in circadian rhythms that keep time even in the absence of external stimuli. Several genes have been shown to be involved in internal timekeeping, including per (period) and tim (timeless). Mutations in these two genes, and in other genes, disrupt timekeeping abilities. Interestingly, these genes have also been shown to be involved in other time-related behavior, such as the frequency of wingbeats in male courtship behaviors. Individuals that carry particular mutations of per and tim have been shown to copulate for longer than individuals that have neither mutation. But do these two mutations affect copulation time in similar ways? The following table summarizes some data on the duration of copulation for flies that carry either the tim mutation or the per mutation (Beaver and Giebultowicz 2004):

Mutation   Mean copulation duration (min)   SD of copulation duration   Sample size (n)
per        17.5                             3.37                        14
tim        19.9                             2.47                        17

a. Do these two mutations lead to different mean copulation durations? Carry out the appropriate test. b. Do the populations carrying these mutations have different variances in copulation duration?

20. Researchers studying the number of electric fish species living in various parts of the Amazon basin were interested in whether the presence of tributaries affected the local number of electric fish species in the main rivers (Fernandes et al. 2004). They counted the number of electric fish species above and below the entrance point of a major tributary at 12 different river locations. Here's what they found:

Tributary     Upstream number of species   Downstream number of species
Içá           14                           19
Jutaí         11                           18
Japurá        8                            8
Coari         5                            7
Purus         10                           16
Manacapuru    5                            6
Negro         23                           24
Madeira       29                           30
Trombetas     19                           16
Tapajós       16                           20
Xingu         25                           21
Tocantins     10                           12

a. What is the mean difference in the number of species between areas upstream and downstream of a tributary? What is the 95% confidence interval of this mean difference? b. Test the hypothesis that the tributaries have no effect on the number of species of electric fish. c. State the assumptions that you had to make to complete parts (a) and (b).
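One way to carry out parts (a) and (b) in Python, assuming SciPy (ttest_rel is SciPy's paired t-test; the variable names are ours):

import numpy as np
from scipy import stats

upstream = np.array([14, 11, 8, 5, 10, 5, 23, 29, 19, 16, 25, 10])
downstream = np.array([19, 18, 8, 7, 16, 6, 24, 30, 16, 20, 21, 12])

# Paired design: work with the difference at each river location
diff = upstream - downstream
n = len(diff)
se = np.std(diff, ddof=1) / np.sqrt(n)
tcrit = stats.t.ppf(0.975, df=n - 1)
print(diff.mean(), (diff.mean() - tcrit * se, diff.mean() + tcrit * se))

# Part (b): the paired t-test
print(stats.ttest_rel(upstream, downstream))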
21. Assignment Problem 20 from Chapter 11 discussed the relatedness of subordinate males to breeding females in the Seychelles warbler. Five subordinates that did not help feed the offspring of the older birds were measured for their relatedness to the offspring of the breeding females, with a mean relatedness of −0.05 and a standard deviation of 0.45. Another eight subordinates that did help feed younger offspring were also measured for their relatedness to the younger birds. For these eight, the mean relatedness was 0.27, with a standard deviation of 0.45. a. Are helpful and unhelpful subordinates different in their mean relatedness to the younger birds? Carry out an appropriate hypothesis test. b. Find the 95% confidence interval for the difference in mean relatedness for the two classes of subordinates.

22. In tilapia, an important freshwater food fish from Africa, the males actively court females. They have more incentive to court a female who still has eggs than a female who has already laid all of her eggs, but can they tell the difference? An experiment was done to measure the male tilapia's response to the smell of female fish (Miranda et al. 2005). Water containing feces from females that were either pre-ovulatory (they still had eggs) or post-ovulatory (they had already laid their eggs) was washed over the gills of males hooked up to an electro-olfactogram machine, which measured when the senses of the males were excited. The amplitude of the electro-olfactogram reading was used as a measure of the excitability of the males in the two different circumstances. Six males were exposed to the scent of pre-ovulatory females; their readings averaged 1.51 with a standard deviation of 0.25. Six different males exposed to post-ovulatory females averaged readings of 0.87 with a standard deviation of 0.31. Assume that the electro-olfactogram readings were approximately normally distributed within groups. a. Test for a difference in the excitability of the males with exposure to these two types of females. b. What is the estimated average difference in electro-olfactogram readings between the two groups? What is the 95% confidence limit for the difference between population means?

23. A baby dolphin is born into the ocean, which is a fairly cold environment. Water has a high heat conductivity, so the thermal regulation of a newborn dolphin is quite important. It has been known for a long time that baby dolphins' blubber is different in composition and quantity from the blubber of adults. Does this make the babies better protected from the cold compared to adults? One measure of the effectiveness of blubber is its "conductance." This value was calculated on six newborn dolphins and eight adult dolphins (Dunkin et al. 2005). The newborn dolphins had an average conductance of 10.44, with a standard error of the mean equal to 0.69. The adult dolphins' conductance averaged 8.44, with the standard error of this estimate equal to 1.03. All measures are given in watts per square meter per degree Celsius. a. Calculate the standard deviation of conductance for each group. b. Test the null hypothesis that adults and newborns do not differ in the conductance of their blubber.
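Part (a) of the dolphin problem needs only the relationship between the standard error and the standard deviation, SE = s/√n. A two-line sketch in Python (numbers are those given in the problem):

import numpy as np

# Rearranging SE = s / sqrt(n) gives s = SE * sqrt(n)
sd_newborn = 0.69 * np.sqrt(6)   # n = 6 newborn dolphins
sd_adult = 1.03 * np.sqrt(8)     # n = 8 adult dolphins
print(sd_newborn, sd_adult)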
24. Weddell seals live in the Antarctic and feed on fish during long, deep dives in freezing water. The seals benefit from these feeding dives, but the food they gain comes at a metabolic cost. The dives are strenuous. A set of researchers wanted to know whether feeding per se was also energetically expensive, over and above the exertion of a regular dive (Williams et al. 2004). They determined the metabolic cost of dives by measuring the oxygen use of seals as they surfaced for air after a dive. They measured the metabolic cost of 10 feeding dives and for each of these also measured a nonfeeding dive by the same animal that lasted the same amount of time. The data, in ml O₂ kg⁻¹, are as follows:

Individual   Oxygen consumption after nonfeeding dive   Oxygen consumption after feeding dive
1            42.2                                       71.0
2            51.7                                       77.3
3            59.8                                       82.6
4            66.5                                       96.1
5            81.9                                       106.6
6            82.0                                       112.8
7            81.3                                       121.2
8            81.3                                       126.4
9            96.0                                       127.5
10           104.1                                      143.1

a. Estimate the mean change in oxygen consumption during feeding dives compared with nonfeeding dives. b. What is the 99% confidence interval for the population mean change? c. Test the hypothesis that feeding does not change the metabolic costs of a dive.

25. Have you ever noticed that when you tear a fingernail, it tends to tear to the side and not down into the finger? (Actually, the latter doesn't bear too much thinking about.) Why might this be so? One possibility is that fingernails are tougher in one direction than another. Farren et al. (2004) compared the toughness of human fingernails along a transverse dimension (side to side) with toughness along a longitudinal direction, with 15 measurements of each. The toughness of fingernails along a transverse direction averaged 3.3 kJ/m², with a standard deviation of 0.95 kJ/m², while the mean toughness along the longitudinal direction was 6.2 kJ/m², with a standard deviation of 1.48 kJ/m². a. Test for a significant difference in the toughness of these fingernails along the two dimensions. Assume that the data are from two independent samples of 15 people. b. As it turns out, all of the fingernails in this study came from the same volunteer. How does this alter your conclusion from part (a)? Briefly, what steps would you take to design this study properly?

26. Hyenas, famously, laugh. (The technical term used by hyena biologists is "giggle.") Mathevon et al. (2010) investigated the information content of hyena giggles. In one analysis, they compared the giggles of pairs of hyenas, in which one member of each pair was the more dominant and the other socially subordinate. They measured the spectral variability of the hyena giggles using the coefficient of variation (CV) of sound spectrum features. Here are the data with these measures for each member of the pairs:

Spectral CV of dominant individual   Spectral CV of subordinate individual
0.384                                0.303
0.386                                0.317
0.252                                0.277
0.507                                0.415
0.569                                0.436
0.235                                0.451
0.226                                0.399
0.323                                0.220
0.287                                0.338

Do dominant and subordinate individuals differ in the means of giggle spectral CV?

27. Refer to Practice Problem 16. The accompanying data are the values of mosquito activation (fraction of 50 mosquitoes that left the container) recorded on the 25 participants before and after drinking one liter of the local beer. a. Construct a graph illustrating the changes in mosquito activation. What trend is suggested? b. Comment on the graph in relation to the assumptions needed to carry out a test. c. Test whether mean activation of mosquitoes changed after consumption of beer. d. How big is the effect of beer on activation? Construct a 95% confidence interval.
Subject   Activation before beer   Activation after beer
1         0.13                     0.49
2         0.13                     0.59
3         0.21                     0.27
4         0.25                     0.43
5         0.25                     0.50
6         0.32                     0.50
7         0.43                     0.37
8         0.44                     0.30
9         0.46                     0.58
10        0.50                     0.89
11        0.50                     0.67
12        0.50                     0.34
13        0.53                     0.48
14        0.54                     0.73
15        0.55                     0.80
16        0.55                     0.86
17        0.60                     0.77
18        0.60                     0.57
19        0.64                     0.87
20        0.65                     0.62
21        0.67                     0.93
22        0.70                     1.00
23        0.70                     0.81
24        0.79                     0.92
25        0.79                     1.00
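A sketch of parts (c) and (d) in Python, assuming SciPy; note that a paired t-test on the before/after pairs is the same as a one-sample t-test of the differences against zero:

import numpy as np
from scipy import stats

before = np.array([0.13, 0.13, 0.21, 0.25, 0.25, 0.32, 0.43, 0.44, 0.46,
                   0.50, 0.50, 0.50, 0.53, 0.54, 0.55, 0.55, 0.60, 0.60,
                   0.64, 0.65, 0.67, 0.70, 0.70, 0.79, 0.79])
after = np.array([0.49, 0.59, 0.27, 0.43, 0.50, 0.50, 0.37, 0.30, 0.58,
                  0.89, 0.67, 0.34, 0.48, 0.73, 0.80, 0.86, 0.77, 0.57,
                  0.87, 0.62, 0.93, 1.00, 0.81, 0.92, 1.00])

print(stats.ttest_rel(after, before))                # paired t-test
print(stats.ttest_1samp(after - before, popmean=0))  # same P-value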
28. Does holding a weapon increase your aggressiveness afterward in other situations? Klinesmith et al. (2006) investigated this question by assigning 30 male college students to one of two groups. One group of 15 men were given a facsimile handgun to hold for 15 minutes, whereas the 15 men in the other group were given a toy instead. All men were then asked to participate in a test of taste sensitivity. Each was given a glass of water with a single drop of hot sauce to taste. Each man was also asked to add as much hot sauce as he wanted to a new glass of water to be given to the next person. (These were not actually used on the next person.) The researchers measured how much hot sauce each man added, as a stand-in for aggression, because it correlates with the amount of pain inflicted on the next person. Do the two groups differ in the mean amount of hot sauce they add to the water? Here is a summary of the results:

Group          Sample size   Mean hot sauce added   SD of hot sauce added
Gun handlers   15            13.61                  8.35
Toy handlers   15            4.23                   2.62

a. Assuming that the amount of hot sauce added per person is normally distributed in each group, would an ordinary two-sample t-test be an appropriate test for this analysis? b. If not, what would be an appropriate method to use?

29. Refer to Practice Problem 17. How big is the estimated difference between the means of the antibody and control treatments? Use a confidence interval to calculate the most-plausible range of values for the difference in mean percent.

30. Spot the flaw. There are two types of males in bluegill sunfish. Parental males guard small territories, where they mate with females and take care of the eggs and young. Cuckolder males do not have territories or take care of young. Instead, they sneak in and release sperm when a parental male is spawning with a female, thereby fertilizing a portion of the eggs. A choice experiment was carried out on juvenile sunfish to test whether offspring from the two types of eggs (fertilized by parental male vs. fertilized by cuckolder male) are able to distinguish kin (siblings) from non-kin using odor cues. The researchers used a two-sample method to test the null hypothesis that fish are unable to discriminate between kin and non-kin. This null hypothesis was not rejected for offspring from parental males. However, the same null hypothesis was rejected for offspring from cuckolder males. The researchers concluded that offspring of cuckolder males are more able to discriminate kin from non-kin than are offspring of parental males. What is wrong with this conclusion? What analysis should have been conducted?

31. Rutte and Taborsky (2007) tested for the existence of "generalized reciprocity" in the Norway rat, Rattus norvegicus. That is, they asked whether a rat that had just been helped by a second rat would be more likely itself to help a third rat than if it had not been helped. Focal female rats were trained to pull a stick attached to a tray that produced food for their partners but not for themselves. Subsequently, each focal rat's experience was manipulated in two treatments. Under one treatment, the rat was helped by three unfamiliar rats (who pulled the appropriate stick). Under the other treatment, focal rats received no help from three unfamiliar rats (who did not pull the stick). Each focal rat was exposed to both treatments in random order. Afterward, each focal rat's tendency to pull for an unfamiliar partner rat was measured. The number of pulls in a given period (in pulls/min) by 19 focal female rats after both treatments is given below.

Focal rat   After receiving help   After receiving no help
10          0.43                   0.29
11          0.86                   0.14
12          0.57                   0.29
20          0.86                   0.57
30          0.29                   0.29
31          1.14                   0.71
32          0.57                   0.29
33          0.86                   0.86
34          1.43                   0.86
40          0.86                   0.86
41          0.57                   0.43
42          0.86                   1.00
43          0.00                   0.86
50          1.00                   0.43
51          0.86                   1.00
52          0.86                   0.00
60          1.86                   1.14
61          0.86                   0.71
62          0.86                   0.71

a. Draw a graph to illustrate the data. What trend is evident? b. What are the means of the two treatments, and what is the mean difference? c. Test whether a difference was detectable between the help and no-help treatments. d. Why is it important to apply the two treatments to the focal rats in random order?

32. Alcohol consumption is influenced by price and packaging, but what about glassware? Attwood et al. (2012) measured whether the time taken to drink a beer was influenced by the shape of the glass in which it was served. Participants were given 12 oz. (about 350 ml) of chilled lager and were told that they should drink it at their own pace while watching a nature documentary. The participants were randomly assigned to receive their beer in either a straight-sided glass or a curved, fluted glass. The data below are the total time in minutes to drink the glass of beer by the 19 women participants in the study.

Straight glass: 11.63, 10.37, 17.89, 6.96, 20.40, 20.64, 9.26, 18.11, 10.33, 23.54.
Curved glass: 7.46, 9.28, 8.90, 6.73, 8.25, 6.16, 13.09, 2.10, 6.37.

a. Show the data in a graph. What trend is suggested? Comment on other differences between the frequency distributions of the two samples. b. Test whether the mean total time to drink the beer differs depending on beer-glass shape. c. How much difference does it make? Provide a confidence interval of the difference. d. A second test of the same hypotheses but using the data from male participants yielded the following results: Straight glass: Y¯1 = 7.987, s1 = 2.459. Curved glass: Y¯2 = 6.930, s2 = 3.459. The test gave Y¯1 − Y¯2 = 1.057, sp² = 10.048, SE of Y¯1 − Y¯2 = 1.418, t = 0.746, df = 18, P = 0.466. Is the following conclusion from the tests valid? "There is a significantly greater effect of beer-glass shape on mean time to drink in women than in men." Explain.
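A sketch of part (b) of the beer-glass problem in Python, assuming SciPy's two-sample t-test:

import numpy as np
from scipy import stats

straight = np.array([11.63, 10.37, 17.89, 6.96, 20.40, 20.64, 9.26,
                     18.11, 10.33, 23.54])
curved = np.array([7.46, 9.28, 8.90, 6.73, 8.25, 6.16, 13.09, 2.10, 6.37])

# Two-sample t-test for a difference in mean drinking time
print(stats.ttest_ind(straight, curved))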
33. Spinocerebellar ataxia type 1 is a neurodegenerative disease marked by the gradual loss of motor skills and culminating in early death. It is caused by an expanded CAG repeat in the coding region of the Ataxin-1 gene. Fryer et al. (2011) investigated the possible beneficial effects of exercise in treating the disease. They used a mild exercise regimen in a mouse model of the disease (a mouse strain in which an expanded CAG repeat was "knocked in" to the mouse version of the same gene, and that had similar symptoms). The life spans (in days) are given below for six exercised mice and six mice not given the exercise regimen. The data and 95% confidence intervals are shown in the accompanying graph.

No exercise: 240, 261, 271, 275, 276, 281.
Exercise: 261, 293, 316, 319, 324, 347.

a. What type of graph is shown? b. Using only the graph, is it possible to predict the outcome of a formal test of whether mean life span differs between the two treatments? Explain. c. Test whether exercise affects life span in mice affected by the disease. d. By how many days does exercise increase life span on average? Use a confidence interval to answer this question.

INTERLEAF 7 Which test should I use?

One of the most challenging parts of statistical analysis is deciding the right method for your particular question. In fact, with statistical computer programs so readily available, choosing the right method is usually the main challenge of data analysis left to us humans. The computer does the rest. We make choices to find the right graphical method, the right estimation approach, and the right hypothesis test. How can we choose the right method? Fortunately, the chain of logic involved in choosing the right method is similar in each case. In Chapter 2, we summarized the types of graphs available and when to use them. Here, we focus on choosing the right statistical test. We give four questions that you need to answer to help decide which test to use. The accompanying tables list information about the specific test, depending on your answers to these questions, and a short sketch following the four questions shows how the answers map onto functions in a typical statistics package.

1. Does your test involve just one variable, or are you testing the association between two or more variables? Different methods apply in each case. Tests for a single variable may address whether a certain probability model fits the data or whether a population parameter (such as a mean or a proportion) equals a specified value. Tests for two variables address whether the variables are associated or whether one variable differs between groups.

2. Are the variables categorical or numerical? Different tests are suited to different types of data. When testing for association between two variables, it matters whether the variables are categorical, numerical, or a mixture.

3. Are your data paired? Two treatments can be compared either with two independent samples or with a paired design in which both treatments are applied to every unit of a single sample. Different methods are required for the two approaches.

4. What are the assumptions of the tests, and do your data meet those assumptions? For example, many powerful tests assume that the data are drawn from a normal distribution. If this is not true, then another approach must be found.
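Here is the promised sketch, in Python with SciPy, of how the answers to questions 3 and 4 map onto common two-group tests; the dictionary and helper function are illustrative inventions of ours, not part of any library:

from scipy import stats

# Illustrative lookup: (paired?, roughly normal?) -> a common SciPy test
TWO_GROUP_TESTS = {
    (False, True): stats.ttest_ind,      # two-sample t-test (Chapter 12)
    (False, False): stats.mannwhitneyu,  # Mann-Whitney U-test (Chapter 13)
    (True, True): stats.ttest_rel,       # paired t-test (Chapter 12)
    (True, False): stats.wilcoxon,       # signed-rank test; assumes symmetry
}

def choose_two_group_test(paired, normal):
    return TWO_GROUP_TESTS[(paired, normal)]

print(choose_two_group_test(paired=False, normal=True))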
In this interleaf, we arrange the tests that we have already learned and several that are still to come in relation to these questions. Table 1 lists some of the common methods used for hypothesis tests involving a single variable.

TABLE 1 Commonly used statistical tests for data on a single variable. These methods test whether a population parameter equals the value proposed in the null hypothesis or whether a specific probability model fits a frequency distribution. (Red numbers in parentheses refer to the chapter that discusses the test. Some refer to future chapters.)

Categorical data:
■ Use frequency data to test whether a population proportion equals a null hypothesized value: binomial test (7), or χ2 goodness-of-fit test with two categories (used if sample size is too large for the binomial test) (8)
■ Use frequency data to test the fit of a specific population model: χ2 goodness-of-fit test (8)

Numerical data:
■ Test whether the mean equals a null hypothesized value when data are approximately normal (possibly only after a transformation) (13): one-sample t-test (11)
■ Test whether the median equals a null hypothesized value when data are not normal (even after transformation): sign test (13)
■ Use frequency data to test the fit of a discrete probability distribution: χ2 goodness-of-fit test (8)
■ Use data to test the fit of the normal distribution: Shapiro-Wilk test (13)

Most hypothesis tests are carried out to determine whether two variables are associated or correlated. This question can be addressed when the two variables are both categorical, both numerical, or when there is one of each. Table 2 lists the most common tests used for each combination when the appropriate assumptions are met.

TABLE 2 Commonly used tests of association between two variables. (Red numbers in parentheses refer to the chapter that discusses the test.)

■ Categorical response variable, categorical explanatory variable: contingency analysis (9)
■ Categorical response variable, numerical explanatory variable: logistic regression (17)
■ Numerical response variable, categorical explanatory variable: t-tests, ANOVA, Mann-Whitney U-test, etc. [See Table 3 for more details.]
■ Numerical response variable, numerical explanatory variable: linear and nonlinear regression (17); linear correlation (16); Spearman's rank correlation (when data are not bivariate normal) (16)

Many methods allow hypothesis tests of differences in a numerical response variable among different groups (see Table 2). Testing differences between groups is equivalent to a test of association between a categorical explanatory variable (group) and a response variable. Table 3 summarizes these tests and gives the particular circumstances in which each is used, along with alternatives that make fewer assumptions.

TABLE 3 A comparison of methods to test differences between group means according to whether the tests assume normal distributions. (Red numbers in parentheses refer to the chapter that discusses the test.)

■ Two treatments (independent samples). Assuming normal distributions: two-sample t-test (12); Welch's t-test (used when variance is unequal in the two groups) (12). Not assuming normal distributions: Mann-Whitney U-test (13)
■ Two treatments (paired data). Assuming normal distributions: paired t-test (12). Not assuming normal distributions: sign test (13)
■ More than two treatments. Assuming normal distributions: ANOVA (15). Not assuming normal distributions: Kruskal-Wallis test (15)

If you can organize these tests in your mind according to these classifications, it will be much easier to pick the right one. When you encounter a new test, think about whether it applies to one or more variables, whether the variables are categorical or numerical, and what assumptions it makes.

13 Handling violations of assumptions

All of the methods that we have learned about so far to estimate and test population means assume that the numerical variable has an approximately normal distribution. The two-sample t-test requires the further assumption that the standard deviations (and variances) are the same in the two corresponding populations. However, frequency distributions often aren't normal, and standard deviations aren't always equal.
More often than we would like, our study organisms haven't read their stats books carefully enough, and the data they generate do not match the nice neat assumptions of classical statistical methods. What options are available to us when the data do not meet these assumptions? In this chapter, we focus on four alternative options for analyzing such data:

1. Ignore the violations of assumptions. In some situations, we can use a procedure even if its assumptions are not strictly met. Methods for estimating and comparing means often work quite well when the assumption of normality is violated, especially if sample sizes are large and the violations are not too drastic.

2. Transform the data. For example, taking the logarithm is one way to transform data, with the result that the transformed data may better meet the assumptions. This procedure is often, but not always, effective.

3. Use a nonparametric method. A nonparametric method is one of a class of methods that do not require the assumption of normality. These methods can handle even badly behaved data, such as outliers that don't go away even when the data are transformed.

4. Use a permutation test. A permutation test uses a computer to generate a null distribution for a test statistic by repeatedly and randomly rearranging the data for one of the variables.

More alternatives are available, but these four are the most commonly used.1 We examine each of these approaches and explain the circumstances under which they should be used. All four assume that each data point in the sample is randomly and independently chosen from the population. We begin by reviewing methods to evaluate the assumption of normality.

Detecting deviations from normality

A few techniques are available to judge whether numerical data are from a population with a normal distribution.

Graphical methods

The most convenient methods for evaluating whether data fit a normal distribution are graphical. The human eye is a powerful tool for detecting deviations from a pattern. Start by plotting a histogram of the data, separately for each group if there is more than one. Data can be noisy, especially when the sample size is small, so don't expect your data to follow a perfect bell curve, even when they come from a normal population. For example, Figure 13.1-1 shows two rows of histograms of data, all sampled from a perfect normal distribution using a computer. The four in the first row are based on random samples of 10 individuals each, and the four in the second row are based on random samples of 20 individuals each. None of the eight histograms resembles a normal distribution precisely, but none is so badly behaved that it would cause us to give up our assumption of a normal population. Of course, the larger the sample size, the more likely it is for a frequency distribution to resemble that of the population it came from.

FIGURE 13.1-1 Top row: Histograms of four random samples of size n = 10 from the same normal distribution. Bottom row: Histograms of four random samples of size n = 20 from the same normal distribution.
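Figure 13.1-1 is easy to re-create for yourself. A minimal sketch in Python, assuming NumPy and matplotlib (the random seed and bin count are arbitrary choices of ours):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
fig, axes = plt.subplots(2, 4, figsize=(10, 4))
for row, n in zip(axes, (10, 20)):   # top row: n = 10; bottom row: n = 20
    for ax in row:
        ax.hist(rng.normal(size=n), bins=8)
plt.show()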
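In Python the same test is a one-liner, assuming SciPy and the biomass_ratio array from the quantile-plot sketch below Example 13.1:

from scipy import stats

# biomass_ratio as defined in the earlier quantile-plot sketch
w_stat, p_value = stats.shapiro(biomass_ratio)
print(w_stat, p_value)   # a small P rejects the null hypothesis of normality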
When to ignore violations of assumptions

What if the data don't meet the assumptions of normality or (for the two-sample methods) equal standard deviations? In this section, we consider the option of simply ignoring the violations. The justification is that methods for estimating and testing means are not highly sensitive to violations of the assumptions of normality and equality of standard deviations. Under certain conditions, these methods are robust, meaning that the answers they give are not sensitive to modest departures from the assumptions. Here we explain what these conditions are.

A statistical procedure is robust if the answer it gives is not sensitive to violations of the assumptions of the method.

Violations of normality

Even though confidence intervals and t-tests require normally distributed data, the methods can sometimes be used to analyze data that are not normally distributed. The reason for this robustness comes from the central limit theorem (Section 10.6), which states that when a variable does not have a normal distribution, the distribution of sample means is nevertheless approximately normal when sample size is large. With large enough samples, therefore, the sampling distribution of means behaves roughly as assumed by methods based on the t-distribution, even when the data are not from a normally distributed population, provided that the violation of normality is not too drastic (Box and Andersen 1955). Robustness applies only to methods for means, not to methods like the F-test for testing variances, which are not robust to departures from the assumption of normality.

How large must samples be to allow us to ignore the assumption of normality? The answer depends on the shape of the distributions. If two groups are being compared and both differ from normality in different ways, then even subtle deviations from normality can cause errors in the analysis (even with fairly large sample sizes). For example, look back at the skewed distributions in Figure 13.1-2 (i.e., panels [a], [b], and [c]). If we compared two groups that both had the same skew as in panel (a) of Figure 13.1-2, we would get reasonably accurate answers from a two-sample t-test with sample sizes of about 30 in each group. However, we would require sample sizes of about 500 or more to get sufficient accuracy from a two-sample test if we were to compare one group with right-skewed measurements like those in panel (a) with another group whose data were left-skewed like those in panel (b). If the distributions are even more skewed than those in panels (a) and (b) of Figure 13.1-2, then the two-sample methods based on the t-distribution should be avoided in favor of alternative methods. The frequency distribution in panel (c) of Figure 13.1-2 is so skewed that a t-test would not give reliable answers even with extremely large sample sizes. Frequency distributions containing outliers (e.g., panel [d] of Figure 13.1-2) should never be analyzed with a t-test or a confidence interval based on the t-distribution. Methods that assume normality are very sensitive to outliers.

In the absence of definitive guidelines for every possible case, we recommend a cautious approach to data analysis. If the data show strong departures from normality, such as outliers, or if the frequency distributions in different groups are markedly different, then it is best to adopt one of the other options (i.e., data transformations, permutation tests, or nonparametric methods) rather than simply ignoring the violations of assumptions.

Unequal standard deviations

When can we ignore the assumption of equal standard deviations in two-sample methods? With moderate sample sizes (greater than 30 in each group), the two-sample methods for estimation and hypothesis testing using the t-distribution will still perform adequately with even a threefold difference between groups in their standard deviations, as long as the sample sizes of the two groups are approximately equal (Ramsey 1980). If sample sizes are not approximately equal, or if the difference in standard deviations is more than threefold, then Welch's t-test (Section 12.3) should be used instead of the two-sample t-test, provided that the assumption of normality is met. (Indeed, Welch's t-test can be used even if the standard deviations are thought to be equal, but it is not as powerful as the two-sample t-test.)
If the assumption of normality is also not met, then it is best to try data transformations, a permutation test, or one of the methods described in Chapter 19.

Data transformations

One of the best ways to handle data that don't match the assumptions of a statistical method is to see if you can transform the data to better meet the assumptions. Transformations can make the standard deviations more similar in different groups and improve the fit of the normal distribution to data. For example, it is often the case that a sample of data does not follow a normal distribution, but that the logarithms of the data match the normal distribution rather well. The test or estimation procedure could then be carried out on the log-transformed data instead. Our goal is to find a numeric scale on which a difference between two measurements has a similar interpretation regardless of the average measurement. For example, the difference in mass between two randomly chosen elephants is likely to be much greater than the difference in mass between two mice, simply because elephants are so much bigger. On a log scale, however, the differences among mice and among elephants might be more comparable, in which case the log scale is appropriate.

A data transformation changes each measurement by the same mathematical formula.

The three most frequently used transformations are the log transformation, the arcsine transformation, and the square-root transformation.3 In this section, we examine the situations in which these three transformations are most likely to be useful. Keep in mind that, if a transformation is made to one data point, it has to be made to all data points from all samples for that variable if the data are to be compared. Throughout this section and beyond, we use a "prime" mark (′) to denote transformed data. Thus, if the original variable is Y, we would call the transformed variable Y′ (pronounced "Y-prime").

Log transformation

The most common data transformation in biology is the log transformation. Typically, the data are converted by taking the natural log (ln) of each measurement. In mathematical terms, Y′ = ln[Y].

The log transformation converts each data point to its logarithm.

Log base-10 is also sometimes used instead of the natural log, but we will use the natural log throughout this text. Again, all observations must be transformed in exactly the same way. (For example, it is not legitimate to compare the natural log of a variable from population 1 to the log base-10 of the variable in population 2.) Finally, the log transformation can be applied to data only when all the values are greater than zero. (The natural log of numbers less than or equal to zero is undefined.) If the data include zero, then Y′ = ln[Y + 1] can be tried instead. In general, the log transformation is most likely to be useful under one or more of these conditions:

■ The measurements are ratios or products of variables.
■ The frequency distribution of the data is skewed to the right (i.e., has a long tail on the right).
■ The group having the larger mean (when comparing two groups) also has the higher standard deviation.
■ The data span several orders of magnitude.

For example, compare the two probability distributions in the left panel of Figure 13.3-1. Both distributions are right-skewed, and the group with the larger mean has the higher standard deviation. The log transformation fixes both problems.
That is, both ln[Y1] and ln[Y2] are no longer skewed (they have normal distributions, instead) and both have the same standard deviation. Don't hesitate to try the log transformation in other situations as well.

FIGURE 13.3-1 Left panel: Two right-skewed probability distributions having different standard deviations. In this case, the distribution with the highest standard deviation also has the highest mean. Right panel: The log transformations of the same variables. On the log scale, these two distributions have normal distributions with the same standard deviation.

In our experience, body measurements such as mass and length often show right-skewed frequency distributions that become more normally distributed after being log-transformed. But log transformation will not always solve the problem, even when frequency distributions are right-skewed or when the group with the larger mean also has the higher standard deviation. In these circumstances, though, the log transformation is worth a try. Just be sure to check the distribution of the transformed data to determine whether it fits the assumptions of the desired method.

Let's look again at the study on biomass ratio in marine reserves (Example 13.1). The raw data are listed in Table 13.3-1 along with the log-transformed data. Try a couple of the log calculations to make sure you get the same numbers that we did. Recall that the frequency distribution of the biomass ratio was right-skewed (Figure 13.1-4).

TABLE 13.3-1 Biomass ratios from 32 marine reserves and their log transformations.

Biomass ratio   ln[Biomass ratio]
1.34            0.29
1.96            0.67
2.49            0.91
1.27            0.24
1.19            0.17
1.15            0.14
1.29            0.25
1.05            0.05
1.10            0.10
1.21            0.19
1.31            0.27
1.26            0.23
1.38            0.32
1.49            0.40
1.84            0.61
1.84            0.61
3.06            1.12
2.65            0.97
4.25            1.45
3.35            1.21
2.55            0.94
1.72            0.54
1.52            0.42
1.49            0.40
1.67            0.51
1.78            0.58
1.71            0.54
1.88            0.63
0.83            −0.19
1.16            0.15
1.31            0.27
1.40            0.34

The effects of the log transformation on the frequency distribution of the data are shown in Figure 13.3-2. The skew has been much reduced in the log-transformed data, which now conform to a normal distribution much more closely than before (Figure 13.1-4).

FIGURE 13.3-2 Frequency distribution of the natural logarithm of the biomass ratio from 32 marine reserves.

We can now apply the one-sample t-test to the transformed data, because they better meet the assumption that the sample came from a population having a normal distribution. Let's use the data to test the following hypotheses.

H0: The mean biomass ratio is unaffected by reserve protection (µ = 1).
HA: The mean biomass ratio is affected by reserve protection (µ ≠ 1).

Because we want to test the hypotheses on log-transformed data, though, we need to modify these hypotheses to reflect that transformation. Thus, a biomass ratio of 1 on the original scale is ln[1] = 0 on the log-transformed scale, so our revised hypotheses are as follows.

H0: The mean of the log biomass ratio of marine reserves is zero (µ′ = 0).
HA: The mean of the log biomass ratio of marine reserves is not zero (µ′ ≠ 0).

The prime (′) denotes the transformed scale. After this, we proceed with the hypothesis test as usual. The sample mean of Y′ is Y¯′ = 0.479, and the sample standard deviation is s′ = 0.366. (Note that these are not the same as the log of Y¯ and the log of the standard deviation of Y.) The corresponding one-sample t-statistic is

t = 0.479 / (0.366/√32) = 7.40,

with df = n − 1 = 32 − 1 = 31. Using a computer program, we find that the P-value is P = 2.4 × 10⁻⁸. This P-value is considerably less than 0.05, so we soundly reject the null hypothesis. Marine reserves have a higher mean log biomass than comparable unprotected areas, which implies that protected areas also have a higher biomass than unprotected areas.
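The whole analysis takes a few lines in Python, assuming SciPy and the biomass_ratio array defined in the earlier normality sketch; the output should reproduce the mean, standard deviation, and t-statistic above:

import numpy as np
from scipy import stats

log_ratio = np.log(biomass_ratio)                # natural-log transformation
print(log_ratio.mean(), log_ratio.std(ddof=1))   # about 0.479 and 0.366

# One-sample t-test of H0: mean log biomass ratio = ln(1) = 0
print(stats.ttest_1samp(log_ratio, popmean=0.0))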
This P-value is considerably less than 0.05, so we soundly reject the null hypothesis. Marine reserves have a higher mean log biomass than comparable unprotected areas, which implies that protected areas also have a higher biomass then unprotected areas. Arcsine transformation The arcsine transformation is used almost exclusively on data that are proportions: p′=arcsin [p], where p is a proportion measured on a single sampling unit. The arcsine (abbreviated, with a bizarre lack of brevity, as “arcsin”) is the inverse of the sine function from trigonometry.4 A dedicated transformation for proportions is needed because proportions tend not to be normally distributed, especially when the mean is close to zero or to one, and because groups differing in their mean proportion tend not to have equal standard deviations. The arcsine transformation often solves both of these problems at the same time. Don’t forget the square-root component of the arcsine transformation. Note that if the original data are given as percentages, they must first be converted to proportions by dividing by 100 before applying the arcsine transformation. The square-root transformation The square-root transformation is often used when the data are counts, such as number of mates acquired, number of eggs laid, number of bacterial colonies, and so Y′=Y+1/2. This transformation, like the log transformation, sometimes helps to equalize standard deviations between groups when the group with the higher mean also has the higher standard deviation.5 In our experience, the effects of the square-root transformation are usually very similar to those obtained with the log transformation. It makes sense, then, to try both and see which one works best. If their effects are the same, then it doesn’t matter which transformation you use. Other transformations When the frequency distribution of the data is skewed left, try the square transformation. This is done by squaring each data point: Y′=Y2. If the square transformation doesn’t work, try the antilog transformation, Y′=eY, on left-skewed data. When the data are skewed right, try the reciprocal transformation, Y′=1Y. The square transformation and the reciprocal transformation are usable only if all the data points in all samples have the same sign. If all numbers in the samples are negative, then try multiplying each number by −1 before further transforming the new positive numbers. Confidence intervals with transformations It may often be necessary to transform the data to meet the assumptions of the methods for confidence intervals. Once the confidence interval is computed, it is usually best to backtransform the lower and upper limits of the confidence interval before presenting the results. For example, let’s find the confidence interval for the mean of the log biomass ratio of marine reserves (Example 13.1). The original data were not normally distributed, so it is not valid to calculate a confidence interval directly from these data using the method based on the t-distribution. Recall that a log transformation fixed this problem. The sample mean of the log-transformed data is Y¯′=0.479, and the sample standard deviation is s′=0.366. There are a total of n = 32 data points in the sample. 
Confidence intervals with transformations

It may often be necessary to transform the data to meet the assumptions of the methods for confidence intervals. Once the confidence interval is computed, it is usually best to back-transform the lower and upper limits of the confidence interval before presenting the results. For example, let's find the confidence interval for the mean of the log biomass ratio of marine reserves (Example 13.1). The original data were not normally distributed, so it is not valid to calculate a confidence interval directly from these data using the method based on the t-distribution. Recall that a log transformation fixed this problem. The sample mean of the log-transformed data is Y¯′ = 0.479, and the sample standard deviation is s′ = 0.366. There are a total of n = 32 data points in the sample. The 95% confidence interval for the mean of the log of the biomass ratio is therefore

Y¯′ − t0.05(2),31 SEY¯′ < µ′ < Y¯′ + t0.05(2),31 SEY¯′
0.479 − (2.04)(0.366/√32) < µ′ < 0.479 + (2.04)(0.366/√32)
0.347 < µ′ < 0.611.

These limits of the confidence interval on the log scale are less easily interpreted than numbers on the original scale. However, we can back-transform the limits to yield a 95% confidence interval on the original scale. In the case of the log transformation, the back-transform is done by taking the antilog (i.e., raising e to the power of the quantity on the transformed scale). Therefore, the 95% confidence interval for the mean on the original scale (called the geometric mean)6 is

e^0.347 < geometric mean < e^0.611

or

1.41 < geometric mean < 1.84.

This 95% confidence interval for the geometric mean biomass ratio indicates that marine reserves have 1.41 to 1.84 times more biomass on average than the control sites do. This is a remarkably narrow interval for the mean effect of marine reserves. Similar back-transformations can be calculated for all of the other possible transformations. Back-transformations for each transformation are given in the Quick Formula Summary (Section 13.9 at the end of the chapter).

A caveat: Avoid multiple testing with transformations

Trying multiple transformations to find which one works best to meet the assumptions of a test is an excellent plan. It is not legitimate, though, to try multiple transformations to find one that leads to a statistically significant outcome (such as a P-value smaller than 0.05). Each transformation would give a slightly different result to the statistical test, and you might be tempted to choose the result that most agreed with your preconceptions about the data. Repeated testing inflates the Type I error rate, so it should be avoided. Instead, first decide which transformation best meets the assumptions of the method, and then stick with that decision when carrying out the test.

Nonparametric alternatives to one-sample and paired t-tests

If ignoring violations of assumptions and transforming the data fail, sometimes you can try nonparametric methods instead. These methods, developed for calculating confidence intervals and for hypothesis testing, make fewer assumptions about the probability distribution of the variable of interest in the population from which the data were drawn.7 Nonparametric methods are handy when the frequency distribution of the data is not normal, such as when there are outliers. In contrast, the methods that do make assumptions about the distributions are called parametric methods. All of the methods we have learned about so far, including those based on the t-distribution, are parametric methods.

A nonparametric method makes fewer assumptions than standard parametric methods do about the distributions of the variables.

Here we focus on nonparametric methods for hypothesis testing. We discuss nonparametric tests that address the same types of questions that we have already considered in Chapters 11 and 12. In this section, we cover alternatives to the one-sample t-test and the paired t-test. In Section 13.5, we use nonparametric methods to compare two independent groups. In subsequent chapters, we will introduce new nonparametric tests in tandem with the parametric tests designed for the same purpose. Nonparametric tests are usually based on the ranks of the data points rather than the actual values of the data. In other words, the data points are ranked from smallest to largest, and the rank (first, second, third, etc.) of each data point is recorded.
The actual measurements are not used again for the test. Using ranks is what frees us from making assumptions about the probability distribution of the measurements, because all distributions make similar predictions about the ranks of the measurements. Nonparametric tests are particularly useful when there are outliers in the data set, because ranks are not unduly affected by outliers.

Sign test

The sign test is a nonparametric method that can be used in place of the one-sample t-test or the paired t-test when the normality assumption of those tests cannot be met. The sign test assesses whether the median of a population equals a null hypothesized value. Measurements lying above the null hypothesized median are designated "+" and the numbers lying below are scored as "−." If the null hypothesis is correct, we expect half of the measurements to lie above the null hypothesized median and half to lie below, except for sampling error. The P-value can then be calculated using the binomial distribution (see Section 7.2). The sign test is simply a binomial test in which the number of data points above the null hypothesized median is compared with that expected when p = 1/2.

The sign test compares the median of a sample to a constant specified in the null hypothesis. It makes no assumptions about the distribution of the measurement in the population.

Unfortunately, the sign test has very little power compared with the one-sample or paired t-test because it discards most of the information in the data. A measurement that is infinitesimally larger than the null hypothesized median and a data point that exceeds the median by several million both count only as a +. Nonetheless, the sign test is a useful tool to have in your statistical toolbox because sometimes no other test is possible.

EXAMPLE 13.4 Sexual conflict and the origin of new species

The process by which a single species splits into two species is still not well understood. One proposal involves "sexual conflict," a genetic arms race between males and females that arises from their different reproductive roles.8 Sexual conflict can cause rapid genetic divergence between isolated populations of the same species, leading to the formation of new species. Sexual conflict is more pronounced in species in which females mate more than once, leading to the prediction that such species should form new species at a more rapid rate. To investigate this, Arnqvist et al. (2000) identified 25 insect taxa (groups) in which females mate multiple times, and they paired each of these groups with a closely related insect group in which females mate only once. Which type of insect tends to have more species? Table 13.4-1 lists the numbers of insect species in each of the groups.

TABLE 13.4-1 The number of species in 25 pairs of insect groups. Each pair matches a group of insect species in which females mate only once with a related group of insect species in which females mate multiple times.

Taxon pair   Species in multiple-mating group   Species in single-mating group   Difference   Above (+) or below (−) zero
A            53                                 10                               43           +
B            73                                 120                              −47          −
C            228                                74                               154          +
D            353                                289                              64           +
E            157                                30                               127          +
F            300                                4                                296          +
G            34                                 18                               16           +
H            3400                               3500                             −100         −
I            20                                 1000                             −980         −
J            196                                486                              −290         −
K            1750                               660                              1090         +
L            55                                 63                               −8           −
M            37                                 115                              −78          −
N            100                                30                               70           +
O            21,000                             600                              20,400       +
P            37                                 40                               −3           −
Q            7                                  5                                2            +
R            15                                 7                                8            +
S            18                                 6                                12           +
T            240                                13                               227          +
U            15                                 14                               1            +
V            77                                 16                               61           +
W            15                                 14                               1            +
X            85                                 6                                79           +
Y            86                                 8                                78           +

The data are paired.
Thus, for each group of insects whose females mate once, there is a corresponding, closely related group of insect species in which females mate more than once. For this reason, the analysis must focus on the paired differences. The differences listed in Table 13.4-1 were calculated by subtracting the number of species in the single-mating group from that of the corresponding multiple-mating group.

First, examine the histogram of the differences in Figure 13.4-1. These data have one outlier at 20,400, and we hardly need a Shapiro-Wilk test (Section 13.1) to tell us that the measurements are not normally distributed. At the same time, there are only 25 data points, which is too small a sample size to rely on the robustness of the paired t-test. There is no obvious transformation that would make these data normal, so we should pursue a nonparametric test instead. We will use the sign test to evaluate whether the median of the differences equals zero.

FIGURE 13.4-1 The distribution of differences in species number between single-mating and multiple-mating insect groups. There is an extreme outlier at 20,400.

Our hypotheses are as follows.

H0: The median difference in number of species between insect groups is zero.
HA: The median difference in number of species between these groups is not zero.

From this point on, the sign test is the same as the binomial test. If the null hypothesis is correct, then we expect half the measurements to fall above zero (+) and half to fall below zero (−). In fact, 18 out of the 25 measurements fall above zero and only seven fall below (see the last column in Table 13.4-1). We can use the binomial distribution to calculate the P-value for the test. What is the probability of getting seven or fewer "−" observations out of 25 when the probability of a "−" observation is 0.5 under the null hypothesis? The answer is

Pr[X ≤ 7] = Σ (from i = 0 to 7) of (25 choose i)(0.5)^i (0.5)^(25 − i) = 0.02164.

The alternative hypothesis requires a two-sided test, so we need to double this calculated probability to account for the other tail. Thus, the P-value is P = 2(0.02164) = 0.043. Because this P-value is less than α = 0.05, we reject the null hypothesis. Groups of insects whose females mate multiple times have more species than groups whose females mate singly, consistent with the sexual-conflict hypothesis.

Although it didn't come up when applying the sign test to the data in Example 13.4, what do you do about data points exactly equal to the hypothesized median? The usual approach is to drop all data points exactly equal to the median given by the null hypothesis. The test then proceeds as though those data had never existed. When this occurs, you must reduce the n used in the binomial calculations to the number of data points left after culling. Finally, if the sample size is five or fewer, it is impossible to reject the null hypothesis from a two-sided sign test with α = 0.05, no matter how different the true values are from the null hypothesized value. This underscores the fact that sign tests have low power and therefore require large sample sizes.
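The sign test is carried out by computer as a binomial test. A minimal sketch in Python, assuming SciPy's binomtest (available in recent SciPy versions):

from scipy import stats

# 7 of the 25 paired differences fall below zero; under H0 the number of
# "minus" signs is binomial with n = 25 and p = 0.5.
result = stats.binomtest(7, n=25, p=0.5, alternative='two-sided')
print(result.pvalue)   # about 0.043, matching the calculation above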
The Wilcoxon signed-rank test

The Wilcoxon signed-rank test is an improvement on the sign test for evaluating whether the median of a population is equal to a null-hypothesized constant. Unlike the sign test, the Wilcoxon signed-rank test retains information about magnitudes—that is, how far above or below the hypothesized median each data point lies. Unfortunately, it assumes that the distribution of measurements in the population is symmetric around the median—in other words, that no skew is present. This assumption is nearly as restrictive as the normality assumption of the one-sample t-test, greatly limiting the test's usefulness. (Skew, after all, is the usual reason that data don't fit the normal distribution, as in Example 13.4.) Because of this limitation, we do not explain the details of its calculation. Most statistics packages will carry out the Wilcoxon signed-rank test, but few highlight its limitations.

Comparing two groups: the Mann-Whitney U-test

The Mann-Whitney U-test can be used in place of the two-sample t-test when the normal distribution assumption of the two-sample t-test cannot be met.9 This method uses the ranks of the measurements to test whether the frequency distributions of two groups are the same. If the distributions of the two groups have the same shape, then the Mann-Whitney U-test compares the locations (medians or means) of the two groups.

The Mann-Whitney U-test compares the distributions of two groups. It does not require as many assumptions as the two-sample t-test.

Example 13.5 shows how the Mann-Whitney U-test works.

EXAMPLE 13.5 Sexual cannibalism in sagebrush crickets

The sagebrush cricket, Cyphoderris strepitans, has an unusual form of mating. During mating, the male offers his fleshy hind wings to the female to eat. The wounds are not fatal,10 but a male with already nibbled wings is less likely to be chosen by females he meets later. Females get some nutrition from feeding on the wings, which raises the question, "Are females more likely to mate if they are hungry?" Johnson et al. (1999) answered this question by randomly dividing 24 females into two groups: one group of 11 females was starved for at least two days and another group of 13 females was fed during the same period. Finally, each female was put separately into a cage with a single (new) male, and the waiting time to mating was recorded. The data are listed in Table 13.5-1. The median time to mating was 13.0 hours for starved females and 22.8 hours for fed females.

TABLE 13.5-1 Times to mating (in hours) for female sagebrush crickets that were recently starved or fed. The measurements of fed females are in red to facilitate comparison after ranking (see Table 13.5-2).

Starved: 1.9, 2.1, 3.8, 9.0, 9.6, 13.0, 14.7, 17.9, 21.7, 29.0, 72.3
Fed: 1.5, 1.7, 2.4, 3.6, 5.7, 22.6, 22.8, 39.0, 54.4, 72.1, 73.6, 79.5, 88.9

We start by writing our hypotheses.

H0: Time to mating is the same for female crickets that were starved and for those that were fed.
HA: Time to mating differs between these two groups.

This is a two-tailed test. The frequency distributions of the two groups are shown in the histograms in Figure 13.5-1.

FIGURE 13.5-1 Histograms of the time to mating for female sagebrush crickets that were starved (upper panel) or fed (lower panel).

The data have positive skew, and a log transformation did not make them normally distributed. With the relatively small sample sizes, we cannot count on the central limit theorem to yield a normal sampling distribution of means, so the two-sample t-test is a poor choice. Therefore, we will apply a nonparametric test: the Mann-Whitney U-test. There are n1 = 11 crickets that were starved and n2 = 13 that were fed. For convenience, we'll label these group 1 and group 2, respectively.
To use the Mann-Whitney U-test, first rank all the data from smallest to largest, combining data from both groups, as shown in Table 13.5-2. The smallest measurement from either data set is from group 2, at 1.5 hours. We assign this data point a rank of 1. The second smallest data point is 1.7 hours, which we assign a rank of 2. We continue in this way, considering the data from both groups together, until we have ranked them all.

TABLE 13.5-2 Times to mating of female crickets from both groups, ordered from smallest to largest and then ranked. Data from group 2 (fed crickets) are highlighted in red to facilitate comparison.

Group   Time to mating   Rank
2       1.5              1
2       1.7              2
1       1.9              3
1       2.1              4
2       2.4              5
2       3.6              6
1       3.8              7
2       5.7              8
1       9.0              9
1       9.6              10
1       13.0             11
1       14.7             12
1       17.9             13
1       21.7             14
2       22.6             15
2       22.8             16
1       29.0             17
2       39.0             18
2       54.4             19
2       72.1             20
1       72.3             21
2       73.6             22
2       79.5             23
2       88.9             24

Second, calculate the rank sum for one of the two groups (it doesn't matter which group, but it's easier to use the group with the fewest data points). R1 is the sum of all of the ranks in group 1 (the starved group). The ranks for group 1 are 3, 4, 7, 9, 10, 11, 12, 13, 14, 17, and 21, which added together give

R1 = 3 + 4 + 7 + 9 + 10 + 11 + 12 + 13 + 14 + 17 + 21 = 121.

Third, use the rank sum to calculate a new quantity, U1, as follows:

U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1 = 11(13) + \frac{11(11 + 1)}{2} - 121 = 88.

U1 is the number of times that a data point from sample 1 is smaller than a data point from sample 2, if we compare all possible pairs of points taken one from each sample. Also calculate U2, as follows:

U2 = n1n2 − U1 = 11(13) − 88 = 55.

U2 is the quantity associated with group 2. If we applied the equation for U1 to the data from group 2, we would get the same answer (55).

Fourth, choose the larger of U1 or U2 as our test statistic, U. In this case, U1 is larger than U2, so our test statistic is U = U1 = 88.

Finally, determine the P-value by comparing the observed U with the critical value of the null distribution for U. Critical values for cases in which sample sizes are relatively small are provided in Statistical Table E. Note that this table is based on sample sizes n rather than degrees of freedom. The null hypothesis is rejected if U equals or exceeds the critical value for U. For the cricket data, with α = 0.05, n1 = 11, and n2 = 13, the critical value of U is U_{0.05(2),11,13} = 106. Our U (88) is less than this critical value (106), so P > 0.05 and we cannot reject the null hypothesis. The data, therefore, provide insufficient evidence that time to mating is different between starved and fed female crickets.
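For readers who want to verify the arithmetic, the ranking and the U calculation can be sketched in a few lines of Python (an illustration of the steps above, not a substitute for a statistics package; the ranking shortcut used here works only because these data contain no ties):

    # times to mating (hours), from Table 13.5-1
    starved = [1.9, 2.1, 3.8, 9.0, 9.6, 13.0, 14.7, 17.9, 21.7, 29.0, 72.3]
    fed = [1.5, 1.7, 2.4, 3.6, 5.7, 22.6, 22.8, 39.0, 54.4, 72.1,
           73.6, 79.5, 88.9]

    pooled = sorted(starved + fed)
    # rank of each starved measurement in the pooled, sorted sample
    # (ranks start at 1; index() is safe here because no value is tied)
    R1 = sum(pooled.index(y) + 1 for y in starved)    # 121
    n1, n2 = len(starved), len(fed)
    U1 = n1 * n2 + n1 * (n1 + 1) // 2 - R1            # 88
    U2 = n1 * n2 - U1                                 # 55
    U = max(U1, U2)   # compare with the critical value in Statistical Table E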
Tied ranks

The data from Example 13.5 contain no observations tied for the same value, but often the same value appears more than once in a data set. When measurements are tied, we assign to all instances of the same measurement the average of the ranks that the tied points would have received. Table 13.5-3, for example, contains measurements from two hypothetical groups.

TABLE 13.5-3 Measurements from two hypothetical groups, illustrating how ties are ranked.

Group   Y    Rank
2       12   1
2       14   2
1       17   3
1       19   4.5
2       19   4.5
1       24   6
2       27   7
1       28   8

The ranks 1, 2, and 3 are assigned, as usual, to the three smallest values: 12, 14, and 17. But there are two measurements with the same value, 19. The two tied measurements, 19, would have been given ranks 4 and 5. Therefore, we assign each of the tied measurements a rank of (4 + 5)/2 = 4.5 (called the "midrank"). Thereafter, we continue to assign ranks to the data by assigning the next value, 24, a rank of 6, because we have already used ranks 4 and 5 for the two tied measurements (mistakes are often made at this step when ranking by hand).11

Data can also be in three-, four-, or even more-way ties, but we apply the same procedure. If we had three values that would otherwise have been given ranks 6, 7, and 8, then all of those individuals would be assigned a midrank of (6 + 7 + 8)/3 = 7. Thereafter, we would continue by assigning the next higher data point the rank of 9.

Large samples and the normal approximation

Statistical Table E does not include critical values for the U-statistic when sample sizes are large. Fortunately, for medium and large sample sizes (n1 and n2 > 10), a transformation of the U-statistic,

Z = \frac{2U - n_1 n_2}{\sqrt{n_1 n_2 (n_1 + n_2 + 1)/3}},

has a sampling distribution that is well approximated by the standard normal distribution if the null hypothesis is true. For example, the critical value for the Z-statistic at α = 0.05 is 1.96 (Statistical Table B).
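Both the midrank rule and the large-sample approximation are simple to compute. The sketch below assumes the SciPy library is available for the ranking step (its rankdata function assigns midranks to ties by default); the z_from_u helper is our own name for the formula above:

    import math
    from scipy.stats import rankdata

    # midranks for the hypothetical data in Table 13.5-3:
    # the two 19s share the average of ranks 4 and 5
    print(rankdata([12, 14, 17, 19, 19, 24, 27, 28]))
    # -> [1.  2.  3.  4.5 4.5 6.  7.  8. ]

    # normal approximation to the null distribution of U for larger samples
    def z_from_u(u, n1, n2):
        return (2 * u - n1 * n2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 3)

    # for the cricket data, Z is about 0.96, well below 1.96,
    # matching the conclusion from Statistical Table E
    print(z_from_u(88, 11, 13))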
Assumptions of nonparametric tests

While nonparametric tests do not rely on the normal distribution, they still make assumptions. Nonparametric tests assume, for example, that both samples are random samples from their populations. Without random samples, nonparametric tests are as likely as parametric tests to give erroneous answers.

As mentioned in Section 13.4, the Wilcoxon signed-rank test assumes that the distribution of measurements in the population is symmetrical. This assumption limits the utility of the method.

The Mann-Whitney U-test compares the distributions of two groups. The null hypothesis H0 is that the distributions are the same. Rejecting H0 therefore implies that the distributions are not the same. However, rejecting the null hypothesis does not necessarily imply that the two distributions have different locations (means or medians). Concluding that the locations are different, using a Mann-Whitney U-test, requires the additional assumption that the distributions of the two groups have the same shape. For example, the two distributions must have the same variance and the same skew. The Mann-Whitney U-test is very sensitive to differences between distributions in their shapes caused by unequal variances or different skews (Zimmerman 2003). For this reason, we recommend that the Mann-Whitney U-test be used to test the null hypothesis that the distributions are the same, rather than testing the null hypothesis that the locations (means or medians) are the same. This limitation of the Mann-Whitney test is often not appreciated, and you will find many instances of misuse in the scientific literature.

Type I and Type II error rates of nonparametric methods

When the assumptions of a given test are met, the probability of making a Type I error (i.e., rejecting a true null hypothesis) is constant at α for both parametric and nonparametric tests. When the assumptions of a parametric test are not met, such as by a sharply non-normal distribution of measurements, the Type I error rate can become larger than the stated α-value, and sometimes this excess is extremely large (Ramsey 1980). In this case, the researchers would have false confidence in the reliability of the results. This excess Type I error rate is the main reason not to use parametric tests when the assumptions are strongly violated. Under these conditions, a nonparametric test will yield a Type I error rate equal to α, provided its less restrictive assumptions are met.

The story is different for Type II errors (i.e., failing to reject a false null hypothesis). By using only ranks, nonparametric tests use less information from the data than do parametric tests. The actual magnitudes are discarded. This loss of information causes nonparametric tests to have less power than the corresponding parametric tests when the assumptions of the parametric tests are met. Less power means that the nonparametric test has a lower probability of rejecting a false null hypothesis, and therefore a higher Type II error rate, than does the parametric test. For this reason, most biologists prefer to use parametric tests if the assumptions can be met, and they turn to nonparametric methods only after data transformation has failed to meet the assumptions of the parametric methods. The reduced power of nonparametric tests is immaterial when the assumptions of the parametric test are strongly violated. In that case, the parametric test cannot be used at all.

Nonparametric tests are typically less powerful than parametric tests.

How much lower is the power of the sign test and the Mann-Whitney U-test compared with their corresponding parametric tests? We can sensibly make this comparison only for cases in which the assumptions of the parametric tests are met. In this case, the power of a Mann-Whitney U-test is about 95% as great as the power of a two-sample t-test when sample sizes are large (Mood 1954). This result is not too bad. The difference in power is greater, however, with smaller sample sizes. In the extreme case, when the sample size in each of two groups is only two, the power of the Mann-Whitney U-test is zero. The sign test has low power, much lower than the one-sample t-test and the paired t-test. Even with large samples, the sign test has only 64% of the power of a t-test (Mood 1954), and its power is even lower with smaller samples. Therefore, the sign test should be used only as a last resort, when the assumptions of the parametric tests simply cannot be met.

Permutation tests

Cheap computing has made possible new approaches for analyzing data, especially non-normal data. In this section we describe one method: the permutation test. We use the approach here to provide an alternative to the two-sample t-test and the Mann-Whitney U-test, to test for an association between a categorical explanatory variable (treatments or groups) and a numerical response variable. Permutation tests can also be used to test an association between two categorical variables (like a contingency analysis) or between two numerical variables (like a correlation analysis, described in Chapter 16). Other computer-intensive methods for estimation and hypothesis testing are covered in Chapter 19.

"Permutation" means rearrangement. In each step of a permutation test, we take the values of one of the two variables measured on the sample of individuals and randomly rearrange (permute) them. In other words, we randomly mix up the associations among the variables. This gives an idea of what values the test statistic would have if the two variables were not associated (except by chance). This method is often referred to as a randomization test, because the values for one of the two variables are repeatedly "randomized" during the test procedure. Permutation tests require us to make few assumptions about the frequency distributions of the variables and so are very versatile.

In a permutation test, a test statistic is chosen that measures the association between the two variables in the data.
Here, we use the difference between the sample means of the two groups as our test statistic in a two-sample test.12 This statistic is calculated on every permuted data set, each having the values of one variable randomly rearranged. Repeating the permutation procedure many times yields many values of the test statistic under the null hypothesis of no association. The frequency distribution of these values is therefore used as an approximate null distribution. If the observed value of the test statistic calculated from the original data is unusual compared to the null distribution, we reject the null hypothesis of no association between the variables.

A permutation test generates a null distribution for the association between two variables by repeatedly and randomly rearranging the values of one of the two variables in the data.

We illustrate the permutation test with the data from Example 13.5, in which we compared the mean time to mating of female sagebrush crickets assigned to either of two treatment groups: starved and fed. The data are reproduced in Table 13.8-1 in a slightly different format. These data are not normally distributed, as we saw from the histograms in Figure 13.5-1. As a result, we cannot use a two-sample t-test.

TABLE 13.8-1 Times to mating (in hours) of female sagebrush crickets that were recently starved or fed. Data from the two treatments are color-coded to more easily identify the origin of each value later in Table 13.8-2.

Treatment   Time (hours)      Treatment   Time (hours)
Starved     1.9               Fed         1.5
Starved     2.1               Fed         1.7
Starved     3.8               Fed         2.4
Starved     9.0               Fed         3.6
Starved     9.6               Fed         5.7
Starved     13.0              Fed         22.6
Starved     14.7              Fed         22.8
Starved     17.9              Fed         39.0
Starved     21.7              Fed         54.4
Starved     29.0              Fed         72.1
Starved     72.3              Fed         73.6
                              Fed         79.5
                              Fed         88.9

Previously we analyzed these data using the Mann-Whitney U-test, but here we will reanalyze them with a permutation test. There are a total of 24 data points, 11 from the starved treatment and 13 from the fed treatment. We are testing the following hypotheses.

H0: Mean time to mating is the same for female crickets that were starved and for those that were fed.
HA: Mean time to mating differs between these two groups.

This is a two-sided test, so we reject H0 if the difference in the mean time to mating is much greater than zero or if it is much less than zero.

To carry out a permutation test of the hypotheses, we need to decide on a test statistic to describe the difference between the two treatments. The simplest statistic is the observed difference between the sample means, Y¯1 − Y¯2, where group 1 refers to the starved group. (We might convert this quantity to a t-statistic by dividing by the standard error of the difference, but there is little to be gained by doing so, because we won't be using the t-distribution.) From the data, the difference in the mean time to mating is

Y¯1 − Y¯2 = 17.73 − 35.98 = −18.26

(the difference is computed from the unrounded means). To generate the null distribution of possible values of Y¯1 − Y¯2 by permutation, follow these steps:

1. Create a permuted set of data in which the values of the response variable are randomly reordered. To do this, list all of the observations, as in Table 13.8-1. Now take all of the data values for one variable (here, time to mating) and randomly rearrange them among individuals, while leaving the other variable unchanged. An example is shown in Table 13.8-2. In other words, combine all 24 values for the time to mating into a single pool.
Choose one value at random, and assign it to the first individual in the starved treatment (this measurement was 3.8 in our first permutation; see Table 13.8-2). Next, choose one of the remaining 23 measurements at random and assign it to the second starved individual (ours was the measurement 9.0). Continue with this process, eliminating each real data point from the sampling pool as you use it, until all 24 measurements are used up.13 Let's call the result a "permuted sample." The size of each group in the permuted sample is the same as its size in the original data. Each original measurement is still present exactly once, but by chance it might be assigned to a different group than the group it came from. You will need a computer program to do this permutation.

TABLE 13.8-2 Outcome of a single permutation. Response measurements (time to mating) are color-coded as in Table 13.8-1 to indicate their original groups.

Treatment   Time (hours)      Treatment   Time (hours)
Starved     3.8               Fed         14.7
Starved     9.0               Fed         21.7
Starved     3.6               Fed         1.7
Starved     79.5              Fed         2.1
Starved     17.9              Fed         1.5
Starved     22.8              Fed         2.4
Starved     54.4              Fed         5.7
Starved     13.0              Fed         39.0
Starved     9.6               Fed         29.0
Starved     1.9               Fed         72.1
Starved     22.6              Fed         88.9
                              Fed         72.3
                              Fed         73.6

2. Calculate the measure of association for the permuted sample. For the single permuted sample shown in Table 13.8-2, the mean of the starved group is 21.65 and the mean of the fed group is 32.67. The difference is thus

Y¯1 − Y¯2 = 21.65 − 32.67 = −11.02.

This is the result from the first replicate of the permutation process.

3. Repeat the permutation process many times (at least 1000). We repeated this permutation process a total of 10,000 times, recording the value of Y¯1 − Y¯2 each time. Figure 13.8-1 shows the resulting distribution of values of Y¯1 − Y¯2 from all 10,000 permutations. This distribution approximates the null distribution of the difference between the two groups.

FIGURE 13.8-1 The null distribution of the test statistic Y¯1 − Y¯2 from 10,000 replicates of the permutation process. Of the 10,000 permutations, only 712 had a value of Y¯1 − Y¯2 equal to or less than the observed value, −18.26.

To determine an approximate P-value for the test, we use the simulated null distribution for Y¯1 − Y¯2 in the same way that we would use a theoretical null distribution described by a formula (had one been available). To begin, find the proportion of values in the null distribution that equal the observed value of the test statistic, calculated from the data, or that lie further in the tail. For our example, this is the fraction of test statistics from all of the permuted samples that are less than or equal to the observed difference between the means, Y¯1 − Y¯2 = −18.26. We found that 712 of the 10,000 permuted data sets, a proportion of 0.0712, yielded such an outcome (Figure 13.8-1). To obtain P, we multiply this proportion by two to take into account the equally extreme outcomes in the other tail of the null distribution (remember, this is a two-tailed test). Therefore, the P-value for the test is approximately 2(0.0712) = 0.142. Since this value is greater than 0.05, we do not reject the null hypothesis. We are unable to conclude that starved and fed females differ in the mean time to mating.

Because the permuted samples are generated by a random process, the null distribution would not be exactly the same if this test were repeated. As a result, the P-value would vary slightly from one test to another. Nevertheless, with a large number of permuted data sets, the P-value can be approximated with good precision.
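The whole procedure takes only a few lines of code. Here is one possible sketch in Python (standard library only; the variable names are our own), following the three steps above:

    import random

    starved = [1.9, 2.1, 3.8, 9.0, 9.6, 13.0, 14.7, 17.9, 21.7, 29.0, 72.3]
    fed = [1.5, 1.7, 2.4, 3.6, 5.7, 22.6, 22.8, 39.0, 54.4, 72.1,
           73.6, 79.5, 88.9]

    def mean(x):
        return sum(x) / len(x)

    observed = mean(starved) - mean(fed)   # about -18.26
    pooled = starved + fed                 # one pool of all 24 measurements
    n1 = len(starved)

    n_permutations = 10_000
    as_extreme = 0
    for _ in range(n_permutations):
        random.shuffle(pooled)             # randomly reassign times to groups
        diff = mean(pooled[:n1]) - mean(pooled[n1:])
        if diff <= observed:               # as far or farther into the left tail
            as_extreme += 1

    # two-sided P-value: double the one-tail proportion
    p_value = 2 * as_extreme / n_permutations
    print(p_value)                         # approximately 0.14 for these data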
Assumptions of permutation tests

Permutation tests make few assumptions and can be applied in a wide variety of circumstances, but some assumptions are required. First, the data must be a random sample from the population. Second, for permutation tests that compare means or medians between groups, the distribution of the variable must have the same shape in every population. Permutation tests are robust to violations of this assumption when sample sizes are large, more so than the Mann-Whitney U-test.

Permutation tests have lower power (i.e., lower ability to reject a false null hypothesis) than parametric tests when the sample size is small, but they are more powerful than the Mann-Whitney U-test. They have similar power to parametric tests when sample sizes are large.

Summary

■ Statistical methods—such as the one-sample, paired, and two-sample t-tests—that make assumptions about the distribution of variables are called parametric methods. Methods that do not make assumptions about the distribution of variables are called nonparametric methods.
■ When the assumptions of parametric tests are violated, there are at least four alternative solutions: ignore the violations, if they are minor; transform the data to better meet the assumptions; use nonparametric methods, if the previous two strategies are insufficient; and use permutation tests. (More options are discussed in Chapter 19.)
■ A statistical method is robust if violations of its assumptions do not greatly affect its results.
■ Methods to evaluate the fit of the normal distribution to a data set include visual inspection of histograms, normal quantile plots, and the Shapiro-Wilk test.
■ Parametric methods for comparing means are robust to minor violations of the assumption of normal populations, especially if the sample size is large. It is acceptable to ignore minor violations with large data sets and to proceed with the parametric tests.
■ The two-sample t-test and the confidence interval for the difference between two means are robust to minor violations of the assumption of equal standard deviations between populations, if the sample sizes are approximately equal in the two groups.
■ Many data can be transformed mathematically to a new scale on which the assumptions of parametric methods are met. The parametric method can then be performed on the transformed data.
■ The most common data transformations are the log transformation, the arcsine transformation, and the square-root transformation. Of these, the log transformation is used most often.
■ The sign test is a nonparametric alternative to the paired t-test or one-sample t-test. It is much less powerful (it is less likely to reject a false null hypothesis) than the corresponding parametric tests.
■ The Wilcoxon signed-rank test is a nonparametric alternative to the paired t-test. However, it assumes a symmetrical distribution, so it should be used with caution.
■ The Mann-Whitney U-test is a nonparametric alternative to the two-sample t-test. It tests the null hypothesis that the distributions are the same. Rejecting the null hypothesis allows us to conclude that the two distributions are different, but we cannot necessarily conclude that the two groups have different locations (means or medians). The U-test is less powerful than the two-sample t-test, but it is almost as powerful when the sample size is large.
■ To use the Mann-Whitney U-test to test whether the locations of two distributions (medians or means) are the same, we must assume that the distributions of the variable in the two populations have the same shape.
■ A permutation test is a method used to generate a null distribution for a measure of association between two variables (including a difference between groups) by randomly rearranging the observed values for one of the variables. The frequency distribution of test statistics calculated on many randomized data sets gives the null distribution of the test statistic.
■ The null distribution of a test statistic generated in a permutation test is used to calculate the P-value.

13.10 Quick Formula Summary

Transformations

Log: Y' = \ln[Y] or Y' = \ln[Y + 1].
Arcsine: p' = \arcsin[\sqrt{p}].
Square root: Y' = \sqrt{Y + 1/2}.

Back-transformations

Log: Y = e^{Y'}.
Arcsine: p = (\sin[p'])^2.
Square root: Y = Y'^2 - 1/2.

Sign test

What is it for? A nonparametric test of whether the median of a population equals a specified constant.
What does it assume? Random samples.
Test statistic: The number of measurements greater than (or less than) the median specified by the null hypothesis.
Formula: Identical to a binomial test with H0: p = 0.5.

Mann-Whitney U-test

What is it for? A nonparametric test to compare the frequency distributions of two groups.
What does it assume? Random samples. When used to compare the locations (means or medians) of two distributions, the U-test additionally assumes that the distributions of the variable in the two populations have the same shape.
Test statistic: U.
Sampling distribution under H0: The U-distribution, with sample sizes n1 and n2.
Formula: U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1 and U_2 = n_1 n_2 - U_1, where R_1 is the sum of the ranks for group 1. For a two-tailed test, U is the greater of U1 or U2. For n1 and n2 ≤ 10, use Statistical Table E. For large sample sizes, compare

Z = \frac{2U - n_1 n_2}{\sqrt{n_1 n_2 (n_1 + n_2 + 1)/3}}

to the critical value from the standard normal distribution.
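To make the transformation and back-transformation pairs concrete, here is a small self-checking sketch in Python (our own illustration of the formulas above, using only the math module). A confidence interval calculated on a transformed scale is back-transformed by applying the inverse function to each limit, as in Practice Problem 1 below:

    import math

    # log transformation: Y' = ln[Y]; back-transformation: Y = e^(Y')
    y = 150.0
    y_prime = math.log(y)
    assert math.isclose(math.exp(y_prime), y)

    # arcsine transformation: p' = arcsin[sqrt(p)]; back: p = (sin[p'])^2
    p = 0.35
    p_prime = math.asin(math.sqrt(p))
    assert math.isclose(math.sin(p_prime) ** 2, p)

    # square-root transformation: Y' = sqrt(Y + 1/2); back: Y = Y'^2 - 1/2
    y = 7.0
    y_prime = math.sqrt(y + 0.5)
    assert math.isclose(y_prime ** 2 - 0.5, y)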
PRACTICE PROBLEMS

1. Calculation practice: Confidence interval for the mean using log transformation. Refer to Practice Problem 19 in Chapter 10. Health spending per person from a random sample of 20 countries is given below.

Country              Per capita health expenditure in 2010
Bahrain              864
Belarus              320
Belize               239
Brunei Darussalam    882
Colombia             472
Congo, Rep.          72
Côte d'Ivoire        60
Cuba                 607
Finland              3984
Germany              4668
Guinea-Bissau        47
Guyana               180
Jamaica              247
Lesotho              109
Malta                1697
Morocco              148
Namibia              361
Philippines          77
Qatar                1489
Saudi Arabia         680

We will use this sample to estimate the mean of log health expenditure, including a confidence interval.
a. Visualize the frequency distribution of the data using a histogram. What feature or features of this distribution indicate that the data are likely not from a population having a normal distribution?
b. What features of this distribution make it a good candidate for a log transformation?
c. Calculate the natural log transformation for each data point in the sample.
d. What is the sample size?
e. What is the mean of the log health expenditure?
f. What is the standard deviation of the log health expenditure?
g. Calculate the standard error of the mean log health expenditure.
h. Calculate the 95% confidence interval for the mean log health expenditure.
i. What are the limits of this confidence interval expressed on the original (i.e., non-log) scale?

2. Calculation practice: The sign test. Female goldeneyes (a kind of duck) lay eggs in other females' nests, in addition to the eggs they produce and raise in their own nests. One advantage to a female of this parasitic behavior is that other females do the work of raising her offspring. Andersson and Åhlund (2012) measured which of the eggs produced by 14 female goldeneyes were laid in other nests and which were laid in their own nests. We have converted their data into a "parasitism first" index, given below. The index is positive if a female tends to lay her earliest eggs parasitically in others' nests (and keeps the later ones for herself to raise). The index is negative if she tends to lay her last eggs in others' nests (and keeps the earlier ones). The index is zero if she alternates and her parasitic eggs are neither earlier nor later on average than eggs she keeps in her own nest.14

Female name   Parasitism first index
female 1      −2.3
female 2      8
female 17     10.5
female 18     4.6
female 19     5.5
female 20     6
female 37     0
female 51     5
female 55     4.5
female 58     6.5
female 70     4.6
female 76     3.8
female 80     5
female 94     5

Using the following steps, test whether goldeneyes tend to lay their eggs in others' nests before laying in their own.
a. Plot these data. Do they look normally distributed?
b. Why is a sign test suitable for these data?
c. What are the null hypothesis and the alternative hypothesis for this test?
d. Convert each data point into a positive or negative score, to express its value relative to the value stated in the null hypothesis.
e. After discarding the values that equal the value stated in the null hypothesis, what is the remaining sample size?
f. Conduct a binomial test using the observed positive and negative scores against the null expectation that half are positive and half are negative.
g. What can you conclude about the goldeneyes' behavior based on this test?

3. Calculation practice: Mann-Whitney U-test. Recycling paper has some obvious benefits, but it may have unintended consequences. For example, perhaps people are less careful about how much paper they use if they know that their waste will be recycled. Catlin and Wang (2013) tested this idea by measuring paper use in two groups of experimental participants. Each person was placed in a room alone with scissors, paper, and a trash can, and was told that he or she was testing the scissors. In the "recycling" group only, there was also a recycling bin in the room. The amount of paper used by each participant was measured in grams. The data from each person are listed below.

No recycling bin: 4, 4, 4, 4, 4, 4, 4, 5, 8, 9, 9, 9, 9, 12, 12, 13, 14, 14, 14, 14, 15, 23.
With recycling bin: 4, 5, 8, 8, 8, 9, 9, 9, 12, 14, 14, 15, 16, 19, 23, 28, 40, 43, 129, 130.

a. Make and examine histograms of these data. Are the frequency distributions of paper use in the two treatment groups similar in shape and spread?
b. Based on your results in part (a), discuss your options for testing a difference between these two groups in the amount of paper used.
c. We will apply a Mann-Whitney U-test to test the hypothesis that these two treatments have the same distribution of paper use. State the null and alternative hypotheses clearly.
d. Rank all the values of paper use from smallest to largest. Properly account for ties.
e. For each treatment group, calculate the rank sum and the sample size.
f. Calculate the Mann-Whitney U1 value for the treatment without the recycling bin.
g. Using the result from part (f), calculate U2 for the treatment with the recycling bin.
h. What is the value of the test statistic U?
i. Calculate the P-value as accurately as you can, state the conclusion of the test, and interpret the results.

4.
The four graphs shown below are normal quantile plots for four different data sets, each sampled randomly from a different population. For each graph, say whether the distribution is close to a normal distribution. If not, say how the data differ from what you might expect from a normal distribution.

FIGURE FOR PROBLEM 4

5. Below are frequency distributions of four hypothetical populations.
a. For each graph (i through iv), say whether the distribution appears to be a normal distribution.
b. For each graph, imagine that you want to test the null hypothesis that the mean or median is zero based on a random sample. What statistical method would you use? If you suggest a transformation of the data, say why you chose that transformation.
c. Match the distributions given in this problem to the quantile plots of samples in Practice Problem 4.

6. For each of the following sets of numbers, log-transform the values and calculate the sample mean and 95% confidence interval of the population mean. If that is impossible, say why it is impossible.
a. 10.2, 0.105, 67.3, 827
b. 2.1, 8.3, 3.2, 30.1
c. 17, −14, 37, 12
d. 12; 1.2; 125; 12,300
e. 0, 1.2, 4.5, 3.2

7. As a species becomes very rare, opportunities for mating might become reduced. This could result in low offspring numbers and further reductions in population size. Widén (1993) studied these effects in the rare Senecio integrifolius, a daisy-like herb, in Sweden. Below are his measurements of average percent seed set (percent of flowers producing seeds) of these plants at six different field sites in 1981:

29.8, 44.2, 58.3, 83.0, 78.2, 72.0 percent

a. Apply the arcsine transformation to these data and calculate the mean and standard deviation.
b. Calculate a 95% confidence interval for the population mean of the arcsine-transformed measurements.
c. Back-transform the upper and lower limits of the confidence interval to obtain the confidence interval on the percent scale.

8. When intruding male lions take over a pride of females, they often kill most or all of the infants in the group. This reduces the time until the females are again sexually receptive. This infanticide has many consequences for the biology of lions. It may be the reason, for example, that female lions band together in groups in the first place (to be better able to repel invading males).15 The period after the takeover of a pride by a new group of males is an uncertain time, when the stability of the pride is unpredictable. As a result, we might predict that females will delay ovulation until this uncertainty has passed. A long-term project working on the lions of Serengeti, Tanzania, measured the time to reproduction of female lions after losing cubs to infanticide and compared this to the time to reproduction of females that had lost their cubs to accidents (Packer and Pusey 1983). The data are given below in days. Does infanticide lead to a different mean delay to reproduction in females than when cubs die from other causes? The data are not normally distributed within groups, and we have been unable to find a transformation that makes them normal. Perform an appropriate statistical test.

Accidental death: 110, 117, 133, 135, 140, 168, 171, 238, 255.
Infanticide: 211, 232, 246, 251, 275.

9. The skin of the rough-skinned newt, Taricha granulosa, stores an extremely poisonous neurotoxin called tetrodotoxin (TTX). In some geographical areas, garter snakes, a newt predator, have evolved some resistance to this toxin.
In these areas, the newts make up a substantial part of the snakes' diet. As a first step to understanding the evolution of these traits, researchers compared resistance to TTX between two Oregon populations of garter snakes, one near Benton and the other near Warrenton (Geffeney et al. 2002). The data from 12 snakes are given in the accompanying table. Resistance is measured as the injected dose of TTX, in mass-adjusted mouse units (MAMUs), that causes a 50% reduction in crawl speed.

Locality    Resistance
Benton      0.29
Benton      0.77
Benton      0.96
Benton      0.64
Benton      0.70
Benton      0.99
Benton      0.34
Warrenton   0.17
Warrenton   0.28
Warrenton   0.20
Warrenton   0.20
Warrenton   0.37

a. Calculate summary statistics on these data and draw an appropriate graph to examine the data. Why would a two-sample t-test be an inappropriate method to test for differences in mean resistance?
b. List three appropriate methods that could be used to test for a difference in resistance between these two populations.
c. Use a log transformation to test the hypothesis that the mean resistance does not differ between the two populations. Why might a log transformation be appropriate?
d. How big is the difference between the populations? Calculate a 95% confidence interval for the difference between populations in mean log-transformed resistance.
e. Back-transform the confidence interval from part (d) to the original scale. Provide an interpretation of this back-transformed interval.

10. When producing a 95% confidence interval for the difference between the means of two groups, under what circumstances can a violation of the assumption of equal standard deviations be ignored?

11. The following are very small data sets of human birth weights (in kg) of either singleton births or individuals born with a twin. We are interested in the difference in mean weight between singleton babies and twin babies.

Singleton: 3.5, 2.7, 2.6, 4.4.
Twin: 3.4, 4.2, 1.7.

a. Construct a valid permuted sample from these data for this difference. (Note that these sample sizes are in reality too small for an effective permutation test.)
b. Assume that we wanted to test the difference in medians between these two groups. Would that change the way in which the permuted sample would be created?

12. For each of the sets of samples in the accompanying figures (a)–(e), state which approach is best if the goal is to test the difference between the means of group 1 (on the left) and group 2 (on the right). Pay careful attention to differences between samples in the scales of the x-axis. These graphs have not been drawn according to the best practices that we outlined in Chapter 2 (histograms of two samples should be displayed one above the other and on the same scale), but they are similar to those you might obtain if you plotted them separately with a computer program.

FIGURE FOR PROBLEM 12

13. The nematode Caenorhabditis elegans is often used in studies of development and genetics. This species is an unusual animal, because most C. elegans individuals are hermaphrodites. That is, each worm has both ovaries and testes and acts as both a male and a female. Typically, a C. elegans individual will produce offspring by mating with itself. Very rarely, worms occur that have testes but not ovaries, and these males must mate with another individual to produce offspring. Therefore, it is important for the males to produce sperm that can compete well for access to eggs, and one way to do so might be to have larger sperm cells.
LaMunyon and Ward (1998) measured the size of spermatids in 211 males and 700 hermaphrodites. Histograms of their findings are shown in the accompanying figure. The distribution of spermatid size was significantly different from a normal distribution, so the researchers decided to perform a Mann-Whitney U-test. They wanted a two-sided test of the difference in spermatid size between the two types of worms. They got as far as calculating the value for U, which was U = 35,910.
a. What were their hypotheses?
b. Find the P-value for their test.
c. Why might a Mann-Whitney U-test with these data be inappropriate for a test of the difference between medians?

14. Only about 6% of plant species have separate male and female individuals (a syndrome called dioecy). The rest of the species have individuals with both male and female parts (called monoecy). Why are there so many more monoecious than dioecious species of plants? One possibility is that dioecious plants have low speciation rates or high extinction rates. To test this, Heilbuth (2000) compared the numbers of species in pairs of plant taxa of similar age. In each pair, one group was monoecious and the other group was the most closely related taxon that is dioecious. The data are shown in the following table.

Taxon pair   Number of species in dioecious group   Number of species in monoecious group
1            1                                      7000
2            1                                      5000
3            1150                                   4616
4            9                                      701
5            1                                      44
6            4                                      12
7            11                                     450
8            2                                      70
9            650                                    13
10           15                                     6
11           6                                      8
12           17                                     80
13           405                                    4
14           3                                      200
15           450                                    2770
16           4                                      11
17           50                                     53
18           370                                    2639
19           2                                      40
20           50                                     6
21           7                                      47
22           700                                    235
23           10                                     11
24           2                                      5
25           400                                    5772
26           400                                    72
27           1                                      6143
28           2                                      31

a. Before testing the difference, plot the data to help you decide which test to perform.
b. Carry out an appropriate test to determine whether monoecious and dioecious groups differ in the number of species.

15. The distribution of body size of mosquitoes (as measured by weight) is known to be lognormal (that is, size is normally distributed if log-transformed). The (untransformed) weights of 11 female and nine male mosquitoes are given below in milligrams (Lounibos et al. 1995). Do the two sexes weigh the same on average?

Females: 0.291, 0.208, 0.241, 0.437, 0.228, 0.256, 0.208, 0.234, 0.320, 0.340, 0.150.
Males: 0.185, 0.222, 0.149, 0.187, 0.191, 0.219, 0.132, 0.144, 0.140.

16. One of the founders of modern population genetics, J. B. S. Haldane, was once asked if he could infer anything about God from his study of nature. He replied, "An inordinate fondness for beetles," reflecting the beetles' status as the most species-rich animal group. One hypothesis for why beetles are so common is that a large number of them are plant eaters, and they may have ridden the coattails of the flowering plants (angiosperms) as the number of flowering plant species increased over the past 140 million years. A test of this hypothesis (Farrell 1998) compared the numbers of beetle species in groups that feed on angiosperms to the number in the most closely related group that feeds on gymnosperms (non-flowering seed plants). Five such pairs of groups were compared in this way. The data are as follows:

Pair   Number of species in angiosperm eaters   Number of species in gymnosperm eaters
1      44,002                                   85
2      150                                      30
3      25,000                                   78
4      400                                      3
5      33,400                                   26

a. Are the differences in species number between these groups likely to come from a population with a normal distribution?
b. Do the groups that eat angiosperms and the groups that eat gymnosperms have different numbers of species? Carry out a formal test that does not require the assumption of a normal distribution.
c.
Comment on the P-value and your conclusion in part (b) in light of the fact that the angiosperm-eating group had more species in all five pairs. What does this say about the power of your test?

17. Of the two scatter plots provided, one represents the real relationship between body size and brain size for 29 species of dinosaur (Hurlburt 1996). The other shows the first permuted data set created from that same data. Which do you think is the real data, and why?

ASSIGNMENT PROBLEMS

18. Carotenoids are pigments responsible for much of the red we see in nature. For example, the bright red beak color of the male zebra finch is derived from carotenoids. Carotenoids are also important as antioxidants for humans and other animals, leading researchers to predict that carotenoids in birds may affect immune system function. To test this, a group of zebra finches was randomly divided into two groups (McGraw and Ardia 2003). Ten individual finches received supplemental carotenoids in their diet, and 10 individuals did not. All 20 birds were then measured using an assay that measures cell-mediated immunocompetence (PHA) as well as an assay that measures humoral immunity (SRBC). The data are given below. The researchers had independent reasons to believe that neither assay score would be normally distributed, so they preferred a nonparametric test.
a. Does PHA differ on average between the birds that received supplemental carotenoids and those that did not?
b. Does SRBC differ on average between the birds that received supplemental carotenoids and those that did not?
c. What assumptions did you require in parts (a) and (b)?

Treatment   PHA     SRBC
CAROT       1.511   2
CAROT       1.311   2
CAROT       1.460   3
CAROT       1.352   4
CAROT       1.491   5
CAROT       1.599   4
CAROT       1.653   8
CAROT       1.390   5
CAROT       1.779   7
CAROT       1.721   4
NO          1.454   4
NO          1.226   2
NO          1.198   3
NO          1.139   4
NO          1.277   3
NO          1.490   0
NO          0.912   3
NO          1.316   4
NO          1.234   2
NO          1.332   1

19. The conventional wisdom is that people who play a lot of sports have more sexual partners than those who do not. To test this, a group of French researchers asked two groups of students how many sex partners they had had in the previous year (Faurie et al. 2004). One group was composed of 250 physical education majors who regularly participated in sports; the other group was composed of 50 biology majors16 who did not regularly do sports. The distribution of number of sex partners is shown in the graph below. Biology majors claimed an average of 1.24 sex partners in the previous year, while sports majors claimed an average of 2.4 partners in the previous year. As you can see, the distributions are very different from a normal distribution, and even with a log transformation the distributions are not normal. As a result, the researchers used a Mann-Whitney U-test to compare the median numbers of sex partners for the two groups. They found that the U1 value corresponding to the biology majors was 8500.5, while the U2 value corresponding to the sports group was 3999.5.
a. Finish the Mann-Whitney U-test for them. State the hypotheses and reach a conclusion.
b. Comment on your results in light of the assumptions of the Mann-Whitney U-test.
c. These researchers would like to know the actual number of sex partners for individuals in the two groups, rather than just the claimed number. What improvements can you suggest to their study design to achieve this goal?

20. Sockeye salmon swim sometimes hundreds of miles from the Pacific Ocean, where they grow up, to rivers for spawning.
Kokanee are a type of freshwater sockeye that spend their entire lives in lakes before swimming to rivers to mate. In both types of fish, the males are bright red during mating. This red coloration is caused by carotenoid pigments, which the fish cannot synthesize but get from their food. The ocean environment is much richer in carotenoids than the lake environment, which raises the question: how do kokanee males become as red as the sockeye? One hypothesis is that the kokanee are much more efficient than the sockeye at using available carotenoids. This hypothesis was tested by an experiment in which both sockeye and kokanee individuals were raised in the lab with low levels of carotenoids in their diets (Craig and Foote 2001). Their skin color was measured electronically (as a* units on an L*a*b* standard that correlates strongly with redness). The data are as follows and are plotted in the accompanying histograms:

Kokanee: 1.11, 1.34, 1.55, 1.53, 1.50, 1.71, 1.87, 1.86, 1.82, 2.01, 1.95, 2.01, 1.66, 1.49, 1.59, 1.69, 1.80, 2.00, 2.30.
Sockeye: 0.98, 0.88, 0.97, 0.99, 1.02, 1.03, 0.99, 0.97, 0.98, 1.03, 1.08, 1.15, 0.90, 0.95, 0.94, 0.99.

a. List two methods that would be appropriate to test whether there was a difference in mean skin color between the two groups.
b. Use a transformation to test whether there is a difference in mean between these two groups. Is there a difference in the mean of kokanee and sockeye skin color?

21. In a study of the Gouldian finch, Griffith et al. (2011) looked at stress caused by having an incompatible mate. There are two genetically distinct types of Gouldian finches, one having a red face and the other having a black face. Previous experiments have shown that female finches have a strong preference for mating with males with the same face color as themselves, and that when different face types of finch mate with one another, their offspring are less likely to survive than when both parents are the same type. Researchers paired females sequentially with males of both types in random order. In other words, each female bred twice, once with a compatible male and once with an incompatible male. Each time, females produced a brood of young with the assigned male. For each pairing, the researchers measured the blood corticosterone concentration (in units of ng/ml) as an index of the amount of stress the females experienced. The corticosterone data for 43 females are given below; the two lists are paired in order, so the first value in each list belongs to the same female, the second values belong to another female, and so on.17

With compatible male: 10, 10, 39, 30, 6, 21, 0, 4, 16, 10, 23, 22, 8, 19, 22, 10, 10, 11, 22, 19, 21, 14, 11, 6, 1, 2, 9, 11, 12, 3, 3, 4, 6, 5, 7, 6, 8, 8, 21, 7, 8, 7, 8.
With incompatible male: 130, 105, 91, 82, 77, 65, 64, 60, 56, 51, 45, 50, 49, 48, 44, 44, 44, 42, 41, 37, 37, 37, 36, 36, 35, 32, 31, 30, 30, 30, 30, 30, 29, 29, 29, 28, 26, 25, 25, 25, 24, 21, 7.

a. Plot the distribution of differences in stress levels between females with compatible and incompatible mates. What trend is suggested?
b. If we wished to carry out a test of the difference between treatment means, would a paired t-test be appropriate for these data? Why or why not?
c. Would a paired t-test be appropriate after a log transformation of the differences between treatments?
d. Would a sign test be appropriate for these data? Why or why not?
e. Test the hypothesis that stress levels are the same between females with compatible and incompatible mates. Use a sign test.

22.
Using the data in Practice Problem 11, state whether each of the following sets of numbers (items a through f) is a possible permuted sample for use in testing the difference between the means of singleton and twin birth weights. If not, explain why.

   Singletons             Twins
a. 3.5, 2.7, 2.6, 4.4     3.4, 4.2, 1.7
b. 3.4, 4.2, 1.7, 3.5     2.7, 2.6, 4.4
c. 2.7, 2.6, 4.4          3.4, 4.2, 1.7, 3.4
d. 3.5, 3.5, 3.5, 3.5     3.5, 3.5, 3.5
e. 3.8, 3.8, 3.8, 3.8     3.8, 3.8, 3.8
f. 3.4, 3.5, 4.4, 3.4     4.4, 2.7, 2.6

23. One measure of the exposure of a person to tobacco smoke is the urinary cotinine-to-creatinine ratio (CCR). (Cotinine is formed in the body by breaking down nicotine.) Scientists measured this ratio in infants from smoking households (Blackburn et al. 2003). These households were divided according to their previous behavior into two groups: ones with strict controls to prevent exposure of the infant to smoke (31 babies) and another group with less strict controls (133 babies). The mean (and standard deviation) of the log-transformed CCR was 1.26 (1.58) in the babies from strict households and 2.58 (1.16) in babies from less-strict households. The distribution of the log-transformed CCR was approximately normally distributed in both groups.
a. Do babies from households with strict controls differ significantly from those with less-strict controls in their exposure to smoke? Perform an appropriate test.
b. On the non-transformed scale, how much higher is the cotinine-to-creatinine ratio for babies in the less-strict group, as compared to those in the strict group?
c. Is this an observational or experimental study? Explain.

24. Use a six-sided die to make 10 permuted data sets suitable for testing the difference between the medians of the following two groups:18

Group A: 2.1, 4.5, 7.8.
Group B: 8.9, 10.8, 12.4.

a. Write out the 10 permuted data sets.
b. Using just these 10 permuted data sets, test the null hypothesis that the two groups have the same median, using α = 0.20 for the significance level.

25. Researchers have observed that rainforest areas next to clear-cuts (less than 100 meters away) have a reduced tree biomass compared to rainforest areas far from clear-cuts. To go further, Laurance et al. (1997) tested whether rainforest areas more distant from the clear-cuts were also affected. They compiled data on the biomass change after clear-cutting (in tons/hectare/year) for 36 rainforest areas between 100 m and several kilometers from clear-cuts. The data are as follows:

−10.8, −4.9, −2.6, −1.6, −3, −6.2, −6.5, −9.2, −3.6, −1.8, −1.0, 0.2, 0.2, 0.1, −0.3, −1.4, −1.5, −0.8, 0.3, 0.6, 1.0, 1.2, 2.9, 3.5, 4.3, 4.7, 2.9, 2.8, 2.5, 1.7, 2.7, 1.2, 0.1, 1.3, 2.3, 0.5.

These measurements are plotted in the accompanying histogram. Test whether there is a change in biomass of rainforest areas following clear-cutting.

26. Male zebra finches have bright red beaks (see Assignment Problem 18), and experiments have shown that females prefer males with the reddest beaks. The red coloration in the beak comes from a class of pigments called carotenoids, which must be obtained in the diet of the birds. Pairs of brother finches were randomly assigned to alternative treatments: one was fed extra dietary carotenoids, the other was fed a diet low in carotenoids (Blount et al. 2003). Ten females were given a choice between brothers. Each female's preference was measured by the percentage of time she sat next to the carotenoid-supplemented male, standardized so that zero indicates equal time for each brother.
Here are the preference data for each of the 10 pairs: 23, 27, 57, 15, 15, 54, 34, 37, 65, 12. Choose an appropriate method and test whether females preferred one type of male over the other type.

27. The vuvuzela captured international attention during the 2010 World Cup in South Africa. In its modern incarnation, the vuvuzela is a plastic horn, about 65 cm long, that can produce a sound loud enough to cause permanent hearing damage. Blowing a vuvuzela requires a fair amount of air pressure, and Lai et al. (2011) were concerned that vuvuzela use by anyone carrying a pathogen would cause airborne contagions to be spread broadly through a crowd. They tested this idea with an experiment that compared the concentration of aerosol droplets or particles produced by people blowing vuvuzelas to that produced by the same people shouting instead. The data, measured as thousands of particles per liter, for 8 individuals are given in the table below.

Particle concentration (1000s/liter)
Person   Vuvuzela   Shouting
1        606        6.1
2        1077       6.4
3        220        1.3
4        396        1.8
5        1197       6.0
6        178        1.5
7        645        2.9
8        944        2.9

a. Take the log transformation of each value before finding differences. Then calculate a 95% confidence interval for the mean difference in log particle concentration between vuvuzelas and shouting.
b. Carry out an appropriate test for a mean difference in log particle concentration between the two forms of cheering.

28. The pseudoscorpion Cordylochernes scorpioides lives in tropical forests, where it rides on the backs of harlequin beetles to reach the decaying fig trees in which they live. Females of the species mate with multiple males over their short lifetimes, which is puzzling because mating just once provides all the sperm a female needs to fertilize her eggs. A possible advantage is that by mating multiple times, a female increases the chances of mating with at least one sperm-compatible male, if incompatibilities are present in the population. To investigate, Newcomer et al. (1999) recorded the number of successful broods by female pseudoscorpions randomly assigned one of two treatments. Females were each mated to two different males (DM treatment), or they were each mated twice to the same male (SM). This design provided the same total amount of sperm to females in both treatments, but DM females received genetically more diverse sperm than SM females did. The number of successful broods of offspring for each female is listed below. The data were not normally distributed; to test the null hypothesis of no difference between treatments in the mean number of broods, we carried out a permutation test in which the data were randomly reshuffled 10,000 times on the computer. Our test statistic was the difference between groups in the mean number of broods (SM minus DM). The observed value of this difference was −0.841. The null distribution from the 10,000 permutations is shown in the upper panel of the figure on the next page. The far left tail of the null distribution is shown in the lower panel. Numbers below each bar give the exact values of the test statistic; numbers above give the frequency of each of the values in 10,000 permutations. Using these values, carry out the permutation test.19

SM treatment: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6.
DM treatment: 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 7.

29. Despite the voracious habits of army ants, multiple species of invertebrates have managed to penetrate and exploit their societies. For example, the silverfish Malayatelura ponerophila is an insect that lives only in colonies of the Southeast Asian army ant, Leptogenys distinguenda, where it steals food brought in by the ants. If detected, the silverfish becomes ant food (see photo on the next page), but how does it usually evade detection? Ants recognize other ants from their own colony by a chemical signature, a complex blend of cuticular hydrocarbons (CHCs) on the surface of their exoskeleton. The silverfish have been observed rubbing up against the ants,20 and isotope-labeling studies indicate that ant CHCs are transferred to the silverfish this way. Does chemical mimicry contribute to ant tolerance of silverfish? In one of a series of experiments, von Beeren et al. (2011) collected individual silverfish from ant colonies and isolated them for six days, after which most of the acquired CHCs had evaporated. The aggressive behavior of the ants toward these silverfish was then compared with behavior toward control silverfish not isolated for six days. The data below measure ant aggression toward the silverfish on a scale from 0 to 1.

Control: 0.04, 0.00, 0.22, 0.10, 0.11, 0.54.
Isolated: 0.25, 1.00, 1.00, 0.42, 0.50, 1.00, 1.00, 1.00.

a. Two commonly used methods for presenting the results are shown in the accompanying figure (with standard error bars). Which method is superior? Why?
b. Without transforming the data, apply an appropriate method to test whether the aggression index by ants toward silverfish was affected by isolation.

30. Dengue infects tens of millions of people annually, resulting in more than 10,000 deaths. The disease is caused by an RNA virus, which is transmitted principally by the mosquito Aedes aegypti. Previous work in Drosophila has shown that the bacterium Wolbachia, an endosymbiont living in the cytoplasm of the insect's cells, largely confers immunity from RNA viruses. Wolbachia also affects Drosophila sexual reproduction. By biasing its transmission to offspring, the bacterium spreads through Drosophila populations over multiple generations. Might it do the same to mosquitoes, and in the process rid the world of dengue?21 Wolbachia does not occur naturally in A. aegypti, so micro-injection was used to create a laboratory strain of the mosquito (called WB1) that harbors the endosymbiont. To examine the potential effects on transmission of dengue, Bian et al. (2010) infected mosquitoes from both the WB1 strain and the original wild strain with dengue. Fourteen days later, the mosquitoes were allowed to feed on an artificial food solution for 90 min. Viral titers in the food solution were then measured (in plaque-forming units [pfu] per ml). The results are given below. Mosquitoes were tested in groups, and so each data value is an average for the group, measured in pfu/ml. The data have already been log-transformed using log(Y + 1), to better visualize the differences.

WB1: 8.0, 5.9, 4.4, 4.4, 2.4, 0.0, 0.0, 0.0.
Wild: 11.3, 10.8, 9.4, 6.5, 6.3, 5.9, 4.7, 4.2.

a. Plot the data with a strip chart. What trends are suggested?
What features of the data might lead you to conclude that even after the log transformation, they do not fit a normal distribution?
b. Rank all the above data from small to large.
c. Carry out a Mann-Whitney U-test of whether WB1 differs from the wild strain in potential dengue transmission.
d. Why do you think we used log(Y + 1) rather than log(Y) when transforming the data?
e. Would your results in (c) have been different had the data not been log-transformed? Explain.

31. The cane toad (Bufo marinus), a large, toxic toad introduced to Australia in 1935, has been rapidly spreading across the continent at over 50 km/year. Are the fastest-moving toads at the leading edge of the expanding range morphologically different from other toads that move less rapidly? Phillips et al. (2006) compared the leg lengths of individually marked toads with the distances that they moved in a three-day period. In the following pair of graphs, the relative length of the leg is compared to the movement distance (each measurement is displayed as a deviation from the mean). One of the following two graphs shows the original data, and the other shows the first permuted data set. Which is more likely to be the permuted data set? Explain your reasoning.

32. T and B lymphocytes are normal components of the immune system, but in multiple sclerosis they become autoreactive and attack the central nervous system. What triggers the autoimmune process? One hypothesis is that the disease is initiated by environmental factors, especially microbial infection. However, recent work by Berer et al. (2011) on the mouse model of the disease suggests that the autoimmune process is triggered by nonpathogenic microbes living in the gut. They compared onset of autoimmune encephalomyelitis in two treatment groups of mice from a strain that carries transgenic human CD4+ T cells, which initiate the disease. One group (GF) was kept free of nonpathogenic gut microbes and all pathogens. The other (SPF) was only pathogen-free and served as controls. The following data are measurements of the percentage of T cells producing the molecule interleukin-17, in tissue samples from 16 mice in the two groups.

SPF: 18.87, 15.65, 13.45, 12.95, 6.01, 5.84, 3.56, 3.46.
GF: 6.64, 4.51, 1.12, 0.62, 0.37, 0.61, 0.71, 0.82.

a. Use a box plot to visualize the data. What trend is suggested? In what way do the frequency distributions violate the assumptions of the two-sample t-test?
b. No transformations were effective, so we tested the difference between the medians of the two populations with a permutation test. Ten thousand randomizations were carried out, and the difference between the medians was calculated each time (SPF minus GF). Below we list the 100 largest values for the difference in median (sorted, out of 10,000). Using these values, complete the permutation test.

7.66, 7.71, 7.71, 7.71, 7.76, 7.76, 7.76, 8.425, 8.425, 8.425, 8.425, 8.425, 8.425, 8.425, 8.425, 8.425, 8.425, 8.48, 8.51, 8.51, 8.51, 8.51, 8.51, 8.51, 8.51, 8.51, 8.51, 8.51, 8.51, 8.51, 8.51, 8.51, 8.51, 8.51, 8.51, 8.565, 8.63, 8.63, 8.63, 8.63, 8.715, 8.715, 8.715, 8.715, 8.715, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.825, 8.88, 8.88, 8.88, 8.88, 8.88, 8.88, 8.88, 8.88, 8.88, 9.03, 9.03, 9.03, 9.03, 9.03, 9.03, 9.03, 9.03, 9.03, 9.03, 9.03, 9.03, 9.03, 9.03, 9.03.
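A permutation test like the ones described in Problems 28 and 32 is straightforward to program. Here is a minimal sketch in Python (an illustration, not the authors’ code; it uses the difference in medians and the SPF and GF data from Problem 32, with a fixed seed so a run is repeatable):

import random
import statistics

def perm_test_median_diff(sample1, sample2, n_perm=10000, seed=1):
    """Approximate two-sided permutation test for a difference in medians.

    Pools the observations, reshuffles them n_perm times, and counts how
    often a reshuffled difference is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = statistics.median(sample1) - statistics.median(sample2)
    pooled = list(sample1) + list(sample2)
    n1 = len(sample1)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign observations to the two groups
        diff = statistics.median(pooled[:n1]) - statistics.median(pooled[n1:])
        if abs(diff) >= abs(observed):
            count += 1
    return observed, count / n_perm  # observed statistic, approximate P-value

# The SPF and GF interleukin-17 measurements from Problem 32:
spf = [18.87, 15.65, 13.45, 12.95, 6.01, 5.84, 3.56, 3.46]
gf = [6.64, 4.51, 1.12, 0.62, 0.37, 0.61, 0.71, 0.82]
print(perm_test_median_diff(spf, gf))

Because only 10,000 of the vastly many possible permutations are sampled, the P-value is an estimate; a different seed will give a slightly different value.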
Review Problems 2

1. Women with mutations in their BRCA1 or BRCA2 genes (“carriers”) represent about 0.5% of the U.S. population (Malone et al. 2006). Women who are carriers have an 80% chance of developing breast cancer in their lifetimes (Schubert 1997).22 Those with normal versions of these genes have a 12% chance of developing breast cancer in their lifetime.
a. What is the probability that a randomly sampled woman from the U.S. population is a carrier of either BRCA1 or BRCA2?
b. If 10 women are sampled from the U.S. population, what is the probability that none are carriers?
c. If 20 women are sampled from the U.S. population, what is the probability that at least one is a carrier?
d. What is the probability that a randomly sampled woman from the U.S. population both is a carrier of a mutant gene and develops breast cancer in her lifetime?
e. What is the probability that a woman from the U.S. population develops breast cancer in her lifetime?
f. What is the probability that a randomly chosen U.S. woman who develops breast cancer in her lifetime is a carrier?
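Calculations like those in parts (b) through (f) combine the multiplication rule, the law of total probability, and Bayes’ rule, and they are easy to check by machine. A minimal sketch in Python using the probabilities stated in the problem (useful for verifying hand calculations; the variable names are ours, not the authors’):

# Probabilities given in the problem statement:
p_carrier = 0.005               # P(woman is a BRCA1 or BRCA2 carrier)
p_cancer_if_carrier = 0.80      # P(breast cancer | carrier)
p_cancer_if_noncarrier = 0.12   # P(breast cancer | not a carrier)

p_none_of_10 = (1 - p_carrier) ** 10               # (b) no carriers among 10
p_at_least_one_of_20 = 1 - (1 - p_carrier) ** 20   # (c) at least one among 20

p_carrier_and_cancer = p_carrier * p_cancer_if_carrier        # (d)
p_cancer = (p_carrier_and_cancer
            + (1 - p_carrier) * p_cancer_if_noncarrier)       # (e) total probability
p_carrier_if_cancer = p_carrier_and_cancer / p_cancer         # (f) Bayes' rule

print(p_none_of_10, p_at_least_one_of_20, p_carrier_and_cancer,
      p_cancer, p_carrier_if_cancer)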
2. Some researchers bought a new microgram scale. Before using it for new experiments, they wanted to ensure that the readings on the scale were accurate. They obtained a 10-µg standard weight and then weighed this standard 30 times. Their measures were approximately normally distributed, with an average of 10.01 µg and standard deviation 0.2 µg. Test whether the scale is accurate, using these data.

3. Some genera have far more species than others. Is this just luck, or have some genera hit upon a “key innovation” that gives them a benefit and allows more species to accumulate? One possible key innovation in plants is the ability to climb like a vine, which makes it possible to reach above other plants to compete for light. A study counted the number of species in 48 genera that had all evolved climbing (Gianoli 2004). For each of these genera, the researchers also found the most closely related nonclimbing genus and counted the number of species in each. The numbers of species in these pairs of closely related genera are listed in the table below. The number of species does not have a normal distribution in either climbing genera or nonclimbing genera, and the differences between climbing genera and their most closely related nonclimbing genera are also not normally distributed.

Climbing 14 20 13 49 Nonclimbing 3 23 3 42 Climbing 267 124 17 42 Nonclimbing 160 36 4 260 300 358 302 43 19 3 75 2 9 26 197 1 62 1 60 3 850 61 293 441 92 190 1041 13 34 240 87 29 2 50 262 8 22 157 115 35 130 15 650 845 6 30 307 2888 11 39 7 12 5 1 25 34 1 7 427 710 1600 18 4 150 350 142 525 24 180 320 625 306 1 2 3 3228 318 9 2 54 702 635 31 6

a. Plot a histogram of the difference in number of species for each pair.
b. Carry out an appropriate test of the difference in the number of species in climbing and non-climbing genera.

4. In the early days of genetics, scientists realized that Mendel’s laws of segregation could be used to predict that the second generation after a cross between two pure strains (the F2) ought to have greater variance than the first generation after the cross (the F1). One early test of this prediction was done by measuring flower length in a cross between two varieties of tobacco (East 1916). The following is a frequency table of the resulting individuals in both the first and second generations:

Flower length (mm)   Number of F1 plants   Number of F2 plants
52                         0                      3
55                         4                      9
58                        10                     18
61                        41                     47
64                        75                     55
67                        40                     93
70                         3                     75
73                         0                     60
76                         0                     43
79                         0                     25
82                         0                      7
85                         0                      8
88                         0                      1

                      F1        F2
n                    173       444
Mean               63.53     68.76
Variance            8.62     42.37

a. Show how the means and variances were calculated for the F1 data.
b. Test whether the variances of the two groups of plants differ, making all necessary assumptions. Is one significantly greater than the other? If so, which one?

5. Males of the Australian butterfly Jalmenus evagoras search for females and mate with them just as the females emerge as adults from the pupae. Multiple males might discover the same female, in which case they all attempt to mate with the same female as she emerges, forming a “mating ball.” Females mate only once, and it is possible to record which male successfully mated with every female in a local population. In this way, Elgar and Pierce (1988) were able to track the mating success of 35 individual male J. evagoras butterflies over their complete life spans. Mating success is indicated in the accompanying table.

Lifetime number of mates    0    1    2    3    4    5    6    7
Frequency of males         20    9    1    1    2    1    0    1

a. Show these data in a graph.
b. Which probability distribution is expected to fit these data if all males have an equal probability of mating and if mating events are independent?
c. Calculate the expected frequencies of males having 0, 1, 2, ..., 7 mates under this probability distribution.
d. Add the expected frequencies to the graph you drew in part (a). Describe the pattern of differences between the observed and expected frequencies.
e. Are the differences between the observed and expected frequencies greater than we would expect by chance? Carry out a formal hypothesis test.

6. The males of some cichlid fish species are infertile until a few days after they become the socially dominant male in the presence of females. Males without a territory (and therefore without a hope of mating) have atrophied genitalia, whereas males who control a territory with females around have well-developed genitalia. They can shift from one state to the other in a matter of days. White et al. (2002) wanted to know the hormonal signal for this shift, and one candidate hormone was gonadotropin-releasing hormone (GnRH). They measured the levels of messenger RNA (mRNA) of GnRH for five territorial fish (T) and for six non-territorial fish (NT), in units of optical density relative to a known control. The data are listed below. Both distributions have positive skew (skewed to the right).

NT: 0.504, 0.432, 0.744, 0.792, 0.672, 1.344
T: 1.152, 1.272, 2.328, 3.288, 0.888

a. What procedure or procedures could be used to test for a difference in the mean GnRH levels between the T and NT populations?
b. Perform an appropriate hypothesis test for this difference.
c. Calculate a 95% confidence interval for the difference in GnRH mRNA level means between these two groups.

7. Every year Britain has a No Smoking Day, when many people voluntarily stop smoking for a day. No Smoking Day occurs on the second Wednesday of March each year. Waters et al. (1998) used this event to investigate the influence of stopping smoking on nonfatal injuries on the job. They compared the injury rate each year on the Wednesday of No Smoking Day to the rate for the previous Wednesday of the same year.
The idea was that this comparison would control for many of the other factors that affect injury rate, such as year, time of week, and so on. The data from 1987 to 1996 (number of injuries in one day) are listed in the following table:

Year   Injuries before No Smoking Day   Injuries on No Smoking Day
1987                516                            540
1988                610                            620
1989                581                            599
1990                586                            639
1991                554                            607
1992                632                            603
1993                479                            519
1994                583                            560
1995                445                            515
1996                522                            556

a. How many more or fewer injuries are there on No Smoking Day, on average, compared with the normal day?
b. What is the 99% confidence interval for this difference?
c. In your own words, explain what the 99% confidence interval means.
d. Test whether the accident rate changes on No Smoking Day.

8. In the women’s tennis finals at Wimbledon, the match winner is the first player to win two sets. Sets continue one after the other until there is a winner and loser; there are no ties. If one woman wins the first two sets, the match is finished at two sets. The maximum possible number of sets in a match is three.
a. Imagine that two women are equal in ability, so the probability of each woman winning any single set is 0.50. Use a probability tree to find the probability that a match lasts exactly two sets. What is the probability that the match lasts exactly three sets?
b. Imagine that one woman was better than the other, such that the probability of her victory in any set is 0.55. Use a probability tree to find the probability that the weaker woman would win the match.

9. The pitcher plant from Borneo, Nepenthes rafflesiana, has two kinds of pitchers, upper and lower, that it uses to trap and digest insects. The upper pitchers use fragrance to lure mainly flying insects. The lower pitchers trap mainly ants. Di Giusto et al. (2010) tested whether odors emitted by the lower pitchers were cues for ant attraction. In one experiment, they presented individual ants, Oecophylla smaragdina, with air from a bag containing a lower pitcher in one arm of a Y-tube. Humidified air was provided in the other arm as a control. In 14 of 19 independent trials, ants chose the arm containing the pitcher, whereas the remaining five ants chose the control arm. Do these data indicate a preference for one arm or the other? Use an exact test.

10. For each of the following scenarios, state the null hypothesis and identify the best statistical test to use to answer the question stated. (Don’t try to answer the specific question.)
a. Do stickleback fish occur with equal probability through all areas of a pond?
b. A large number of Douglas-fir and Western hemlock trees were sampled, and the presence or absence of pine beetles on each tree was recorded. Do the tree species differ in pine beetle occurrence?
c. A small number of Douglas-fir and Western hemlock trees were sampled, and the presence or absence of pine beetles was recorded. Do the species differ in occurrence of pine beetles? The expected number of infested fir trees is calculated as 2.3.
d. Do patients change in mean body mass during a hospital stay?
e. Does the amount of rainfall per day in a rainforest have a normal distribution?
f. Which sex weighs more on average in bald eagles: males or females? Assume that the distribution of body weight in each sex is normally distributed, but that the two sexes have markedly different variances.
g. Which sex travels more per day, on average, in sperm whales: males or females?
Assume that the distribution of distance traveled is very different from a normal distribution in each sex (but similar between sexes) and that sample size is small.
h. Do cats have greater strength in their dominant front paw (usually the right) than in their other front paw (usually the left)? The data are measurements of strength in the dominant paw and the other paw of a random sample of cats.
i. Does the mean number of chirps per minute by male crickets differ when the same crickets are measured at 15°C and at 25°C?
j. The data are water samples taken at the local beach. Does the mean number of bacteria differ from 130 individuals per 100 ml?

11. What feature of an estimate—precision or accuracy—is most strongly affected when individuals differing in the variable of interest are not sampled independently?

12. Body size in female northern fur seals (Callorhinus ursinus), measured as total length, is approximately normally distributed with a mean of 124.6 cm and a standard deviation equal to 6.5 cm (Trites 1996).
a. About what fraction of individuals have a total body length less than 110 cm?
b. What fraction of female fur seals have a body length between 130 and 140 cm?
c. What fraction have a body length between 120 and 125 cm?

13. Gesturing is common during human speech. Is this behavior learned via exposure? A measure was made of the number of gestures produced by each of 12 pairs of sighted individuals while talking to sighted individuals (Iverson and Goldin-Meadow 1998). This result was compared with the number of gestures produced while talking by each of 12 pairs of people who had been blind since birth and were therefore presumably unexposed to the gestures of others. The data23 are as follows:
Blind: 0, 1, 1, 2, 1, 1, 1, 3, 1, 0, 1, 1
Sighted: 1, 0, 1, 2, 3, 0, 1, 2, 2, 0, 3, 1
Test the hypothesis that the number of gestures is related to sightedness, using a nonparametric test.

14. Kids are often told that they should not crack their knuckles, because otherwise all sorts of terrible things may befall them. It is commonly believed that knuckle cracking leads to arthritis, which de Weber et al. (2011) recently tested in a case-control study. Of 135 patients with osteoarthritis (cases), 24 had frequently cracked their knuckles. Of 80 control patients without osteoarthritis, 19 had frequently cracked their knuckles.
a. What type of graph is ideal for displaying these results?
b. What is the odds ratio for osteoarthritis, comparing knuckle-crackers to noncrackers?
c. Give a 95% confidence interval for this odds ratio.
d. Carry out a formal hypothesis test of whether knuckle cracking is associated with osteoarthritis.

15. Have you ever had the experience that driving somewhere seems to take a really long time, but the trip back home goes faster, even though it is the same distance in reverse? Van de Ven et al. (2011) wanted to investigate how common this subjective experience was. They interviewed 69 people who had just been on trips where the outbound and inbound travel time was the same and who had been awake the whole time. They asked the people to evaluate the trips on an 11-point numeric scale, from −5 (return trip was a lot shorter) to +5 (return trip was a lot longer). The data are given below in a frequency table.

Return trip time score              −5   −4   −3   −2   −1    0    1    2    3    4    5
Frequency (number of respondents)    1    4   11    9    6   20    6    4    6    2    0

a. What is the mean of the return trip time score?
b. Calculate the 95% confidence interval of the mean return trip time score.
c. It is also interesting to know to what extent people experience this subjective impression about travel time in the same way. What is the 95% confidence interval of the variance in the return trip time score?

16. In Chapter 3, Assignment Problem 22, you produced a box plot for the following data from Norton et al. (2011). The data are measurements of the amount of time, in seconds, that individual zebrafish with and without the spiegeldanio (spd) mutation at the Fgfr1a gene spent in aggressive activity over 5 minutes when presented with a mirror image of themselves. The researchers were interested in the role this gene plays in differences between individuals along the shy-bold behavioral spectrum.
Spd mutant: 96, 97, 100, 127, 128, 156, 162, 170, 190, 195
Wild type: 0, 21, 22, 28, 60, 80, 99, 101, 106, 129, 168
a. With these data, estimate the magnitude of the effect of the mutation (the difference between the means) on the amount of time spent in aggressive activity. Put appropriate bounds on your estimate of the effect.
b. What is the weight of evidence that this effect is not zero? Perform an appropriate statistical test of the difference.

14 Designing experiments

Two types of investigations are carried out in biology: observational and experimental. In an experimental study, the researcher assigns treatments to units or subjects so that differences in response can be compared. In an observational study, on the other hand, nature does the assigning of treatments to subjects. The researcher has no influence over which subjects receive which treatment.

What’s so important about the distinction? Whereas observational studies can identify associations between treatment and response variables, properly designed experimental studies can identify the causes of those associations.

How do we best design an experiment to get the most information possible out of it? The short answer is that we must design to eliminate bias and to reduce the influence of sampling error. The present chapter outlines the basics of how to accomplish this feat. We also briefly discuss how to design an observational study: by taking the best features of experimental designs and incorporating as many of them as possible. Finally, we discuss how to plan the sample size needed in an experimental or observational study.

Why do experiments?

In an experimental study, there must be at least two treatments, and the experimenter (rather than nature) must assign them to units or subjects. The crucial advantage of experiments derives from the random assignment of treatments to units. Random assignment, or randomization, minimizes the influence of confounding variables (Interleaf 4), allowing the experimenter to isolate the effects of the treatment variable.

Confounding variables

Studies in biology are usually carried out with the aim of deciding how an explanatory variable or treatment affects a response variable. How are injury rates in cats with “high-rise syndrome” affected by the number of stories fallen? What is the effect of marine reserves on fish biomass? How does the use of supplemental oxygen affect the probability of surviving an ascent of Mount Everest? The easiest way to address these questions is with an observational study—that is, to gather measurements of both variables of interest on a set of subjects and estimate the association between them. If the two variables are correlated or associated, then one may be the cause of the other.
The limitation of the observational approach is that, by itself, it cannot distinguish between two completely different reasons behind an association between an explanatory variable X and a response variable Y. One possibility is that X really does cause a response in Y. For example, taking supplemental oxygen might increase the chance of survival during a climb of Mount Everest. The other possibility is that the explanatory variable X has no effect at all on the response variable Y; they are associated only because other variables affect both X and Y at the same time. For example, the use of supplemental oxygen might just be a benign indicator of a greater overall preparedness of the climbers who use it, and greater preparedness rather than oxygen use is the real cause of the enhanced survival. Variables (like preparedness) that distort the causal relationship between the measured variables of interest (oxygen use and survival) are called confounding variables. Recall from Interleaf 4, for example, that ice cream consumption and violent crime are correlated, but neither is the cause of the other. Instead, increases in both ice cream consumption and crime are caused by higher temperatures. Temperature is a confounding variable in this example.

A confounding variable is a variable that masks or distorts the causal relationship between measured variables in a study.

Confounding variables bias the estimate of the causal relationship between measured explanatory and response variables, sometimes even reversing the apparent effect of one on the other. For example, observational studies have indicated that breast-fed babies have lower weight at six and 12 months of age compared with formula-fed infants (Interleaf 4). But an experimental study using randomization found that mean infant weight was actually higher in breast-fed babies at six months of age and was not less than that in formula-fed babies at 12 months (Kramer et al. 2002). The observed relationship between breast feeding and infant growth was confounded by unmeasured variables such as the socioeconomic status of the parents.

With an experiment, random assignment of treatments to participants allows researchers to tease apart the effects of the explanatory variable from those of confounding variables. With random assignment, no confounding variables will be associated with treatment except by chance. For example, if women who choose to breast-feed their babies have a different average socioeconomic background than women who choose to feed their infants formula, randomly assigning the treatments “breast feeding” and “formula feeding” to women in an experiment will break this connection, roughly equalizing the backgrounds of the two treatment groups. In this case, any resulting difference between groups in infant weight (beyond chance) must be caused by treatment.

Experimental artifacts

Unfortunately, experiments themselves might inadvertently create artificial conditions that distort cause and effect. Experiments should be designed to minimize artifacts.

An experimental artifact is a bias in a measurement produced by unintended consequences of experimental procedures.

For example, experiments conducted on aquatic birds have shown that their heart rates drop sharply when they are forcibly submerged in water, compared with individuals remaining above water. The drop in heart rate has been interpreted as an oxygen-saving response.
Later studies using improved technology showed that voluntary dives do not produce such a large drop in heart rate (Kanwisher et al. 1981). This finding suggested that a component of the heart rate response in forced dives was induced by the stress of being forcibly dunked underwater, rather than the dive itself. The experimental conditions introduced an artifact that for a while went unrecognized.

To prevent artifacts, experimental studies should be conducted under conditions that are as natural as possible. A potential drawback is that more natural conditions might introduce more sources of variation, reducing power and precision. Observational studies can provide important insight into what is the best setting for an experiment.

Lessons from clinical trials

The gold standard of experimental designs is the clinical trial, an experimental study in which two or more treatments are assigned to human participants. The design of clinical trials has been refined because the cost of making a mistake with human participants is so high. Experiments on nonhuman subjects are simply called “laboratory experiments” or “field experiments,” depending on where they take place. Experimental studies in all areas of biology have been greatly informed by procedures used in clinical trials.

A clinical trial is an experimental study in which two or more treatments are applied to human participants.

Before we dig into the logic of the main components of experimental design, let’s look at the clinical trial in Example 14.2, which incorporates many of these features.

EXAMPLE 14.2 Reducing HIV transmission

Transmission of the HIV-1 virus via sex workers contributes to the rapid spread of AIDS in Africa. How can this transmission be reduced? In laboratory experiments, the spermicide nonoxynol-9 had shown in vitro activity against HIV-1. This finding motivated a clinical trial by van Damme et al. (2002), who tested whether a vaginal gel containing the chemical would reduce female sex workers’ risk of acquiring the disease. Data were gathered on a volunteer sample of 765 HIV-free sex workers in six clinics in Asia and Africa. Two gel treatments were assigned randomly to women at each clinic. One gel contained nonoxynol-9, and the other contained a placebo (an inactive compound that participants could not distinguish from the treatment of interest). Neither the participants nor the researchers making observations at the clinics knew who had received the treatment and who had received the placebo. (A system of numbered codes kept track of who got which treatment.)

By the end of the experiment, 59 of 376 women in the nonoxynol-9 group (15.7%) were HIV-positive (Table 14.2-1), compared with 45 of 389 women in the placebo group (11.6%). Thus, the odds of contracting HIV-1 were slightly higher in the nonoxynol-9 group than in the placebo group—which was the opposite of the expected result. The reason seems to be that repeated use of nonoxynol-9 causes tissue damage that leads to higher risk.

TABLE 14.2-1 Results of the clinical trial in Example 14.2 (n is the number of subjects).
             Nonoxynol-9              Placebo
Clinic       n     Number infected    n     Number infected
Abidjan      78          0            84          5
Bangkok      26          0            25          0
Cotonou     100         12           103         10
Durban       94         42            93         30
Hat Yai 2    22          0            25          0
Hat Yai 3    56          5            59          0
Total       376         59           389         45

Design components

A good experiment is designed with two objectives in mind:
■ To reduce bias in estimating and testing treatment effects
■ To reduce the effects of sampling error

The most significant elements in the design of the clinical trial in Example 14.2 addressed these two objectives. To reduce bias, the experiment included the following elements.
1. A simultaneous control group: the study included both the treatment of interest and a control group (the women receiving the placebo).
2. Randomization: treatments were randomly assigned to women at each clinic.
3. Blinding: neither the participants nor the clinicians knew which women were assigned which treatment.

To reduce the effects of sampling error, the experiment included these elements.
1. Replication: the study was carried out on multiple independent participants.
2. Balance: the number of women was nearly equal in the two groups at every clinic.
3. Blocking: participants were grouped according to the clinic they attended, yielding multiple repetitions of the same experiment in different settings (i.e., “blocks”).

The goal of experimental design is to eliminate bias and to reduce sampling error when estimating and testing the effects of one variable on another.

In Section 14.3, we discuss the virtues of the three main strategies used to reduce bias—namely, simultaneous controls, randomization, and blinding. In Section 14.4, we explain the strategies used to reduce the effects of sampling error—namely, replication, balance, and blocking. As usual, we assume throughout that units or subjects have been randomly sampled from the population of interest.

How to reduce bias

We have seen how confounding variables in observational studies can bias the estimated effects of an explanatory variable on a response variable. The following experimental procedures are meant to eliminate bias.

Simultaneous control group

A control group is a group of subjects who are treated like all of the experimental subjects, except that the control group does not receive the treatment of interest.

A control group is a group of subjects who do not receive the treatment of interest but who otherwise experience similar conditions as the treated subjects.

In an uncontrolled experiment, a group of subjects are treated in some way and then measured to see how they have responded. Lacking a control group for comparison, such a study cannot determine whether the treatment of interest is the cause of any of the observed changes. There are several possible reasons for this, including the following:
■ Sick human participants selected for a medical treatment may tend to “bounce back” toward their average condition regardless of any effect of the treatment (Interleaf 6).
■ Stress and other impacts associated with administering the treatment (such as surgery or confinement) might themselves produce a response separate from the effect of the treatment of interest.
■ The health of human participants often improves after treatment merely because of their expectation that the treatment will have an effect. This phenomenon is known as the placebo effect (Interleaf 6).

The solution to all of these problems is to include a control group of subjects measured for comparison.
The treatment and control subjects should be tested simultaneously or in random order, to ensure that any temporal changes in experimental conditions do not affect the outcome. The appropriate control group will depend on the circumstance. Here are some examples:
■ In clinical trials, either a placebo or the currently accepted treatment should be provided, such as in Example 14.2. A placebo is an inactive treatment that subjects cannot distinguish from the main treatment of interest.
■ In experiments requiring intrusive methods to administer treatment, such as injections, surgery, restraint, or confinement, the control subjects should be perturbed in the same way as the other subjects, except for the treatment itself, as far as ethical considerations permit. The “sham operation,” in which surgery is carried out without the experimental treatment itself, is an example. Sham operations are very rare in human studies, but they are more common in animal experiments.
■ In field experiments, applying a treatment of interest may physically disturb the plots receiving it and the surrounding areas, perhaps by the researchers trampling. Ideally, the same disturbance should be applied to the control plots.

Often it is desirable to have more than one control group. For example, two control groups, where one is a harmless placebo and the other is the best existing treatment, may be used in a study so that the total effect of the treatment and the improvement of the new treatment over the old may both be measured. However, using resources for multiple controls might reduce the power of the study to test its main hypotheses.

Randomization

Once the treatments have been chosen, the researcher should randomize their assignment to units or subjects in the sample. Randomization means that treatments are assigned to units at random, such as by flipping a coin. Chance rather than conscious or unconscious decision determines which units end up receiving the treatment of interest and which receive the control. A completely randomized design is an experimental design in which treatments are assigned to all units by randomization.

Randomization is the random assignment of treatments to units in an experimental study.

The virtue of randomization is that it breaks the association between possible confounding variables and the explanatory variable, allowing the causal relationship between the explanatory and response variables to be assessed. Randomization doesn’t eliminate the variation contributed by confounding variables, only their correlation with treatment. It ensures that variation from confounding variables is spread more evenly between the different treatment groups, and so it creates no bias. If randomization is done properly, any remaining influence of confounding variables occurs by chance alone, which statistical methods can account for.

Randomization should be carried out using a random process. The following steps describe one way to assign treatments randomly:
1. List all n subjects, one per row, in a computer spreadsheet.
2. Use the computer to give each individual a random number.1
3. Assign treatment A to those subjects receiving the lowest numbers and treatment B to those with the highest numbers.

This process is demonstrated in Figure 14.3-1, where eight subjects are assigned to two treatments, A and B.

FIGURE 14.3-1 A procedure for randomization. Each of eight subjects was assigned a number between 0 and 99 that was drawn at random by a computer. Treatment A (colored red) was assigned to the four subjects with the lowest random numbers, whereas treatment B (gold) was assigned to the rest.
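Steps 2 and 3 of this procedure take only a few lines of code. A minimal sketch in Python (the subject labels are placeholders; any software with a trustworthy random-number generator serves equally well):

import random

def assign_treatments(subjects, seed=None):
    """Completely randomized design for two treatments, following the
    steps above: one random number per subject, lowest half gets A."""
    rng = random.Random(seed)
    keyed = [(rng.random(), s) for s in subjects]   # step 2: random numbers
    keyed.sort()                                    # order by random number
    half = len(subjects) // 2
    return {s: ("A" if i < half else "B")           # step 3: lowest half -> A
            for i, (_, s) in enumerate(keyed)}

print(assign_treatments(["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]))

Supplying a fixed seed makes the assignment reproducible; omitting it gives a fresh randomization on every run.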
Other ways of assigning treatments to subjects are almost always inferior, because they do not eliminate the effects of confounding variables. For example, the following methods can lead to problems:
■ Assign treatment A to all patients attending one clinic and treatment B to patients attending a second clinic. (Problem: All of the other differences between the two clinics become confounding variables. If one clinic is better than the other in general, then the difference in clinic quality would show up as a difference in treatments.)
■ Assign treatments to human participants alphabetically. (Problem: This might inadvertently group individuals having the same nationality, generating unwanted differences between treatments in health histories and genetic variables.)

It is important to use a computer random-number generator or random-number tables to assign individuals randomly to treatments. “Haphazard” assignment, in which the researcher chooses a treatment while trying to make it random, has repeatedly been shown to be nonrandom and prone to bias.2

Blinding

The process of concealing information from participants and researchers about which of them receive which treatments is called blinding. Blinding prevents participants and researchers from changing their behavior, consciously or unconsciously, based on their knowledge of which treatment they were receiving or administering. For example, a researcher who believes that acupuncture helps alleviate back pain might unconsciously interpret a patient’s report of pain differently if the researcher knows the patient was assigned the acupuncture treatment instead of a placebo. This might explain why studies that have shown acupuncture has a significant effect on back pain are limited to those without blinding (Ernst and White 1998). Studies implementing blinding have not found that acupuncture has an ameliorating effect on back pain.

Blinding is the process of concealing information from participants (sometimes including researchers) about which individuals receive which treatment.

In a single-blind experiment, participants are unaware of the treatment they have been assigned. This requires that the treatments be indistinguishable to the participants, a particular necessity in experiments involving humans. Single-blinding prevents participants from responding differently according to their knowledge of their treatment. This is not much of a concern in nonhuman studies.

In a double-blind experiment, the researchers administering the treatments and measuring the response are also unaware of which subjects are receiving which treatments. This prevents researchers who are interacting with the subjects from behaving differently toward them according to their treatments. Researchers sometimes have pet hypotheses, and they might treat experimental subjects in different ways depending on their hopes for the outcome. Moreover, many response variables are difficult to measure and require some subjective interpretation, which makes the results prone to a bias in favor of the researchers’ wishes and expectations. Finally, researchers are naturally more interested in the treated subjects than the control subjects, and this increased attention can itself result in improved response.
Reviews of medical studies have revealed that studies carried out without double-blinding exaggerated treatment effects by 16% on average, compared with studies carried out with double-blinding (Jüni et al. 2001). Experiments on nonhuman subjects are also prone to bias from lack of blinding. Bebarta et al. (2003) reviewed 290 two-treatment experiments carried out on animals or on cell lines. They found that the odds of detecting a positive effect of treatment were more than threefold higher in studies without blinding than in studies with blinding.3 Blinding can be incorporated into experiments on nonhuman subjects by using coded tags that identify the subject to a “blind” observer without revealing the treatment (and then the observer measures units from different treatments in random order).

How to reduce the influence of sampling error

Assuming we have designed our experiment to minimize sources of bias, there is still the problem of detecting any treatment effects against the background of variation between individuals (“noise”) caused by other variables. Such variability creates sampling error in the estimates, reducing precision and power. How can the effects of sampling error be minimized?

One way to reduce noise is to make the experimental conditions constant. Fix the temperature, humidity, and other environmental conditions, for example, and use only participants who are of the same age, sex, genotype, and so on. In field experiments, however, highly constant experimental conditions might not be feasible. Constant conditions might not be desirable, either. By limiting the conditions of an experiment, we also limit the generality of the results—that is, the conclusions might apply only under the conditions tested and not more broadly. Until recently, a significant source of bias in medical practice stemmed from the fact that many clinical tests of the effects of medical treatments were carried out only on men, yet the treatments were subsequently applied to women as well (e.g., McCarthy 1994).

In this section, we review replication, balance, and blocking, the three main statistical design procedures used to minimize the effects of sampling error. We also review a strategy to reduce the effect of noise by using extreme treatments.

Replication

Because of variability, replication—the repetition of every treatment on multiple experimental units—is essential. Without replication, we would not know whether response differences were due to the treatments or just chance differences between the treatments caused by other factors. Studies that use more units (i.e., that have larger sample sizes) will have smaller standard errors and a higher probability of getting the correct answer from a hypothesis test. Larger samples give more information, and more information gives better estimates and more powerful tests.

Replication is the application of every treatment to multiple, independent experimental units.

Replication is not just about the number of plants or animals used. True replication depends on the number of independent units in the experiment. An “experimental unit” is the independent unit to which treatments are assigned. Figure 14.4-1 shows three hypothetical experiments designed to compare the effects of two fertilizer treatments on plant growth. The lack of replication is obvious in the first design (top panel), because there is only one plant per treatment. You won’t see many published experiments like it.
FIGURE 14.4-1 Three experimental designs used to compare plant growth under two fertilizer treatments (indicated by the shading of the pots). The upper (“two pots”) and middle (“two chambers”) designs are unreplicated.

The lack of replication is less obvious in the second design (the middle panel of Figure 14.4-1). Although there are multiple plants per treatment in the second design, all plants in one treatment are confined to one chamber and all plants in the second treatment are confined to another chamber. If there are environmental differences between chambers (e.g., differences in light conditions or humidity) beyond those stemming from the treatment itself, then plants in the same chamber will be more similar in their responses than plants in different chambers, apart from any treatment effects. The plants in the same chamber are not independent. As a result, the chamber, not the plant, is the experimental unit in a test of fertilizer effects. Because there are only two chambers, one per treatment, the experiment is unreplicated.

Only the third design (the bottom panel) in Figure 14.4-1 is properly replicated, because here treatments have been randomly assigned to individual plants. A giveaway indicator of replication in the third design is interspersion of experimental units assigned different treatments, which is an expected outcome of randomization. Such interspersion is lacking in the two-chamber design (the middle panel in Figure 14.4-1), which is a clear sign of a replication problem.

An experimental unit might be a single animal or plant if individuals are randomly sampled and assigned treatments independently. Or, an experimental unit might be made up of a batch of individual organisms treated as a group, such as a field plot containing multiple individuals, a cage of animals, a household, a petri dish, or a family. Multiple individual organisms belonging to the same unit (e.g., plants in the same plot, bacteria in the same dish, members of the same family, and so on) should be considered together as a single replicate if they are likely to be more similar on average to each other than to individuals in separate units (apart from the effects of treatment). Correctly identifying replicates in an experiment is crucial to planning its design and analyzing the results. Erroneously treating the single organism as the independent replicate when the chamber (Figure 14.4-1) or field plot is the experimental unit will lead to calculations of standard errors and P-values that are too small. This is pseudoreplication, as discussed in Interleaf 2.

From the standpoint of reducing sampling error, more replication is always better. As proof, examine the formula for the standard error of the difference between two sample mean responses to two treatments, $\bar{Y}_1 - \bar{Y}_2$:

$$\mathrm{SE}_{\bar{Y}_1 - \bar{Y}_2} = \sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}.$$

The symbols $n_1$ and $n_2$ refer to the number of experimental units, or replicates, in each of the two treatments. Based on this equation, increasing $n_1$ and $n_2$ directly reduces the standard error, increasing precision. Increased precision yields narrower confidence intervals and more powerful tests of the difference between means. On the other hand, increasing sample size also has costs in terms of time, money, and even lives. We discuss how to plan a sufficient sample size in more detail in Section 14.7.
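To see numerically how replication shrinks this standard error, the formula can be evaluated for increasing sample sizes. A small sketch in Python (the pooled variance used here is a hypothetical value chosen for illustration):

from math import sqrt

def se_diff(sp2, n1, n2):
    """Standard error of the difference between two treatment means,
    given the pooled sample variance sp2 and the two sample sizes."""
    return sqrt(sp2 * (1 / n1 + 1 / n2))

sp2 = 4.0  # hypothetical pooled variance
for n in (2, 5, 10, 20, 40):
    print(n, round(se_diff(sp2, n, n), 3))
# Each fourfold increase in replication halves the standard error.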
Balance

A study design is balanced if all treatments have the same sample size. Conversely, a design is unbalanced if there are unequal sample sizes between treatments.

In a balanced experimental design, all treatments have equal sample size.

Balance is a second way to reduce the influence of sampling error on estimation and hypothesis testing. To appreciate this, look again at the equation for the standard error of the difference between two treatment means (given above). For a fixed total number of experimental units, $n_1 + n_2$, the standard error is smallest when the quantity $\left( \frac{1}{n_1} + \frac{1}{n_2} \right)$ is smallest, which occurs when $n_1$ and $n_2$ are equal. Convince yourself that this is true by plugging in some numbers. For example, if the total number of units is 20, the quantity $1/n_1 + 1/n_2$ is 0.2 when $n_1 = n_2 = 10$, but it is 1.05 when $n_1 = 19$ and $n_2 = 1$. With better balance, the standard error is much smaller.

To estimate the difference between two groups, we need precise estimates of the means of both groups. With an unbalanced design, we may know the mean of one group with great precision, but this does not help us much if we have very little information about the other group that we’re comparing it with. Balance allocates the sampling effort in the optimal way. Nevertheless, the precision of an estimate of a difference between groups always increases with larger sample sizes, even if the sample size is increased in only one of two groups. But for a fixed total number of subjects, the optimal allocation is to have an equal number in each group.

Balance has other benefits, which we discuss elsewhere in the book. For example, the methods based on the normal distribution for comparing population means are most robust to departures from the assumption of equal variances when designs are balanced or nearly so (see Chapters 12 and 15).
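The “plugging in some numbers” suggested above is quickly done by machine. A sketch in Python comparing several allocations of 20 units between two groups:

# For a fixed total of 20 experimental units, 1/n1 + 1/n2 is smallest
# when the design is balanced (n1 = n2 = 10).
total = 20
for n1 in (1, 5, 10, 15, 19):
    n2 = total - n1
    print(n1, n2, round(1 / n1 + 1 / n2, 3))
# The output includes 10 10 0.2 and 19 1 1.053, matching the text.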
Blocking

Blocking is an experimental design strategy used to account for extraneous variation by dividing the experimental units into groups, called blocks or strata, that share common features. Within blocks, treatments are assigned randomly to experimental units. Blocking essentially repeats the same completely randomized experiment multiple times, once for each block, as shown schematically in Figure 14.4-2. Differences between treatments are evaluated only within blocks. In this way, much of the variation arising from differences between blocks is accounted for and won’t reduce the power of the study.

Blocking is the grouping of experimental units that have similar properties. Within each block, treatments are randomly assigned to experimental units.

FIGURE 14.4-2 An experimental design incorporating blocking to test effects of fertilizer on plant growth (see Figure 14.4-1). Shading of the pots indicates which fertilizer treatment each plant received. Chambers might differ in unknown ways and add unwanted noise to the experiment. To remove the effects of such variation, carry out the same completely randomized experiment separately within each chamber. In this design, each chamber represents one block.

The women participating in the nonoxynol-9 HIV study discussed in Example 14.2 were grouped according to the clinic they attended. This made sense because there were age differences between women attending different clinics as well as differences in condom usage and sexual practices, all of which are likely to affect HIV transmission rates (van Damme et al. 2002). Blocking removes the variation in response among clinics, allowing more precise estimates and more powerful tests of the treatment effects.

The paired design for two treatments (Chapter 12) is an example of blocking. In a paired design, both of two treatments are applied to each plot or other experimental unit representing a block. The difference between the two responses made on each block is the measure of the treatment effect. The randomized block design is analogous to the paired design, but it can have more than two treatments, as shown in Example 14.4A.

The randomized block design is like a paired design but for more than two treatments.

EXAMPLE 14.4A Holey waters

The compact size of water-filled tree holes, which can harbor diverse communities of aquatic insect larvae, makes them useful microcosms for ecological experiments. Srivastava and Lawton (1998) made artificial tree holes from plastic that mimicked the buttress tree holes of European beech trees (see image on right). They placed the plastic holes next to trees in a forest in southern England to examine how the amount of decaying leaf litter present in the holes affected the number of insect eggs deposited (mainly by mosquitoes and hover flies) and the survival of the larvae emerging from those eggs. Leaf litter is the source of all nutrients in these holes, so increasing the amount of litter might result in more food for the insect larvae.

There were three different treatments. In one treatment (LL), a low amount of leaf litter was provided. In a second treatment (HH), a high level of debris was provided. In the third treatment (LH), leaf litter amounts were initially low but were then made high after eggs had been deposited. A randomized block design was used in which artificial tree holes were laid out in triplets (blocks). Each block consisted of one LL tree hole, one HH tree hole, and one LH tree hole. The location of each treatment within a block was randomized, as shown in Figure 14.4-3.

FIGURE 14.4-3 Schematic of the randomized block design used in the tree-hole study of Example 14.4A. Each block of three tree holes was placed next to its own beech tree in the woods. Within blocks, the three treatments were randomly assigned to tree holes.

As in the paired design, treatment effects in a randomized block design are measured by differences between treatments exclusively within blocks, a strategy that minimizes the influence of variation among blocks. In the randomized block design, each treatment is applied once to every block. By accounting for some sources of sampling variation, such as the variation among trees, blocking can make differences between treatments stand out. In Chapter 18, we discuss in greater detail how to analyze data from a randomized block design.

Blocking is worthwhile if units within blocks are relatively homogeneous, apart from treatment effects, and units belonging to different blocks vary because of environmental or other differences. For example, blocks can be made up of any of these units:
■ Field plots experiencing similar local environmental conditions
■ Animals from the same litter
■ Aquaria located on the same side of the room
■ Patients attending the same clinic
■ Runs of an experiment executed on the same day

One potential drawback to blocking might occur if the effects of one treatment contaminate the effects of the other in the same block. For example, watering one half of a block might raise the soil humidity of the adjacent, unwatered half. Experiments should be designed carefully to minimize contamination.

Extreme treatments

Treatment effects are easiest to detect when they are large.
Small differences between treatments are difficult to detect and require larger samples, whereas larger treatment differences are more likely to stand out against random variability within treatments. Therefore, one strategy to enhance the probability of detecting differences in an experiment is to include extreme treatments. Example 14.4B shows why this might be.

EXAMPLE 14.4B Plastic hormones

Bisphenol-A, or BPA, is an estrogenic compound found in plastics widely used to line food and drink containers and in dental sealants. Human daily exposures are typically in the range of 0.5−1 µg/kg body weight (Gray et al. 2004). Sakaue et al. (2001) measured sperm production of 13-week-old male rats exposed to fixed daily doses of BPA between 0 and 2000 µg/kg body weight for six days. The results are shown in a dose-response curve in Figure 14.4-4.

FIGURE 14.4-4 A dose-response curve showing the results of an experiment measuring the rates of sperm production of male rats exposed to fixed daily doses of bisphenol-A (BPA) (Sakaue et al. 2001). Symbols are the mean ± 1 SE.

This experiment included doses much higher than the typical doses faced by humans at risk, a strategy that enhanced the ability to detect an effect of BPA. For example, Figure 14.4-4 shows that there was a much larger difference in mean sperm production between the 0 and 2000 µg/kg groups than between the 0 and 0.002 µg/kg treatments. If the experimenter were to design a study to compare just one of these doses with the control, using 200 or 2000 µg/kg would yield the most power, because they show the largest difference in sperm production from the control.

A larger dose, or stronger treatment, can increase the probability of detecting a response. But be aware that the effects of a treatment do not always scale linearly with the magnitude of a treatment. The effects of a large dose may be qualitatively different from those of a smaller, more realistic dose. Still, as a first step, extreme treatments can be a very good way to detect whether one variable has any effect at all on another variable.

Experiments with more than one factor

Up to now, we have considered only experiments that focus on measuring and testing the effects of a single factor. A factor is a single treatment variable whose effects are of interest to the researcher. However, many experiments in biology investigate more than one factor, because answering two questions from a single experiment rather than just one makes more efficient use of time, supplies, and other costs.

A factor is a single treatment variable whose effects are of interest to the researcher.

Another reason to consider experiments with multiple factors is that the factors might interact. When operating together, the factors might have synergistic or inhibitory effects not seen when each factor is tested alone. For example, human activity has driven global increases in atmospheric CO2 and temperature, as well as greater nitrogen deposition and precipitation. Increases in all of these factors have been shown to stimulate plant growth by experimental studies in which each treatment variable was examined separately. But what are the effects of these factors in combination? The only way to answer this is to design experiments in which more than one factor is manipulated simultaneously. If the climate variables interact when influencing plant growth, then their joint effects can be very different from their separate effects (Shaw et al. 2002).
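One way to make the idea of an interaction concrete is as a “difference of differences”: compute the effect of one factor separately at each level of the other, and compare. A sketch in Python with hypothetical cell means (the numbers are invented for illustration, not taken from any study discussed here):

# Hypothetical mean responses for the four treatment combinations of a
# 2 x 2 factorial design (factors A and B, each absent or present).
mean = {
    ("A absent", "B absent"): 0.90,
    ("A absent", "B present"): 0.85,
    ("A present", "B absent"): 0.88,
    ("A present", "B present"): 0.15,  # joint effect exceeds separate effects
}

effect_A_when_B_absent = mean[("A present", "B absent")] - mean[("A absent", "B absent")]
effect_A_when_B_present = mean[("A present", "B present")] - mean[("A absent", "B present")]

# The interaction is zero if the effect of A is the same at both levels of B;
# here it is far from zero.
interaction = effect_A_when_B_present - effect_A_when_B_absent
print(effect_A_when_B_absent, effect_A_when_B_present, interaction)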
The factorial design is the most common experimental design used to investigate more than one treatment variable, or factor, at the same time. In a factorial design, every combination of treatments from two (or more) treatment variables is investigated.

An experiment having a factorial design investigates all treatment combinations of two or more variables. A factorial design can measure interactions between treatment variables.

The main purpose of a factorial design is to evaluate possible interactions between variables. An interaction between two explanatory variables means that the effect of one variable on the response depends on the state of a second variable. Example 14.5 illustrates an interaction in a factorial design.

An interaction between two (or more) explanatory variables means that the effect of one variable depends upon the state of the other variable.

EXAMPLE 14.5 Lethal combination

Frog populations are declining everywhere, spawning research to identify the causes. Relyea (2003) looked at how a moderate dose (1.6 mg/l) of a commonly used pesticide, carbaryl (Sevin), affected bullfrog tadpole survival. In particular, the experiment asked how the effect of carbaryl depended on whether a native predator, the red-spotted newt, was also present. The newt was caged and could cause no direct harm, but it emitted visual and chemical cues that are known to affect tadpoles. The experiment was carried out in 10-liter tubs, each containing 10 tadpoles. The four combinations of pesticide treatment (carbaryl vs. water only) and predator treatment (present or absent) were randomly assigned to tubs. For each combination of treatments, there were four replicate tubs. The effects on tadpole survival are displayed in Figure 14.5-1.

FIGURE 14.5-1 Interaction between the effects of the pesticide (carbaryl) and predator (red-spotted newt) treatments on tadpole survival. Each point gives the fraction of tadpoles in a tub that survived. Lines connect mean survival in the two pesticide treatments, separately for each predator treatment. The tub, not the individual tadpole, is the experimental unit, because tadpoles sharing the same tub are not independent.

The results showed that survival was high, except when pesticide was applied together with the predator—neither treatment alone had much effect (Figure 14.5-1). Thus, the two treatments, predation and pesticide, seem to have interacted—that is, the effect of one variable depends on the state of the other variable. An experiment investigating the effects of the pesticide only would have measured little effect at this dose. Similarly, an experiment investigating the effect of the predator only would not have seen an effect on survival.

A factorial design can still be worthwhile even if there is no interaction between explanatory variables. In this case, there are efficiency advantages because the same experimental units can be used to measure the effect of two (or more) variables simultaneously.

What if you can’t do experiments?

Experimental studies are not always feasible, in which case we must fall back upon observational studies. Observational studies can be very important, because they detect patterns and can help generate hypotheses. The best observational studies incorporate all of the features of experimental design used to minimize bias (e.g., simultaneous controls and blinding) and the impact of sampling error (e.g., replication, balance, blocking, and even extreme treatments), except for one: randomization.
Randomization is out of the question because, in an observational study, the researcher does not assign treatments to subjects. Instead, the subjects come as they are. Match and adjust Without randomization, minimizing bias resulting from confounding variables is the greatest challenge of observational studies. Two types of strategies are used to limit the effects of confounding variables on a difference between treatments in a controlled observational study. One strategy, commonly used in epidemiological studies, is matching. With matching, every individual in the target group with a disease or other health condition is paired with a corresponding healthy individual who has the same measurements for known confounding variables, such as age, weight, sex, and ethnic background (Bland and Altman 1994). With matching, every individual in the treatment group is paired with a control individual having the same or closely similar values for the suspected confounding variables. Matching is often used when designing case-control studies. Recall from Chapter 9 that in a case-control study, exposure to one or more possible causal factors is compared between a sample of individuals having a disease (the cases) and a second sample of participants not having the disease (the controls). Matching ensures that the cases and controls are otherwise similar. For example, Dziekan et al. (2000) investigated possible causes of a hospital outbreak of antibiotic-resistant Staphylococcus. The 67 infected cases were each paired with a control individual matched for age, sex, hospital admission date, and admission department. Matching reduces bias by limiting the contribution of suspected confounding variables to differences between treatments. Unlike randomization in an experiment, matching in an observational study does not account for all confounding variables, only those explicitly used to match participants. Thus, while matching reduces bias, it does not eliminate bias. Matching also reduces sampling error by grouping experimental units into similar pairs, analogous to blocking in experimental studies. It is with such a matched case-control design that the link between smoking and lung cancer was convincingly demonstrated. In a weaker version of this approach, a comparison group is chosen that has a frequency distribution of measurements for each confounding variable that is similar to that of the treatment group, but no pairing takes place. For example, attention deficit/hyperactivity disorder (ADHD) is often treated with stimulants, such as amphetamines. Biederman et al. (2009) carried out an observational study to examine the psychiatric impacts later in life of stimulant treatment. A sample of ADHD youths who had been treated with stimulants was compared with a control sample of untreated ADHD individuals that was similar to the treated group in the distribution of ages, sex, ethnic background, sensorimotor function, other psychiatric conditions, and IQ. The second strategy used to limit the effects of confounding variables in a controlled observational study is adjustment, in which statistical methods such as analysis of covariance (Chapter 18) are used to correct for differences between treatment and control groups in suspected confounding variables. For example, LaCroix et al. (1996) compared the incidence of cardiovascular disease between two groups of older adults: those who walked more than four hours per week and those who walked less than one hour per week. 
The ages of the adults were not identical in the two groups, and this could affect the results. To compensate, the authors examined the relationship between cardiovascular disease and age within each exercise group, so that they could compare the predicted disease rates in the two groups for adults of the same age. This approach is discussed in more detail in Chapter 18.

Choosing a sample size
A key part of planning an experiment or observational study is to decide how many independent units or participants to include. There is no point in conducting a study whose sample size is too small to detect the expected treatment effect. Equally, there is no point in making an estimate if the confidence interval for the treatment effect is expected to be extremely broad because of a small sample size. Using too many participants is also undesirable, because each replicate costs time and money, and adding one more might put yet another individual in harm's way, depending on the study. If the treatment is unsafe, as the spermicide nonoxynol-9 appears to be (Example 14.2), then we want to injure as few people or animals as possible in coming to this conclusion. Ethics boards and animal-care committees require researchers to justify the sample sizes for proposed experiments. How is the decision made? Here in Section 14.7, we answer this question for two objectives: when the goal is to achieve a predetermined level of precision of an estimate of treatment effect, or when we want to achieve predetermined power in a test of the null hypothesis of no treatment effect. We focus here on techniques for studies that compare the means of two groups. Formulas to help plan experiments for some other kinds of data are given in the Quick Formula Summary (Section 14.9). An important part of planning an experiment or observational study is choosing a sample size that will give sufficient power or precision.

Plan for precision
A frequent goal of studies in biology is to estimate the magnitude of the treatment effect as precisely as possible. Planning for precision involves choosing a sample size that yields a confidence interval of expected width. Typically, we hope to set the bounds as narrowly as we can afford. By way of example, let's develop a plan for a two-treatment comparison of means. Let the unknown population mean of the response variable be µ1 in the treatment group of interest and µ2 in the control group. When the results are in, we will compute the sample means Y¯1 and Y¯2 and use them to calculate a 95% confidence interval for µ1 − µ2, the difference between the population means of the treatment and control groups. To simplify matters somewhat, we will assume that the sample sizes in both treatments are the same number, n. Let's also assume that the measurement in the two populations is normally distributed and has the same standard deviation, σ. In this case, a 95% confidence interval for µ1 − µ2 will take the form Y¯1 − Y¯2 ± margin of error, where "margin of error" is half the width of the confidence interval. Planning for precision involves deciding in advance how much uncertainty we can tolerate. Once we've decided that, then the sample size needed in each group is approximately n = 8(σ/(margin of error))². This formula is derived from the 2SE rule of thumb that was introduced in Section 4.3. According to this formula, a larger sample size is needed when σ, the standard deviation within groups, is large than when it is small.
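The algebra behind this approximation is a short sketch using the 2SE rule and the definitions above (equal sample sizes n and a common standard deviation σ):

$$
\mathrm{SE}_{\bar{Y}_1-\bar{Y}_2}=\sigma\sqrt{\frac{1}{n}+\frac{1}{n}}=\sigma\sqrt{\frac{2}{n}},
\qquad
\text{margin of error}\approx 2\,\mathrm{SE}=2\sigma\sqrt{\frac{2}{n}}.
$$

Squaring both sides and solving for n gives

$$
(\text{margin of error})^2 \approx \frac{8\sigma^2}{n}
\quad\Longrightarrow\quad
n \approx 8\left(\frac{\sigma}{\text{margin of error}}\right)^2.
$$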
Additionally, a larger sample size is needed to achieve a high precision (a narrow confidence interval) than to achieve a low precision. A major challenge in planning sample size is that key factors, like σ, are not known. Typically, a researcher makes an educated guess for these unknown parameters based on pilot studies or previous investigations. (If no information is available, then consider carrying out a small pilot study first, before attempting a large experiment.) For example, let's plan an experiment to measure the effect of diet on the eye span of male stalk-eyed flies (Example 11.2). The planned experiment will randomly place individual fly larvae into cups containing either corn or spinach. The target parameter is the difference between mean eye spans in the two diet treatments, µ1 − µ2. Assume that we would like to obtain a 95% confidence interval for this difference whose expected margin of error is 0.1 mm (i.e., the desired full width of the confidence interval is 0.2 mm). How many male flies should be used in each treatment to achieve this goal? Our sample estimate for σ was about 0.4, based on the sample of nine individuals in Example 11.2. Using these values gives n = 8(σ/(margin of error))² = 8(0.4/0.1)² = 128. This is the sample size in each treatment, so the total number of male flies would be 256. At this point, we would need to decide whether this sample size is feasible in an experiment. If not, then there might be no point in carrying out the experiment. Alternatively, we could revisit the desired width of the 95% confidence interval. That is, could we be satisfied with a higher margin of error? If so, then we should decide on this new width and then recalculate n. After all this planning, imagine that the experiment is run and we now have our data. Will the confidence interval we calculate have the precision we planned for? There are two reasons that it probably won't. First, 0.4 was just an educated guess for the value of σ to help our planning, and it was based on only nine individuals. The true value of σ in the population might be larger or smaller. Second, even if we were lucky and the true value of σ really is close to 0.4, the within-treatment standard deviation s from the experiment will not equal 0.4 because of sampling error. The resulting confidence interval will be narrower or wider accordingly. The probability that the width of the resulting confidence interval is less than or equal to the desired width is only about 0.5. To increase the probability of obtaining a confidence interval no wider than the desired interval width, we would need an even larger sample size. Figure 14.7-1 shows the general relationship between the expected precision of the 95% confidence interval and n, the sample size in each of two groups. The variable on the vertical axis is standardized as the margin of error divided by σ. The effect of sample size from n = 2 to n = 20 is shown.

FIGURE 14.7-1 Expected precision of the 95% confidence interval for the difference between two treatment means depending on sample size n in each treatment. The vertical axis is given in standardized units, (margin of error)/σ. We calculated the expected confidence interval using the t-distribution, rather than with the 2SE approximation.

The graph shows that very small sample sizes lead to very wide interval estimates of the difference between treatment means. More data give better precision.
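Both the planning formula and the t-based curve in Figure 14.7-1 are easy to reproduce. Here is a minimal sketch in Python (the function names are ours, not from any statistics package):

```python
from scipy import stats

def planned_n_precision(sigma, margin):
    """Approximate n per group for a 95% CI of a difference between
    two means, from the 2SE rule: n = 8 (sigma / margin)^2."""
    return 8 * (sigma / margin) ** 2

print(planned_n_precision(sigma=0.4, margin=0.1))  # 128 flies per diet

def standardized_margin(n):
    """(margin of error)/sigma for the 95% CI of a difference between
    two means, using the t-distribution as in Figure 14.7-1."""
    return stats.t.ppf(0.975, df=2 * n - 2) * (2 / n) ** 0.5

for n in [2, 10, 20, 50, 100, 200]:
    print(n, round(standardized_margin(n), 2))
```

With these inputs the standardized margin is about 0.64 at n = 20, 0.40 at n = 50, 0.28 at n = 100, and 0.20 at n = 200, essentially the values quoted below (the tiny difference at n = 20 reflects how the expected value of s is handled).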
Note also that interval precision initially declines rapidly with increasing sample size (e.g., from n = 2 to n = 10), but it then declines more slowly (e.g., from n = 10 to n = 20). Precision is 0.63 at n = 20, but it drops to 0.40 by n = 50, to 0.28 by n = 100, and to 0.20 by n = 200. Thus, we get diminishing returns by increasing the sample size past a certain point.

Plan for power
Next we consider choosing a sample size based on a desired probability of rejecting a false null hypothesis—that is, planning a sample based on a desired power. Imagine, for example, that we want to test the following hypotheses on the effect of diet on eye span in stalk-eyed flies. H0: µ1 − µ2 = 0. HA: µ1 − µ2 ≠ 0. The null hypothesis is that diet has no effect on mean eye span. The power of this test is the probability of rejecting H0 if it is false. Planning for power involves choosing a sample size that would have a high probability of rejecting H0 if the absolute value of the difference between the means, |µ1 − µ2|, is at least as great as a specified value D. The value for D won't be the true difference between the means; it is just the minimum we care about. By specifying a value for D in a sample size calculation, we are deciding that we aren't much interested in rejecting the null hypothesis of no difference if |µ1 − µ2| is smaller than D. A conventional power to aim for is 0.80. That is, if H0 is false, we aim to demonstrate that it is false in 80% of the experiments (the other 20% of experiments would fail to reject H0 even though it is false). If we aim for a power of 0.80 and a conventional significance level of α = 0.05, then a quick approximation to the planned sample size n in each of two groups is n ≈ 16(σ/D)² (Lehr 1992). This formula assumes that the two populations are normally distributed and have the same standard deviation (σ), which we are forced to assume is known. A more exact formula is provided in the Quick Formula Summary (Section 14.9), which also allows you to choose other values for power and significance level. For a given power and significance level, a larger sample size is needed when the standard deviation σ within groups is large, or if the minimum difference that we wish to detect is small. Let's return to our experiment to test the effect of diet on the eye span of male stalk-eyed flies. We would like to reject H0 at α = 0.05 with probability 0.80 if the absolute value of the difference between means were truly D = |µ1 − µ2| = 0.2 mm. How many males should be used in each treatment? Let's assume again that σ = 0.4. Using this value in the equation for power gives n = 16(0.4/0.2)² = 64. This is the number in each treatment, so the total number of males needed in the experiment would be 128. These power calculations assume that we know the standard deviation (σ), which is stretching the truth. For this and other reasons, we must always view the results of power calculations with a great deal of caution. The calculations provide useful guidelines, but they do not give infallible answers. We have explored only the sample sizes needed to compare the means of two groups, but similar methods are available for other kinds of statistical comparisons as well. Sample sizes for desired precision and power are available for one- and two-sample means, proportions, and odds ratios in the Quick Formula Summary (Section 14.9). A variety of computer programs are available to calculate sample sizes when planning for power and precision.
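Before turning to dedicated software, the quick approximation itself is a one-line calculation, and the power it claims is easy to check by simulation. A minimal sketch (our own function name; the 10,000-replicate simulation is illustrative only):

```python
import numpy as np
from scipy import stats

def planned_n_power(sigma, D):
    """Approximate n per group for 80% power at alpha = 0.05
    (Lehr 1992): n = 16 (sigma / D)^2."""
    return 16 * (sigma / D) ** 2

n = int(planned_n_power(sigma=0.4, D=0.2))
print(n)  # 64 flies per diet treatment

# Simulation check: how often does a two-sample t-test reject H0
# when the true difference between means equals D?
rng = np.random.default_rng(1)
rejections = sum(
    stats.ttest_ind(rng.normal(0.0, 0.4, n),
                    rng.normal(0.2, 0.4, n)).pvalue <= 0.05
    for _ in range(10_000)
)
print(rejections / 10_000)  # close to the planned power of 0.80
```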
A good place to start investigating these programs is http://www.divms.uiowa.edu/~rlenth/Power/.

Plan for data loss
The methods given here in Section 14.7 for planning sample sizes refer to sample sizes still available at the end of the experiment. But some experimental individuals may die, leave the study, or be lost between the start and the end of the study. The starting sample sizes should be made even larger to compensate.

Summary
■ In an experimental study, the researcher assigns treatments to subjects.
■ The purpose of an experimental study is to examine the causal relationship between an explanatory variable, such as treatment, and a response variable. The virtue of experiments is that the effect of treatment can be isolated by randomizing the effects of confounding variables.
■ A confounding variable masks or distorts the causal relationship between an explanatory variable and a response variable in a study.
■ A clinical trial is an experimental study involving human participants.
■ Experiments should be designed to minimize bias and limit the effects of sampling error.
■ Bias in experimental studies is reduced by the use of controls, by randomizing the assignment of treatments to experimental units, and by blinding.
■ In a completely randomized experiment, treatments are assigned to experimental units by randomization. Randomization reduces the bias caused by confounding variables by making nonexperimental variables equal (on average) between treatments.
■ The effect of sampling error in experimental studies is reduced by replication, by blocking, and by balanced designs.
■ A randomized block design is like a paired design but for more than two treatments.
■ The use of extreme treatments can increase the power of the experiment to detect a treatment effect.
■ Observational studies should employ as many of the strategies of experimental studies as possible to minimize bias and limit the effect of sampling error.
■ Although randomization is not possible in observational studies, the effects of confounding variables can be reduced by matching and by adjusting for differences between treatments in known confounding variables.
■ A factorial design is used to investigate the interaction between two or more treatment variables. The factorial design includes all possible combinations of the treatment variables.
■ When planning an experiment, the number of experimental units to include can be chosen so as to achieve the desired width of confidence interval for the difference between treatment means.
■ Alternatively, the number of experimental units to include when planning an experiment can be chosen so that the probability of rejecting a false H0 (power) is high for a specified magnitude of the difference between treatment means.
■ Compensate for possible data loss when planning sample sizes for an experiment.

Quick Formula Summary
Planning for precision
Planned sample size for a 95% confidence interval of a proportion
What is it for? To set the sample size of a planned experiment to achieve approximately a specified half-width ("margin of error") of a 95% confidence interval for a proportion p.
What does it assume? The population proportion p is not close to zero or one, and n is large.
Formula: n ≈ 4p(1 − p)/(margin of error)², where p is the proportion being estimated and "margin of error" is the half-width of the confidence interval for the proportion p. For the most conservative scenario, set p = 0.50 when calculating n. The symbol ≈ stands for "is approximately equal to."
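As a quick illustration of this first formula, here is a one-function sketch (the function name is ours):

```python
def planned_n_proportion(p, margin):
    """n for a 95% CI of a proportion: n = 4 p (1 - p) / margin^2."""
    return 4 * p * (1 - p) / margin ** 2

# Most conservative case (p = 0.50) with a margin of error of 0.05:
print(planned_n_proportion(0.5, 0.05))  # 400.0
```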
Planned sample size for a 95% confidence interval of a log-odds ratio
What is it for? To set the sample size n in each of two groups for a planned experiment to achieve approximately a specified half-width ("margin of error") of a 95% confidence interval for a log-odds ratio, ln(OR).
What does it assume? Sample size n is the same in both groups.
Formula: n ≈ [4/(margin of error)²][1/p1 + 1/(1 − p1) + 1/p2 + 1/(1 − p2)], where "margin of error" is the half-width of the confidence interval for ln(OR), and p1 and p2 are the probabilities of success in the two treatment groups.

Planned sample size for a 95% confidence interval of the difference between two proportions
What is it for? To set the sample size n in each of two groups for a planned experiment to achieve approximately a specified half-width ("margin of error") of a 95% confidence interval for a difference between two proportions, p1 − p2. This is an alternative approach to the one that uses a log-odds ratio to compare the proportion of successes in two treatment groups.
What does it assume? Sample size n is the same in both groups.
Formula: n ≈ 8p¯(1 − p¯)/(margin of error)², where "margin of error" is the half-width of the confidence interval for p1 − p2, p1 and p2 are the probabilities of success in the two treatment groups, and p¯ is the average of the two proportions—that is, p¯ = (p1 + p2)/2.

Planned sample size for a 95% confidence interval of the mean
What is it for? To set the sample size of a planned experiment to achieve approximately a specified half-width ("margin of error") of a 95% confidence interval for a mean µ.
What does it assume? The population is normally distributed with known standard deviation σ.
Formula: n ≈ 4(σ/(margin of error))², where n is the planned sample size, and "margin of error" is the half-width of the confidence interval for the mean µ.

Planned sample size for a 95% confidence interval of the difference between two means
What is it for? To set the sample size of a planned experiment so as to achieve approximately a specified half-width ("margin of error") of the 95% confidence interval for µ1 − µ2.
What does it assume? Populations are normally distributed with equal standard deviations σ. The value of σ is known. Sample size n is the same in both groups.
Formula: n ≈ 8(σ/(margin of error))², where n is the planned sample size within each group, and "margin of error" is the half-width of the confidence interval for the difference between means.

Planning for power
Planned sample size for a binomial test of 80% power at α = 0.05
What is it for? To set the sample size n of a planned experiment to achieve approximately a power of 0.80 in a binomial test at α = 0.05.
What does it assume? The proportion p0 under the null hypothesis is not close to zero or one, and n is not small.
Formula: n ≈ 8p0(1 − p0)/D², where p0 is the proportion under the null hypothesis, and D = p − p0 is the predetermined difference we wish to be able to detect between the population parameter p and that specified under the null hypothesis.

Planned sample size for a 2 × 2 contingency test of 80% power at α = 0.05
What is it for? To set the sample size of a planned experiment so as to achieve approximately a power of 0.80 at α = 0.05 in a contingency test of the difference between the proportion of successes in two treatment groups (or, equivalently, a test that the odds ratio equals one).
What does it assume? The average probability of success in the two treatment groups is known.
Sample size n is the same in both treatment groups.
Formula: n ≈ 8p¯(1 − p¯)/D², where p¯ is the average of the two probabilities of success [i.e., p¯ = (p1 + p2)/2], and D = p1 − p2 is the predetermined difference we wish to be able to detect between the two proportions.

Planned sample size for a one-sample or paired t-test of 80% power at α = 0.05
What is it for? To set the sample size of a planned experiment so as to achieve approximately a power of 0.80 in a one-sample or paired t-test at α = 0.05.
What does it assume? The population is normally distributed with standard deviation σ. The value of σ is known.
Formula: n ≈ 8(σ/D)², where n is the planned sample size (the number of pairs, in the case of a paired test), and D = µ is the predetermined value of the mean (or the mean difference in the case of a paired test) that we wish to detect.

Planned sample size for a two-sample t-test of 80% power at α = 0.05
What is it for? To set the sample size of a planned experiment so as to achieve approximately a power of 0.80 in a two-sample t-test at α = 0.05.
What does it assume? Populations are normally distributed with equal standard deviation σ. Sample size n is the same in both groups.
Formula: n ≈ 16(σ/D)², where n is the sample size within each group, and D = |µ1 − µ2| is the predetermined difference between means we wish to detect.

PRACTICE PROBLEMS
1. Identify which goal of experimental design (i.e., reducing bias or limiting sampling error) is aided by the following procedures: a. Using a genetically uniform animal stock to test treatment effects b. Using a completely randomized design c. Grouping related experimental units together d. Taking the response measurements while unaware of the treatments assigned to experimental units e. Using a computer to randomly assign treatments to experimental units within each block
2. Using a coin toss for each unit, assign two hypothetical treatments to eight experimental units. a. Write the sequence of eight assignments you ended up with. b. Did you end up with an equal number of units in each treatment? c. What is the probability of an unbalanced design using this approach? d. Recommend a procedure for randomly assigning treatments to units that always results in a balanced design.
3. A series of plots were placed in a large agricultural field in preparation for an experiment to investigate the effects of three fertilizers differing in their chemical composition. Before assigning treatments, it was noticed that plots differed along a moisture gradient. What strategy would you suggest the researchers implement to minimize the impact of this gradient on the ability to measure a treatment effect? Explain with an illustration the experimental design you would recommend.
4. You read the following statement in a journal article: "On the basis of an alpha level of 0.05 and a power of 80%, the planned sample size was 129 subjects in each treatment group." State in plain language what this means.
5. Example 12.4 described a study in which salmon were introduced to 12 streams with and without brook trout to investigate the effect of brook trout on salmon survival. Is this an experimental study or an observational study? Explain the basis for your reasoning.
6. Identify the consequences (i.e., increase, decrease, or none) that the following procedures are likely to have on both bias and sampling error in an observational study. a. Matching sampling units between treatment and control b. Increasing sample size c. Ensuring that the frequency distribution of subject ages is the same in the two treatments d.
Using a balanced design
7. In 1899, the British Medical Journal (page 933) reported the results of a medical procedure involving the subcutaneous infusion of a salt solution for the treatment of extremely severe pneumonia: "Dr. Clement Penrose has tried the effect of subcutaneous salt infusions as a last extremity in severe cases of pneumonia. He continues this treatment with inhalations of oxygen. He has had experience of three cases, all considered hopeless, and succeeded in saving one. In the other two the prolongation of life and the relief of symptoms were so marked that Dr. Penrose regretted that the treatment had not been employed earlier." a. Is this an experimental study? Why or why not? b. What design components might Dr. Penrose have included in an experiment to test the effectiveness of his treatment?
8. In a study of the effects of marijuana on the risk of cancer in oral squamous cells, Rosenblatt et al. (2004) examined 407 recent cases of the cancer from western Washington state. They also randomly sampled 615 healthy people from the same region having similar frequency distributions of age and sex as the cancer cases. They found that a similar proportion of the cancer cases (25.6%) and healthy participants (24.4%) reported ever having used marijuana (odds ratio = 0.9; 95% confidence interval, 0.6 < OR < 1.3). a. What name is given to this type of study? Is it an experimental study or an observational study? Explain. b. Does this study include a control group? Explain. c. What was the purpose of ensuring that the healthy participants were similar in age and sex to the cancer cases? d. Can we conclude that marijuana does not cause cancer in oral squamous cells in this population?
9. After stinging its victim, the honeybee leaves behind the barbed stinger, poison sac, and muscles that continue to pump venom into the wound. Visscher et al. (1996) compared the effects of two methods of removing the stinger left behind: scraping off with a credit card or pinching off with thumb and index finger. A total of 40 stings were induced on volunteers. Twenty were removed with the credit card method, and 20 were removed with the pinching method. The size of the subsequent welt by each sting was measured after 10 minutes. All 40 measurements came from two volunteers (both authors of the study), each of whom received one treatment 10 times on one arm and the other treatment 10 times on the other arm. Pinching led to a slightly smaller average welt, but the difference between methods was not significant. a. All 40 measurements were combined to estimate means, standard errors, and the P-value for a two-sample t-test of the difference between treatment means. What is wrong with this approach? b. How should the data be analyzed? Describe how the quantities would be calculated and what type of statistical test would be used. c. Suggest two improvements to the experimental design.
10. What is the justification for including extreme doses well outside the range of exposures encountered by people at risk in a dose–response study on animals of the effects of a hazardous substance? What are the problems with this approach?
11. A strain of sweet corn has been genetically modified with a gene from the bacterium Bacillus thuringiensis (Bt) to express the protein Cry1Ab, which is toxic to caterpillars that eat the leaves. Unfortunately, the pollen of transformed corn plants contains the toxin, too.
Corn pollen dusts the leaves of other plants growing nearby, where it might have negative effects on non-pest caterpillars. You are hired to conduct a study to measure the effects on monarch butterfly caterpillars of ingesting Bt-modified pollen that has landed on the leaves of milkweed, a plant commonly growing in or near cornfields. You decide to use a completely randomized design to compare the effect of two treatments on monarch pupal weight. In one treatment, you place potted milkweed plants in plots of Bt-modified corn, where their leaves receive pollen carrying the toxin. In the other treatment, you place milkweed plants in plots with ordinary corn that has not been transformed with the Bt gene. You place a monarch larva on every milkweed plant. Previous studies have estimated that the standard deviation of pupal weight in monarch butterflies is about 0.25 g. a. Suppose your goal at the end of the experiment is to calculate a 95% confidence interval for the difference between treatments in mean monarch pupal weights. How many plots would you plan in each treatment if your goal was to produce a confidence interval for the difference in mean pupal weights between treatments having a total width of 0.4 g? b. What sample size would you need if you decided that 0.4 was not precise enough, and that you wished to halve this interval to 0.2? c. Imagine that your permits allow you to plant only five plots of Bt-transformed corn, so that the only way you can increase the total sample size for the whole experiment is to increase the number of plots in the ordinary corn treatment. To achieve the same width of confidence interval as in part (a), would the total sample size needed (both treatments combined) likely be greater, smaller, or no different from that calculated in part (a)? Explain. d. In designing the experiment, why would you not simply place all the milkweed plants for one treatment at random locations in a single large Bt-transformed corn field, and all the milkweed plants for the other treatment at random locations in a single large normal corn field?
12. In the Bt and monarch study described in Practice Problem 11, how many plots would you plan per treatment if your goal were to carry out a test having 80% power to reject the null hypothesis of no treatment effects when the difference between treatment means is at least 0.25 g?
13. Consider the results of a six-year observational study that documented health changes related to homeopathic care (Spence and Thompson 2005). Homeopathic treatment was defined as "stimulating the body's autoregulatory mechanisms using microdoses of toxins." Every one of the 6544 patients in the study was assigned to a hospital outpatient unit for homeopathic treatment. Of these, 4627 patients (70.7%) reported positive health changes following treatment. Suggest a major improvement to the design of this study.
14. The fish species Astyanax mexicanus includes blind, cave-inhabiting populations whose eyes degenerate during embryonic development. To understand how eye degeneration worked, Yamamoto and Jeffery (2000) replaced the lens of the degenerate eye on one side (randomly chosen) of a blind cave fish embryo with a lens from the embryo of a "normal," sighted fish. This procedure was repeated on all individuals in a sample of blind cave fish. Final eye size was measured on both sides of each experimental fish, after embryonic development was complete.
Remarkably, a normal-sized eye was restored on the transplant side of the blind cave fish but not on the unmanipulated side. Based on the preceding description of a laboratory experiment, identify which of the six main strategies of experimental design (listed in Section 14.2) were incorporated.
15. Blaustein et al. (1997) used a field experiment to investigate whether increased UV-B radiation was a cause of amphibian deformities (see the photo at the beginning of this chapter). They measured long-toed salamanders either exposed to or shielded from natural UV-B radiation. It was not possible to carry out all replicates simultaneously, so the researchers carried them out over several days. They made sure that both treatments were included on each day. In their analysis, they grouped replicates together that were carried out on the same day. a. By grouping experiments carried out on the same day, what experimental procedure were they using? b. What is the main reason for adopting this procedure in an experimental study?
16. In 1976, Ewan Cameron and Linus Pauling (the only person to have won two unshared Nobel Prizes) published a paper showing that vitamin C was an effective treatment for some kinds of cancer. They measured the life spans of a sample of 100 patients who were given extra doses of vitamin C. As a control, they pulled the records of several hundred patients from the same clinic who had died from the same types of terminal cancer, and who were matched to the vitamin C patients for their age, sex, and type of cancer. They found that the patients with extra vitamin C lived on average 2.7 times longer than the controls. A later study by Moertel et al. (1985) randomly assigned two treatments to cancer patients, supplemental vitamin C and control, and followed the patients with a double-blind study. This later study found no difference between the two groups for their life spans. a. Give plausible reasons why the two studies might have found different results. b. From the information given, which study is expected to give the most reliable results? Why?

ASSIGNMENT PROBLEMS
17. Identify the consequences (i.e., increase, decrease, or none) that the following procedures are likely to have on both bias and sampling error in an experimental study. a. Assigning treatment to subjects alphabetically, not randomly b. Increasing sample size c. Calculating power d. Applying every treatment to every experimental unit in random order e. Using a sample of convenience instead of a random sample f. Testing only one treatment group, without a control group g. Using a balanced design h. Informing the human participants which treatment they will receive
18. The experiment described in Example 12.2 compared antibody production in 13 male red-winged blackbirds before and after testosterone implants. The units of antibody levels were natural log of 10−3 optical densities per minute (ln[mOD/min]). The mean change in antibody production was d¯ = 0.056, and the standard deviation was sd = 0.159. If you were assigned the task of repeating this experiment to test the hypothesis that testosterone changed antibody levels, what sample size (i.e., number of blackbirds) would you plan to ensure that a mean change of 0.05 units could be detected with probability 0.8? Explain the steps you took to determine this value.
19. Two clinical trials were designed to test the effectiveness of laser treatment for acne. Seaton et al. (2003) randomly divided participants into two groups.
One group received the laser treatment, whereas the other group received a sham treatment. Orringer et al. (2004) used an alternative design in which laser treatment was applied to one side of the face, randomly chosen, and the sham treatment was applied to the other side. The number of facial lesions was the response variable. a. Identify the main component of experimental design that differs between the two studies. Give the statistical term identifying this component in experimental design. b. Under what circumstances would there be an advantage to using the "divided-face" design over the completely randomized (two-sample) design? c. Assuming that the advantage identified in part (b) is met, can you think of a disadvantage of the divided-face design?
20. Identify the consequences (i.e., increase, decrease, or none) that the following procedures are likely to have on both bias and sampling error in an observational study. a. Planning for data loss b. Taking measurements of the subjects while unaware of which subjects belong to which group c. Including only one sex and age group in the study d. Adjusting for body size using analysis of covariance
21. Identify which goal of experimental design (i.e., reducing bias or limiting effects of sampling error) is aided by the following procedures: a. Including extreme treatment levels b. Using a paired design c. Keeping room temperature constant in an experiment designed to test the effects of a pesticide on insect survival d. Eliminating artifacts when designing the treatment of interest e. Adding a sham operation group
22. Identify the particular feature that defines each of the following experimental designs, and list the specific advantages provided by the feature you identify. a. Factorial design b. Randomized block design c. Completely randomized design
23. Kirsch (2010) argues that in double-blind clinical trials to test the effects of antidepressants, a large fraction of patients figure out whether they have been given the antidepressant or the placebo by noticing the presence or absence of known side effects of the antidepressant. Doctors evaluating the patients are also able to determine which treatment patients are receiving. How might this situation affect the results of the clinical trial? Specifically, is the treatment effect (difference between the means of the antidepressant and placebo treatments) likely to be overestimated, underestimated, or unaffected by this knowledge?
24. Michalsen et al. (2003) conducted a study to examine the effects of "leech therapy" for pain resulting from osteoarthritis of the knee. Two treatments were randomly assigned to 51 patients with osteoarthritis of the knee. Patients in the leech treatment received 4-6 medicinal leeches applied to the soft tissue of the affected knee in a single session. The animals were left to feed ad libitum until they detached themselves, on average 70 minutes later. Patients in the control treatment were given diclofenac gel and were told to apply it twice daily to the affected area. Pain was assessed by a questionnaire given by personnel unaware of the treatments applied to each patient. The results showed that seven days after the start of treatment, pain was significantly lower in the leech group. a. Does this study include a control group? Explain. b. Is this an experimental study or an observational study? Explain. c. Is this a completely randomized design or a randomized block design? Explain. d. Which strategy for reducing bias was not adopted in this study?
How might its absence have affected the results?
25. Design a study to compare the reaction times of the left and right hands of right-handed people using a computer mouse. Two design choices are available to you. In the first, a sample of right-handed participants is randomly divided into two groups. Reaction time with the left hand is measured in one group, and reaction time with the right hand is tested in the other group. a. What is the second design choice available to you? b. Under what circumstances would the second choice be the preferred choice? c. Assume that you decide to go with the completely randomized design and that, at the end of the experiment, you aim to calculate a 95% confidence interval for the difference between the mean reaction times of left and right hands. To achieve a confidence interval of specified width, what information would you require to plan an appropriate sample size?
26. Identify the single most significant flaw in each of the following experimental designs. Use statistical language to identify what's missing. a. In a test of the effectiveness of acupuncture in treating migraine headaches, a random sample of patients at a migraine clinic were provided with a novel acupuncture treatment daily for six months. The patients were interviewed at the start of the study and at the end to determine whether there had been any change in the severity of their migraines. b. In a modified study, a second sample of patients were chosen after the acupuncture treatment was completed on the first set of patients. This second sample of patients received a placebo in pill form for six months. At the end of the study, perceived pain levels in the two groups were compared. c. In a modified study, the sample of patients was divided into two groups according to gender. The women received the acupuncture treatment, and the men received the placebo medication in pill form. At the end of the experiment, perceived pain levels in the two groups were compared. d. In a modified study, patients were randomly divided into two groups. One group received the acupuncture treatment, and the other received a fixed dose of placebo medication in pill form. At the end of the experiment, perceived pain levels in the two groups were compared.
27. Young et al. (2006) took measurements of subordinate female meerkats to determine the changes in reproductive physiology experienced by females that are evicted from their social groups. They compared evicted females and those not evicted in their level of plasma luteinizing hormone following a GnRH hormone challenge. They found that nine evicted females had a sample average of 6.2 mIU/ml (milli-International Units per milliliter) of plasma luteinizing hormone compared with 12.1 mIU/ml in 18 females that had not been evicted. The pooled sample variance was 28.4. a. Is this an experimental study or an observational study? Explain. b. The sample size was unequal between the two groups of females compared. How would this affect the power of a hypothesis test of the difference between group means compared with a more balanced design? Explain. c. How would the imbalance of the sample sizes affect the width of the confidence interval for the difference between group means compared with a more balanced design? Explain. d.
If you were planning to repeat the comparison of plasma luteinizing hormone between these two groups of females, what sample size would you plan to achieve an expected half-width of 3 mIU/ml for a 95% confidence interval of the difference between means? Explain the steps you took to determine this value.
28. Diet restriction is known to extend life and reduce the occurrence of age-related diseases. To understand the mechanism better, you propose to carry out a study to look at the separate effects of age and diet restriction, and the interaction between age and diet restriction, on the activity of liver cells in rats. What experimental design should you consider employing? Why?

INTERLEAF 8 Data dredging
In the spoof journal Annals of Improbable Research, a satirical article reported on a study of the so-called butterfly effect (Inaudi et al. 1995). This effect, a mainstay of the popular representation of chaos theory, says that small initial causes, like the flapping of a butterfly's wings, can ultimately have large effects, like a hurricane, on the other side of the world. The fearless researchers set out to measure this effect by capturing several dozen butterflies and holding them in captivity in Switzerland. Each day, they checked the butterflies and recorded whether or not they flapped their wings. Then, using the lab's phone, they called their girlfriends in Paris each day to ask whether or not it was raining. At the end of the study, the students tested each butterfly for an association between its daily flapping behavior and the daily weather in Paris. They found that the flapping days of one of the butterflies closely matched the rainfall days in Paris (P < 0.05). They exulted, "Not only have we proven that the butterfly effect exists, we have found the butterfly." These guys were clearly joking, but statistically speaking, where did they go wrong? The answer is that they went "data dredging." They performed many statistical tests and eventually one of them was significant. Data dredging (also called "data snooping" or "data fishing") is the carrying out of many statistical tests in hope of finding at least one statistically significant result. The problem with data dredging is that the probability of making at least one Type I error (i.e., of obtaining a false positive) is greater than the significance level α when many tests are conducted, if the null hypothesis is true (as it surely is in the butterfly example). Each hypothesis test has some chance of error, and these errors are compounded over multiple tests. There is a much larger probability of getting an error out of several tries than in any one try. By analogy, we might get away with playing Russian roulette once, but we would be unlikely to survive a month of playing once a night. It's useful to do a few calculations to see how big the problem might be. The probability of making no Type I errors in N independent tests is (1 − α)^N. Thus, the chance of making at least one Type I error from N independent tests is 1 − (1 − α)^N. This means that, if we use α = 0.05 and carry out 20 independent tests of true null hypotheses, the probability that at least one of these tests will falsely reject the null hypothesis is about 64%. If we carry out 100 tests, then the chance of rejecting at least one of the null hypotheses becomes 99.4%, even if all the null hypotheses are true. With data dredging, a false positive result is almost inevitable. Nevertheless, multiple testing is common in biology, and for good reasons.
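These numbers are quick to verify with a few lines of code:

```python
# Probability of at least one false positive (Type I error) among
# N independent tests of true null hypotheses: 1 - (1 - alpha)^N.
alpha = 0.05
for N in [1, 20, 100]:
    print(N, round(1 - (1 - alpha) ** N, 3))
# Prints 0.05 for N = 1, 0.642 (about 64%) for N = 20,
# and 0.994 (99.4%) for N = 100.
```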
A dedicated experimentalist studying human participants might measure many conceivable responses (e.g., blood pressure, body temperature, red blood cell count, white blood cell count, speed of recovery, appetite, and weight change) and perhaps even a few extra variables that might be long shots. The result is that the clinician might end up carrying out 10 or 20 tests of treatment effects, raising the probability of a false positive result. This level of multiple testing pales next to that seen in gene mapping. Locating a gene for a single trait, such as a genetic disease, typically involves thousands of statistical tests (one for each section of the genome). What should be done about the soaring Type I error rates resulting from so much testing? The answer to this question depends on your goals. If your goal is simply to explore the data, to discover the possibilities but not to provide rigorous tests, then you need do nothing special about multiple testing except report the number of tests that you carried out and note which ones yielded a significant result. If you admit that you dredged the data, your results can still be useful. New hypotheses and unexpected discoveries can emerge from a thorough fishing expedition. However, the individual significant results that pop up from data dredging cannot yet be taken seriously, due to the high probability of one or more Type I errors. Some of the significant results might indeed be real, but it will be difficult to establish which ones. Rather, a new study must be carried out with new data to test any promising results that emerged from the exploratory approach. Another strategy sometimes used when exploring data is to divide the data randomly into two independent parts. One part is used for data dredging, and the other part is used to confirm any positive results suggested by the dredging. If your goals from multiple testing are more rigorous (e.g., you want to determine which variable really did respond to treatment in a clinical trial, or which location in the genome really does contain a gene for a heritable disease), then steps must be taken to correct for the inflation of Type I error rates that occurs with multiple testing. The simplest way to accomplish this is to use a more stringent significance level—that is, one smaller than the usual α = 0.05. The most common correction for multiple comparisons is the Bonferroni correction. In the simplest version of this method, each test uses a significance level α* rather than α, where α* = α/(number of tests). For example, if we typically adopt the significance level α = 0.05 when carrying out a single test, then to carry out 12 separate tests we should use the significance level α* = 0.05/12 = 0.00417 instead. In this case, we would reject H0 in each test only if P were less than or equal to 0.00417. With the Bonferroni correction, the probability of getting at least one Type I error during the course of carrying out all 12 tests is approximately equal to the initial α-value (i.e., 0.05 in this case). Keep in mind, though, that applying the Bonferroni correction greatly reduces the power of single tests. This is the price paid for asking many questions of the data. More than ever, we should be mindful not to "accept the null hypothesis." It is okay to be skeptical when a null hypothesis is not rejected and power is so limited, but there is little to do about it except to repeat the study and look again.
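In code, the correction is a one-liner. A minimal sketch with hypothetical P-values (not from any real study):

```python
# Bonferroni correction: test each P-value against alpha* = alpha / m,
# where m is the number of tests.
p_values = [0.001, 0.004, 0.019, 0.030, 0.200]  # hypothetical
alpha = 0.05
alpha_star = alpha / len(p_values)  # 0.01 for m = 5 tests
for p in p_values:
    verdict = "reject H0" if p <= alpha_star else "do not reject H0"
    print(p, verdict)
# Only P = 0.001 and P = 0.004 survive the correction; P = 0.019 and
# P = 0.030 would have been "significant" without it.
```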
Another, increasingly popular approach to correct for multiple comparisons is called the false discovery rate (FDR). To use this approach, carry out all of the multiple tests at a fixed significance level α (e.g., the usual 0.05). Gather all of the tests that yield a statistically significant result (i.e., all of the tests for which P ≤ α). We can call this subset of tests the "discoveries." The FDR estimates the proportion of discoveries that are false positives. In other words, the FDR is the proportion of tests for which the null hypothesis was rejected yet the null hypothesis was true. For example, Brem et al. (2005) carried out hundreds of statistical hypothesis tests of interactions between pairs of yeast genes. Of these tests, 225 yielded a statistically significant result (the "discoveries"). Using the false discovery rate method, they estimated that 12 of these 225 tests were false positives, leaving 213 "true" discoveries. An extension of the FDR calculates a quantity called the q-value for each discovery. The q-value is analogous to a P-value, providing a measure of the strength of support from the data that the null hypothesis is false in a specific test. The smaller the q-value, the stronger is the evidence that H0 is false and should be rejected. Unlike the P-value, the q-value takes into account other tests carried out at the same time. The idea is that, by choosing to reject H0 only if the q-value is 0.05 or less, only about 5% of the resulting discoveries are expected to be false rejections of the null hypothesis. FDR and q-values are a more powerful approach to dealing with multiple comparisons, and we expect their use to increase in biological applications over the next decade. Consult Benjamini and Hochberg (1995) or Storey and Tibshirani (2003) for more details.

15 Comparing means of more than two groups
How would we analyze the results of a clinical trial that randomly assigns not two but three treatments to a sample of patients? Two of the treatments might be different medications and the third a placebo control. Such a design can answer more questions than a two-treatment experiment because more comparisons can be made in the same experiment. For example, are both medications better than the placebo? If they are, then by how much? Is one medication superior to the other? If so, how much better is it? How do we compare the means of the three groups? At first glance, it might seem reasonable to compare them two at a time: first compare the means of groups 1 and 2, then compare groups 2 and 3, and finally, compare groups 1 and 3. This analysis-by-twos quickly runs into problems, because testing multiple pairs of means inflates the probability of committing at least one Type I error (recall the data dredging discussed in Interleaf 8). The danger is modest when comparing only three groups, but it escalates rapidly with an increasing number of groups. Comparing five groups would require 10 tests, which would give as much as a 40% chance of falsely rejecting at least one of those null hypotheses if they were all true. The best solution is the analysis of variance, or ANOVA, which compares means of multiple groups simultaneously in a single analysis. Analysis of variance might seem like a misnomer, given our intention to compare means, but testing for variation among groups is equivalent to asking whether the means differ. ANOVA was originally developed by the biologist and statistician R. A. Fisher, who was first mentioned in Interleaf 1.
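The 40% figure quoted above comes from the same familywise calculation as in Interleaf 8, applied to all pairwise comparisons among k groups. A quick sketch (treating the pairwise tests as if they were independent, which is why the text says "as much as"):

```python
from math import comb

alpha = 0.05
for k in [3, 5, 10]:
    n_tests = comb(k, 2)  # number of pairwise comparisons among k groups
    print(k, n_tests, round(1 - (1 - alpha) ** n_tests, 2))
# 3 groups  -> 3 tests  -> 0.14
# 5 groups  -> 10 tests -> 0.4
# 10 groups -> 45 tests -> 0.9
```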
This chapter discusses one-way or single-factor analysis of variance, which investigates the means of several groups differing by one explanatory variable or factor. Two-way or two-factor ANOVA is discussed in Chapter 18.

The analysis of variance
Analysis of variance is the most powerful approach known for simultaneously testing whether the means of k groups are equal. It works by assessing whether individuals chosen from different groups are, on average, more different than individuals chosen from the same group. Example 15.1 introduces the method.

EXAMPLE 15.1 The knees who say night
Traveling to a different time zone can cause jet lag, but people adjust as the schedule of light to their eyes in the new time zone gradually resets their internal, circadian clock. This change in their internal clock is called a phase shift. Campbell and Murphy (1998) reported that the human circadian clock can also be reset by exposing the back of the knee to light, a finding met with skepticism by some, but hailed as a major discovery by others. Aspects of the experimental design were subsequently challenged. The data in Table 15.1-1 are from a later experiment by Wright and Czeisler (2002) that re-examined the phenomenon. The new experiment measured circadian rhythm by the daily cycle of melatonin production in 22 people randomly assigned to one of three light treatments. Participants were awakened from sleep and subjected to a single three-hour episode of bright lights applied to the eyes only, to the knees only, or to neither (the control group). Effects of treatment on the circadian rhythm were measured two days later by the magnitude of phase shift in each participant's daily cycle of melatonin production. Results are plotted in Figure 15.1-1. A negative measurement indicates a delay in melatonin production, which is the predicted effect of light treatment; a positive number indicates an advance. Does light treatment affect phase shift?

TABLE 15.1-1 Raw data and descriptive statistics of phase shift, in hours, for the circadian rhythm experiment.
Treatment  Data (h)                                               Y¯        s       n
Control    0.53, 0.36, 0.20, −0.37, −0.60, −0.64, −0.68, −1.27    −0.3088   0.6176  8
Knees      0.73, 0.31, 0.03, −0.29, −0.56, −0.96, −1.61           −0.3357   0.7908  7
Eyes       −0.78, −0.86, −1.35, −1.48, −1.52, −2.04, −2.83        −1.5514   0.7063  7

FIGURE 15.1-1 Strip chart showing the phase shift in the circadian rhythm of melatonin production in 22 experimental participants given alternative light treatments (open circles). Filled dots and vertical lines (error bars) are group means ±1 standard error.

Hypotheses
The null hypothesis of ANOVA is that the population means µi are the same for all treatments. (Throughout this chapter, we'll use subscripts on various quantities to indicate the group that each refers to. Generically, we'll refer to population i, which will have mean µi; for example, the mean of group 3 will be written as µ3.) Under the null hypothesis, the sample means Y¯i differ from each other solely because of random sampling error. The alternative hypothesis is that the mean phase shift is not the same in all three light-treatment populations. H0: µ1 = µ2 = µ3. HA: At least one µi is different from the others. The alternative hypothesis does not state that every mean is different from all the others, but only that at least one mean stands apart. Rejecting H0 in ANOVA is evidence that the mean of at least one group is different from the others.
ANOVA in a nutshell
Even if all the groups in a study had the same true mean, the data would likely show a different sample mean for each group. This is because of sampling error—the chance difference between a sample estimate and the true value of a population parameter caused by random sampling. Thus, we expect to see variation among sample means taken from different groups even when the null hypothesis is true and the groups all have the same mean. The key insight of ANOVA is that we can estimate how much variation among group means ought to be present from sampling error alone if the null hypothesis is true. If there really are no differences among the populations, then taking a random sample from each population is equivalent to taking the same number of random samples from a single population. As we learned in Chapter 4, when we talked about the sampling distribution of the mean, the amount of variation we expect to see among the sample means of repeated random samples from a population is directly related to sample size and the amount of variation we see among subjects within the samples. In contrast, if the null hypothesis is not true, then there are real differences in means among the groups. There are still differences among sample means caused by chance, but on top of that there are differences among sample means caused by real variation among the population means. ANOVA lets us determine whether there is more variance among the sample means than we would expect by chance alone. If so, then we can infer that there are real differences among the population means. That's the basic idea behind ANOVA. To take it one step further, we introduce the two measures of variation that are calculated from the data and compared in a test of the null hypothesis. In the terminology of this chapter, both quantities are called "mean squares." The group mean square (MSgroups) is proportional to the observed amount of variation among the group sample means. You can think of this quantity as representing the variation among the sampled subjects that belong to different groups. The error mean square (MSerror) estimates the variance among subjects that belong to the same group. It is analogous to the pooled sample variance in two-sample comparisons. Under the null hypothesis that the true means of groups do not differ, individuals belonging to different groups will on average be no more different from one another than individuals belonging to the same group. The group mean square and the error mean square should be equal (except by chance). But if the null hypothesis is false, we expect the group mean square to exceed the error mean square. In this case the variation among individuals belonging to different groups is expected to be greater than the variation among subjects belonging to the same group. The comparison of mean squares is done with an F-ratio: F = (group mean square)/(error mean square) = MSgroups/MSerror. If the null hypothesis is true, and the means do not differ, the group and error mean squares on average will be similar and F should be close to 1. If the null hypothesis is false, the real differences among group means should inflate the group mean square and F is expected to exceed 1. Those are the only two possibilities. In the following sections, we'll show how to calculate the mean squares and carry out the formal test using the F-distribution.

ANOVA tables
If you run an analysis of variance on a data set using the computer, the results will likely include an ANOVA table.
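For example, here is one way to produce such a result in Python with scipy's one-way ANOVA function, using the phase-shift data from Table 15.1-1:

```python
from scipy import stats

# Phase-shift data (hours) from Table 15.1-1
control = [0.53, 0.36, 0.20, -0.37, -0.60, -0.64, -0.68, -1.27]
knees   = [0.73, 0.31, 0.03, -0.29, -0.56, -0.96, -1.61]
eyes    = [-0.78, -0.86, -1.35, -1.48, -1.52, -2.04, -2.83]

result = stats.f_oneway(control, knees, eyes)
print(round(result.statistic, 2), round(result.pvalue, 3))
# F = 7.29, P = 0.004, matching the ANOVA table below
```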
In Table 15.1-2, we show the ANOVA table for the circadian rhythm experiment discussed in Example 15.1.

TABLE 15.1-2 ANOVA table for the results of the circadian rhythm experiment (Example 15.1).

Source of variation   Sum of squares   df   Mean squares   F-ratio   P
Groups (treatment)    7.224            2    3.6122         7.29      0.004
Error                 9.415            19   0.4955
Total                 16.639           21

The table organizes all the computations leading to a test of the null hypothesis of no differences among population means. It includes the group and error mean squares as well as their ratio, F. The mean squares are computed from other quantities that we'll learn about in the next section: sums of squares and their degrees of freedom (df). We'll see in the next two sections how these numbers are calculated and interpreted. ANOVA tables also show the P-value for the test of the null hypothesis. For the circadian rhythm data, the P-value is 0.004, allowing us to reject the null hypothesis that the mean phase shift is the same for all three treatment populations.

Partitioning the sum of squares

We'll show you the ANOVA calculations in stages, beginning with the sums of squares. This step separates the two sources of variation in the data: within groups and among groups. In the formulas that follow, we refer to an individual observation as Yij. Here, i refers to the group to which the individual belongs, and j refers generally to the jth individual in that group. As you can see in Figure 15.1-2, the deviation between each observation Yij and the grand mean Y¯ (the mean of all the observations) can be split into two parts:

$$Y_{ij} - \bar{Y} = (Y_{ij} - \bar{Y}_i) + (\bar{Y}_i - \bar{Y}).$$

The first part of the split, $(Y_{ij} - \bar{Y}_i)$, is the deviation between each observation and its group mean. These deviations are illustrated with vertical lines in the right panel of Figure 15.1-2. The second part of the split, $(\bar{Y}_i - \bar{Y})$, is the deviation between the mean of the group to which the observation belongs and the grand mean. These deviations are illustrated for the circadian rhythm data in the middle panel of Figure 15.1-2.

FIGURE 15.1-2 Depiction of the portions of variation in the circadian rhythm data (Example 15.1). Each dot is the measurement of phase shift in a single subject. The long horizontal line in black is the grand mean, Y¯. Short horizontal lines in red are the sample means of the three light-treatment groups. Vertical lines represent deviations.

Repeating this split for all observations, and then squaring the deviations and taking their sum, partitions the total variation in the data set into its within- and among-group components:

$$SS_{\text{total}} = \sum_i \sum_j (Y_{ij} - \bar{Y})^2 = \sum_i \sum_j (Y_{ij} - \bar{Y}_i)^2 + \sum_i n_i (\bar{Y}_i - \bar{Y})^2,$$

where ni is the number of observations in group i.5 The term on the far left of the equation is the total sum of squares, symbolized as SStotal. The two parts on the right of the equal sign are the error sum of squares, SSerror, and the group sum of squares, SSgroups. In other words,

$$SS_{\text{total}} = SS_{\text{error}} + SS_{\text{groups}}.$$

To begin our calculations we also need the grand mean Y¯ (the mean of all the data from all groups combined):

$$\bar{Y} = \frac{\sum_i n_i \bar{Y}_i}{N},$$

where N is the total sample size, $N = \sum_i n_i$. This equation for the grand mean is equivalent to adding up all the individual observations from all the groups and dividing by the total number of measurements. The grand mean is not the same as the average of the group sample means if the sample size is not the same in every group. For the circadian rhythm data,

$$\bar{Y} = \frac{8(-0.3088) + 7(-0.3357) + 7(-1.5514)}{22} = -0.7127.$$

The group sum of squares is then calculated as

$$SS_{\text{groups}} = \sum_i n_i (\bar{Y}_i - \bar{Y})^2.$$
For the circadian rhythm data, this is

$$SS_{\text{groups}} = 8[-0.3088 - (-0.7127)]^2 + 7[-0.3357 - (-0.7127)]^2 + 7[-1.5514 - (-0.7127)]^2 = 7.224.$$

The error sum of squares can be calculated as

$$SS_{\text{error}} = \sum_i \sum_j (Y_{ij} - \bar{Y}_i)^2 = \sum_i s_i^2 (n_i - 1),$$

where si is the sample standard deviation of group i. For the circadian rhythm data,

$$SS_{\text{error}} = (0.6176)^2(8-1) + (0.7908)^2(7-1) + (0.7063)^2(7-1) = 9.415.$$

The value of SStotal is then calculated from the other two quantities by addition:

$$SS_{\text{total}} = SS_{\text{error}} + SS_{\text{groups}} = 9.415 + 7.224 = 16.639.$$

These sums of squares go into the second column of the ANOVA table (Table 15.1-2).

Calculating the mean squares

The group mean square, symbolized by MSgroups, is calculated from the deviations of group sample means (Y¯i) around the grand mean of all the measurements (Y¯):

$$MS_{\text{groups}} = \frac{SS_{\text{groups}}}{df_{\text{groups}}},$$

where dfgroups is the number of degrees of freedom for groups. The degrees of freedom for groups is one less than the number of groups: dfgroups = k − 1, where k is the number of groups. MSgroups measures the observed amount of variation among the sample means from different groups. It represents the variation among subjects that belong to different groups. For the circadian rhythm data,

$$MS_{\text{groups}} = \frac{SS_{\text{groups}}}{df_{\text{groups}}} = \frac{7.224}{3-1} = 3.6122.$$

The group mean square of ANOVA represents variation among the sampled individuals belonging to different groups. It will on average be similar to the error mean square if population means are equal.

The error mean square, symbolized as MSerror, measures variance within groups. ANOVA assumes that σ2 (i.e., the variance of Y) is the same in every population. In other words, we assume that

$$\sigma^2 = \sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2$$

for all k groups. The best estimate of this variance within groups is the pooled sample variance, just as in the two-sample t-test (Chapter 12). In ANOVA, the pooled sample variance is called the error mean square, and it is calculated as

$$MS_{\text{error}} = \frac{SS_{\text{error}}}{df_{\text{error}}}.$$

We showed how to calculate the error sum of squares (SSerror) earlier in this section. The denominator of this formula is the number of degrees of freedom for error, which is just the sum of the degrees of freedom for the different groups:

$$df_{\text{error}} = \sum_i (n_i - 1) = N - k.$$

N is the total number of data points in all groups combined: $N = \sum_i n_i$. For the circadian rhythm data in Table 15.1-1,

$$MS_{\text{error}} = \frac{SS_{\text{error}}}{df_{\text{error}}} = \frac{9.415}{22-3} = 0.4955.$$

The degrees of freedom and the mean squares go into the ANOVA table next to the sums of squares (Table 15.1-2).

The error mean square of ANOVA is the pooled sample variance, a measure of the variation among individuals within the same groups.

The variance ratio, F

Under the null hypothesis that the population means of all groups are the same, the variation among individuals belonging to different groups (represented by MSgroups) will on average be the same as the variation among individuals belonging to the same group (estimated by MSerror). In ANOVA, therefore, we test for a difference by calculating the ratio of MSgroups over MSerror:

$$F = \frac{MS_{\text{groups}}}{MS_{\text{error}}}.$$

This F-ratio is the test statistic in analysis of variance. Under the null hypothesis, F will on average lie close to one, differing from it only because of sampling variation in the numerator and denominator. If the null hypothesis is false, however, and the alternative hypothesis is correct, then MSgroups should exceed MSerror and we expect F to be greater than one. For the circadian rhythm data,

$$F = \frac{3.6122}{0.4955} = 7.29.$$

To calculate the P-value, we need the sampling distribution of the F-statistic under H0. This null distribution for the F-statistic is called the F-distribution.
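All of the arithmetic above can be reproduced from the summary statistics in Table 15.1-1. A minimal sketch (variable names are ours):

```python
# Recompute the quantities in Table 15.1-2 from the group means, SDs, and
# sample sizes of the circadian rhythm experiment.
means = [-0.3088, -0.3357, -1.5514]    # control, knees, eyes
sds   = [0.6176, 0.7908, 0.7063]
ns    = [8, 7, 7]

N, k = sum(ns), len(ns)
grand_mean = sum(n * m for n, m in zip(ns, means)) / N        # -0.7127

ss_groups = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
ss_error  = sum(s ** 2 * (n - 1) for s, n in zip(sds, ns))

ms_groups = ss_groups / (k - 1)        # df_groups = 2
ms_error  = ss_error / (N - k)         # df_error = 19
F = ms_groups / ms_error
print(round(ss_groups, 3), round(ss_error, 3), round(F, 2))   # 7.224 9.415 7.29
```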
The F-distribution has a pair of degrees of freedom, one for the numerator (top) of the F-ratio and a second for the denominator (bottom). The numerator (MSgroups) has k − 1 degrees of freedom, and the denominator (MSerror) has N − k degrees of freedom. For the circadian rhythm data, N = 22 and k = 3, so there are k − 1 = 3 − 1 = 2 degrees of freedom for the numerator and N − k = 22 − 3 = 19 degrees of freedom for the denominator. The F-distribution with 2 and 19 df is therefore the appropriate null distribution for F.

The number of degrees of freedom for the numerator is always presented first when specifying the F-distribution. This is important because the F-distribution with 2 and 19 degrees of freedom is not the same as the F-distribution having 19 and 2 degrees of freedom.

Figure 15.1-3 illustrates the F-distribution with 2 and 19 degrees of freedom. The distribution ranges from zero to positive infinity. The right tail of the curve is the part we are interested in. This is because if H0 is false, then MSgroups should exceed MSerror and lead to an F-ratio in the right tail of the F-distribution. F-ratios less than one might occur, but only by chance.

FIGURE 15.1-3 The F-distribution with 2 and 19 degrees of freedom. The value of F ranges from zero to positive infinity. The area under the curve to the right of the critical value F = 3.52 (shaded) is 0.05. (See Statistical Table D.)

The critical value corresponding to the area of 0.05 in the right tail of the F-distribution is found in Statistical Table D. An excerpt is shown in Table 15.1-3.

TABLE 15.1-3 An excerpt from Statistical Table D, with critical values of the F-distribution corresponding to the significance level α(1) = 0.05.

                        Numerator df
Denominator df    1      2      3      4      5      6      7      8      9      10
10              4.96   4.10   3.71   3.48   3.33   3.22   3.14   3.07   3.02   2.98
11              4.84   3.98   3.59   3.36   3.20   3.09   3.01   2.95   2.90   2.85
12              4.75   3.89   3.49   3.26   3.11   3.00   2.91   2.85   2.80   2.75
13              4.67   3.81   3.41   3.18   3.03   2.92   2.83   2.77   2.71   2.67
14              4.60   3.74   3.34   3.11   2.96   2.85   2.76   2.70   2.65   2.60
15              4.54   3.68   3.29   3.06   2.90   2.79   2.71   2.64   2.59   2.54
16              4.49   3.63   3.24   3.01   2.85   2.74   2.66   2.59   2.54   2.49
17              4.45   3.59   3.20   2.96   2.81   2.70   2.61   2.55   2.49   2.45
18              4.41   3.55   3.16   2.93   2.77   2.66   2.58   2.51   2.46   2.41
19              4.38   3.52   3.13   2.90   2.74   2.63   2.54   2.48   2.42   2.38
20              4.35   3.49   3.10   2.87   2.71   2.60   2.51   2.45   2.39   2.35

To find the critical value of the F-distribution having 2 and 19 degrees of freedom, locate the cell in the table corresponding to 2 df in the numerator and 19 df in the denominator. This value, 3.52, is the entry for 2 and 19 df in Table 15.1-3. We write it as

$$F_{0.05(1),2,19} = 3.52.$$

In this formula, "(1)" indicates that we are looking only at the right tail of the F-distribution. In other words, the area under the curve in Figure 15.1-3 to the right of 3.52 is 0.05. Because our observed value of F (i.e., 7.29) is larger than 3.52, it lies farther out in the right tail of the F-distribution, so P must be less than 0.05. Therefore, we reject the null hypothesis. The exact P-value is the area under the curve of the F-distribution to the right of the observed F-value:

$$P = \Pr[F > 7.29].$$

If you analyzed the data with a statistics package on the computer, you would obtain this probability directly: P = 0.004. Again, P < 0.05, so we reject H0. We have evidence that mean phase shift differs among light treatments. Rejecting H0 indicates only that at least one of the population means µi is different from the others, not that every µi is different from all of the others.
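Statistical Table D can also be bypassed with a short computation. A sketch, assuming scipy is available:

```python
# Critical value and exact P-value for F = 7.29 with 2 and 19 df.
from scipy import stats

F_obs, df_num, df_den = 7.29, 2, 19
crit = stats.f.ppf(0.95, df_num, df_den)   # right-tail 5% point, about 3.52
p = stats.f.sf(F_obs, df_num, df_den)      # Pr[F > 7.29], about 0.004
print(round(crit, 2), round(p, 3))
```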
We show in Section 15.4 how to take this process further to decide which means are different.

Variation explained: R2

The R2 value ("R-squared") is used in ANOVA to summarize the contribution of group differences to the total variation in the data. The quantity is based on the fact that the total sum of squares can be split into its two parts, the error sum of squares and the group sum of squares: SStotal = SSerror + SSgroups. R2 is the group portion of variation expressed as a fraction of the total:

$$R^2 = \frac{SS_{\text{groups}}}{SS_{\text{total}}}.$$

It can be thought of, loosely, as the "fraction of the variation in Y that is explained by groups." It is a reflection of how much narrower the scatter of measurements is around the group means compared with the scatter around the grand mean (compare the first and third panels of Figure 15.1-2). R2 takes on values between zero and one. When R2 is close to zero, the group means are all very similar and most of the variability is within groups—that is, the explanatory variable defining the groups explains very little of the variation in Y. Conversely, an R2 close to one indicates that little variation in Y is left over after the different group means are taken into account—that is, the explanatory variable explains most of the variation in Y. For the circadian rhythm data,

$$R^2 = \frac{7.224}{16.639} = 0.43.$$

In other words, 43% of the total sum of squares among subjects in the magnitude of phase shift is explained by differences among them in light treatment. The remaining 57% of the variability among subjects is "error"—variance unexplained by the explanatory variable, light treatment.

R2 measures the fraction of the variation in Y that is explained by group differences.

ANOVA with two groups

The analysis of variance works even when k = 2. ANOVA and the two-sample t-test give identical results6 when testing the null hypothesis H0: µ1 = µ2. An advantage of the two-sample t-test is that it generalizes more easily to other hypothesized differences between means, such as H0: µ1 − µ2 = 10. Welch's t-test (Section 12.3) is additionally useful when the variances are very different between groups, whereas ANOVA requires more similar variances.

Assumptions and alternatives

The assumptions of analysis of variance are the same as those of the two-sample t-test, but they must hold for all k groups. To review:
■ The measurements in every group represent a random sample from the corresponding population.
■ The variable is normally distributed in each of the k populations.
■ The variance is the same in all k populations.
Methods to evaluate these assumptions were discussed in Chapter 13.

The robustness of ANOVA

ANOVA is surprisingly robust to deviations from the assumption of normality, particularly when the sample sizes are large. This robustness stems from a property of sample means described by the central limit theorem (Section 10.6)—that is, within each group, the sampling distribution of means is approximately normal when the sample size is large, even if the variable itself does not have a normal distribution (Chapter 13). ANOVA is also robust to departures from the assumption of equal variance in the k populations, but only if the samples are all large, about the same size, and there is no more than about a tenfold difference among the variances.

Data transformations

If the data do not conform to the assumptions of ANOVA, they can be transformed as described in Chapter 13.
Any of the transformations discussed there (e.g., log transformations and arcsine transformations) can be applied to ANOVA. If you get lucky, transforming the data will simultaneously make the data more normal and make the variances more equal, but this does not always happen.

Nonparametric alternatives to ANOVA

If the normality assumption of ANOVA is not met and transformations are unsuccessful, then there is a nonparametric alternative to single-factor ANOVA. The Kruskal-Wallis test, a nonparametric method based on ranks, is the equivalent of the Mann-Whitney U-test (Section 13.5) when there are more than two groups. It is sometimes referred to as analysis of variance based on ranks. It makes all the same assumptions as the Mann-Whitney U-test, but it is applied to more than two groups:
■ All group samples are random samples from the corresponding populations.
■ To use Kruskal-Wallis as a test of differences among populations in means or medians, the distribution of the variable must have the same shape in every population.

As in the Mann-Whitney U-test, the Kruskal-Wallis test begins by ranking the data from all groups together (employing the strategy for tied observations first described in Section 13.5). The sum of the ranks for each group, Ri, is then used to calculate the Kruskal-Wallis test statistic, H. The formula is given in the Quick Formula Summary (Section 15.8). In general, we recommend using a computer program for this procedure. Under the null hypothesis of no difference among populations, the sampling distribution of H is approximately χ2 with k − 1 degrees of freedom, where k is the number of groups. The null hypothesis is rejected if H is greater than or equal to the critical value from the appropriate χ2 distribution, $\chi^2_{\alpha,\,k-1}$.

As with the Mann-Whitney U-test, the Kruskal-Wallis test has little power when sample sizes are very small. Remember that power is the probability of rejecting a false null hypothesis, so more power is better. Therefore, ANOVA is preferred if its assumptions can be met. The Kruskal-Wallis test has nearly the same power as ANOVA when sample sizes are large.

Planned comparisons

Analysis of variance is the start, but not necessarily the end, of efforts to compare the means of more than two groups. Researchers typically want to answer two additional questions: "Which means are different?" and "What is the magnitude of the difference between means?" ANOVA by itself answers neither of these questions. There are two approaches to figuring out which means are different and by how much—namely, planned and unplanned comparisons of means.

A planned comparison is a comparison between means identified as being of crucial interest during the design of the study, identified prior to obtaining the data. A planned comparison must have a strong prior justification, such as an expectation from theory or a prior study.7 Only one or a small number of planned comparisons is allowed, to minimize inflating the Type I error rate. An unplanned comparison is one of multiple comparisons, such as between all pairs of means, carried out to help determine where differences between means lie. Unplanned comparisons represent a kind of data dredging (Interleaf 8), so it is necessary to protect against rising Type I error rates. Here we briefly describe an example of a planned comparison. Unplanned comparisons are covered in Section 15.4.

A planned comparison is a comparison between means planned during the design of the study, identified before the data are examined.
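Returning briefly to the Kruskal-Wallis test described above: in practice it is run on a computer. A minimal sketch using scipy, applied (purely for illustration) to the circadian rhythm data of Example 15.1:

```python
# Kruskal-Wallis test on the phase-shift data of Example 15.1 (illustration
# only; in the text these data were analyzed with ANOVA). scipy ranks the
# pooled data, computes H, and returns a chi-square-based P-value (k - 1 df).
from scipy import stats

control = [0.53, 0.36, 0.20, -0.37, -0.60, -0.64, -0.68, -1.27]
knees   = [0.73, 0.31, 0.03, -0.29, -0.56, -0.96, -1.61]
eyes    = [-0.78, -0.86, -1.35, -1.48, -1.52, -2.04, -2.83]

H, p = stats.kruskal(control, knees, eyes)
print(round(H, 2), round(p, 4))
```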
Planned comparison between two means

A good example of a planned comparison between two means comes from the circadian rhythm experiment (Example 15.1). The main point of that experiment was to contrast the mean phase shift of melatonin production between the knee-treatment group and the control group, which received no extra light. Because the experiment was built around this contrast, we are justified in using methods for planned comparisons to ask, "How big is the difference between the knee-treatment and control groups?" and "Is the difference between these two groups statistically significant?"

The method for a planned comparison between two means is almost the same as the two-sample comparison based on the t-distribution that we learned about in Chapter 12. Only the standard error is calculated differently: the planned comparison uses the pooled sample variance (the error mean square) based on all k groups (and the corresponding error degrees of freedom), rather than that based only on the two groups being compared. This step increases precision and power. The planned comparison method assumes, just as in ANOVA, that the variance is the same within all groups. Details of the modified formulas for a planned comparison are provided in the Quick Formula Summary (Section 15.8).

For example, let's examine the confidence interval for the difference between the means of the "knee" and "control" treatment groups. The difference between the sample means (knee treatment minus control) is small:

$$\bar{Y}_2 - \bar{Y}_1 = (-0.336) - (-0.309) = -0.027 \text{ h}.$$

The standard error for this difference is

$$SE = \sqrt{MS_{\text{error}} \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}.$$

For the knee versus control comparison, the standard error of the difference between these two means is SE = 0.364, which has N − k = 22 − 3 = 19 degrees of freedom, the same as that for the error mean square (Table 15.1-2). The critical value from the t-distribution is t0.05(2),19 = 2.09 (Statistical Table C), which leads to the planned 95% confidence interval for the difference between the population means:

$$(-0.027) - 0.364(2.09) < \mu_2 - \mu_1 < (-0.027) + 0.364(2.09)$$

or

$$-0.788 < \mu_2 - \mu_1 < 0.734,$$

where the units are hours. If we avoided using the planned-comparison method and simply used the two-sample method for a confidence interval introduced in Chapter 12, we would obtain the following 95% confidence interval instead:

$$-0.813 < \mu_2 - \mu_1 < 0.759.$$

Thus, the planned-comparison method has slightly higher precision. Similarly, the planned-comparison method for testing the null hypothesis of no difference between these two means has slightly higher power. These are the main advantages of using the planned-comparison methods, provided that the assumption of equal variance is met. Planned comparisons make all of the same assumptions as ANOVA—namely, random samples from populations, a normal distribution of the variable in every population, and equal variances in all populations. Because each comparison typically involves only one pair of means, planned comparisons are not as robust as ANOVA to violations of the assumptions.

Unplanned comparisons

The formulas for planned comparisons are not valid for unplanned comparisons, because unplanned comparisons need to make adjustments for the inflated false-positive rate (Type I errors) that accompanies multiple testing (Interleaf 8). The number of possible comparisons involving k groups is potentially large. Unplanned comparisons basically represent data dredging or "snooping" because we are poring through data to find differences.
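A numerical sketch of the planned knee-versus-control comparison above (Python with scipy; summary values taken from Tables 15.1-1 and 15.1-2):

```python
# 95% planned-comparison CI for (knees - control), using the pooled error
# mean square from the full three-group ANOVA.
from scipy import stats

ms_error, df_error = 0.4955, 19          # from the ANOVA table
mean_knees, n_knees = -0.3357, 7
mean_control, n_control = -0.3088, 8

diff = mean_knees - mean_control                          # about -0.027 h
se = (ms_error * (1 / n_knees + 1 / n_control)) ** 0.5    # about 0.364
t_crit = stats.t.ppf(0.975, df_error)                     # about 2.09
lo, hi = diff - t_crit * se, diff + t_crit * se
print(round(lo, 3), round(hi, 3))   # close to (-0.788, 0.734), up to rounding
```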
Such data snooping is not inherently bad, provided that the method used corrects for the number of comparisons. Testing all pairs of means to find out which groups stand apart from the others is the most common type of unplanned comparison, and the Tukey-Kramer test is the most commonly used procedure for accomplishing this.8 The method assumes that we have already carried out an ANOVA and that the null hypothesis of no differences among means has been rejected. Example 15.4 illustrates unplanned comparisons with the Tukey-Kramer method.

EXAMPLE 15.4 Wood wide web

Most plants have underground associations with fungi, called mycorrhizae, that provide minerals and antibiotics to the plant in exchange for sugars. The fungi's hyphae (branching filaments) extend through the soil and may connect to other plants, even to other plant species, creating an underground network that might cause a trickle of nutrients to flow from plant to plant. Simard et al. (1997) measured the flow of carbon between seedlings of birch and Douglas-fir and tested whether the carbon flow rate depended on shading. Shaded trees may draw more carbon via the mycorrhizae than trees in full sun. In each of three shade treatments, five pairs of birch and Douglas-fir seedlings were planted and allowed to grow for one year. Then each birch was covered in a sealed bag for two hours and supplied with carbon dioxide (CO2) whose carbon consisted entirely of the carbon-13 isotope (atmospheric CO2 is made up almost entirely of carbon-12). The same was done to its partnered fir seedling, except that CO2 with carbon-14 was used. The amounts of carbon-13 and carbon-14 present in the tissues of both plants of each pair were measured nine days later. Because different isotopes of carbon were used on birch and Douglas-fir, it was possible to calculate the amount of carbon transferred from each plant to the other. Most transfer occurred from birch to Douglas-fir. Descriptive statistics for the average net carbon gain by Douglas-fir, in milligrams, are given in Table 15.4-1.

TABLE 15.4-1 Summary of the net amount of carbon transferred from birch to Douglas-fir (Example 15.4).

Shade treatment   Sample mean Y¯i (mg)   Sample standard deviation si   ni
Deep shade        18.33                  6.98                           5
Partial shade     8.29                   4.76                           5
No shade          5.21                   3.00                           5

The ANOVA results for these data, testing the null hypothesis of no differences among treatment means, are shown in Table 15.4-2. The different shade treatments indeed led to differences in the mean net carbon gain by Douglas-fir seedlings, as shown by the high F-ratio and low P-value. It remains to be seen, though, which means are different. That is, are all of the means detectably different from each other? If not, which of the means are different from the others?

TABLE 15.4-2 ANOVA table summarizing results of the Douglas-fir carbon-transfer data.

Source of variation   Sum of squares   df   Mean squares   F-ratio   P
Groups (treatments)   470.704          2    235.352        8.784     0.004
Error                 321.512          12   26.793
Total                 792.216          14

Testing all pairs of means using the Tukey-Kramer method

The Tukey-Kramer method works like a series of two-sample t-tests, but it uses a larger critical value to limit the Type I error rate. With the Tukey-Kramer test, the probability of making at least one Type I error throughout the course of testing all pairs of means is no greater than our significance level α.
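The q statistics reported below in Table 15.4-3 can be reproduced from these summary statistics. A sketch, assuming scipy (version 1.7 or later) is available; note that the q of this book's Statistical Table F appears to equal the standard studentized range statistic divided by √2, so we rescale scipy's critical value to match that convention:

```python
# Tukey-Kramer comparisons for Example 15.4, computed from the summary
# statistics in Tables 15.4-1 and 15.4-2 (n = 5 seedling pairs per group).
from math import sqrt
from scipy.stats import studentized_range

ms_error, k, N, n = 26.793, 3, 15, 5
means = {"deep": 18.33, "partial": 8.29, "no": 5.21}

# Critical value: studentized range rescaled to this book's q convention.
q_crit = studentized_range.ppf(0.95, k, N - k) / sqrt(2)   # about 2.67

for i, j in [("deep", "no"), ("deep", "partial"), ("partial", "no")]:
    se = sqrt(ms_error * (1 / n + 1 / n))                  # 3.2737 here
    q = (means[i] - means[j]) / se
    verdict = "reject H0" if q > q_crit else "do not reject H0"
    print(i, "vs", j, round(q, 3), verdict)
```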
To carry out the Tukey-Kramer procedure, the means must first be ordered from smallest to largest:

No shade        Y¯3 = 5.21
Partial shade   Y¯2 = 8.29
Deep shade      Y¯1 = 18.33

Then, compare every pair of means in turn. For example, the hypotheses for the comparison between the means for "deep shade" (call it µ1) and "partial shade" (call it µ2) are as follows.

H0: µ1 − µ2 = 0.
HA: µ1 − µ2 ≠ 0.

The test statistic (q) is calculated using a standard error based on the MSerror. This test statistic is then compared with the q-distribution having k and N − k degrees of freedom. Critical values are provided in Statistical Table F. Calculation details are given in the Quick Formula Summary (Section 15.8). The results for all the pairwise tests in the carbon-transfer data are given in Table 15.4-3.

TABLE 15.4-3 Summary of Tukey-Kramer tests carried out on the results of Example 15.4.

Group i   Group j   Y¯i−Y¯j   SE       Test q   Critical value q0.05,3,12   Conclusion
Deep      No        13.12     3.2737   4.008    2.67                        Reject H0
Deep      Partial   10.04     3.2737   3.067    2.67                        Reject H0
Partial   No        3.08      3.2737   0.941    2.67                        Do not reject H0

The results are unambiguous. The mean of the deep-shade group is different from that of both the partial-shade and no-shade groups, whereas the means of the partial-shade and no-shade groups are not significantly different from each other.

The results of the Tukey-Kramer procedure are often indicated using symbols, such as those shown in Figure 15.4-1 for displaying the means. Groups in the figure are assigned the same symbol if their means are not significantly different, based on the unplanned comparisons (e.g., the partial-shade and no-shade treatments are both assigned the symbol "b" in Figure 15.4-1). Sample means in the figure are assigned a unique symbol if they are different from all other means (e.g., the deep-shade treatment is given the symbol "a" in Figure 15.4-1).9

FIGURE 15.4-1 Using symbols to indicate the outcome of Tukey-Kramer tests of all pairs of means. Dots and error bars indicate means ±1 SE of the three treatment groups in Example 15.4. Two means are assigned a different symbol if they are significantly different ("a" vs. "b"), whereas means are assigned the same symbol ("b" in this case) if they are not significantly different.

The Tukey-Kramer results do not imply that the partial-shade and no-shade treatments have the same mean carbon transfer rates. Assuming that they do would be making the mistake of accepting a null hypothesis. The results merely indicate that a test of their differences did not reject H0.

With the Tukey-Kramer method, the probability of making at least one Type I error throughout the course of testing all pairs of means is no greater than the significance level α.

Assumptions

The Tukey-Kramer method makes all of the same assumptions as ANOVA—namely, random samples, a normal distribution of the variable in every population, and equal variances in all groups. Because each comparison tests only one pair of means at a time, rather than all means simultaneously, the method is not as robust as ANOVA to violations of these assumptions.

The P-value for the Tukey-Kramer test is exact when the experimental design is balanced—that is, when the sample size is the same in every group (n1 = n2 = . . . = nk). If the sample sizes are different, then the Tukey-Kramer test is conservative, which means that the real probability of making at least one Type I error, when testing all pairs of means, is smaller than the stated α.
This makes it harder to reject H0, which is why the test is deemed "conservative."

Fixed and random effects

Up to now, we have been analyzing fixed groups—namely, studies in which the different categories of the explanatory variable are predetermined, of direct interest, and repeatable. ANOVA on fixed groups is called fixed-effects analysis of variance. Other examples of fixed effects include
■ Alternative medical treatments in a clinical trial
■ Fixed doses of a toxin
■ Different heights above low tide in the intertidal zone
■ Different sexes or age categories of individuals

Any conclusions reached about differences among fixed groups apply only to those fixed groups. If a difference among drug treatments was found in some response variable, for example, we could not generalize the results to other drugs not included in the study.

By contrast, there is a second type of ANOVA in which the groups are not fixed, but instead are randomly chosen. Randomly chosen groups are not predetermined, but instead are randomly sampled from a much larger "population" of groups. ANOVA applied to random groups is called random-effects ANOVA.10 Examples of random effects include
■ Family, in a study of resemblance among relatives in IQ scores
■ Subject or individual, in a study involving repeated measurements of individuals

Because random effects are randomly sampled from a population, conclusions reached about differences among groups can be generalized to the whole population of groups. With random effects, the specific groups included in a study are ephemeral and would not typically be reused. For example, consider a study to investigate whether families differ from one another in the mean IQ scores of their children. This study would begin with a random sample of families from a population of families. "Family" would be used as the group variable in an ANOVA, and the replicates would be the different children making up each family. A later study attempting to address the same question in the same population would not attempt to relocate the same families used in the previous study, because the population, not the families themselves, is the target of study. Rather, a new study would begin with a new random sample of families and would again be able to generalize the results to the whole population. In contrast, a new study of the effects of light treatment—a fixed effect—on circadian rhythm could quite easily use the same light treatments as previous studies.

An explanatory variable is called a fixed effect if the groups are predefined and are of direct interest. An explanatory variable is called a random effect if the groups are randomly sampled from a population of possible groups.

F-tests of differences among group means for fixed effects are changed when ANOVA is expanded to examine the effects of more than one explanatory variable simultaneously. We discuss some of these issues in Chapter 18.

ANOVA with randomly chosen groups

Because the groups are not of specific interest in a random-effects ANOVA, planned and unplanned comparisons aren't particularly useful. The main use of random-effects ANOVA is to estimate variance components—the amount of the variance in the data that is among random groups and the amount that is within groups. Among other uses, variance components are employed in animal and plant breeding to identify the contributions of genes and the environment to variance in traits. Variance components are also useful for quantifying measurement error in data, as shown in Example 15.6.
EXAMPLE 15.6 Walking-stick limbs

The walking stick Timema cristinae is a wingless herbivorous insect that lives on plants in chaparral habitats of California. In a study of the insect's adaptations to different plant species, Nosil and Crespi (2006) measured a variety of traits using digital photographs of specimens collected from a study site in California. They used a computer to measure various traits on the photographs. Because the researchers were concerned about measurement error, they took two separate photographs of each specimen. After taking the first photo, they returned the insect to storage and then retrieved it again for the second photograph. After measuring traits on one set of photographs, they repeated the measurements on the second set. Very often the result was different the second time around, indicating measurement error. How large was the measurement error compared with real variation among individuals in the trait? Figure 15.6-1 illustrates the two measurements of femur length made on 25 specimens. The data are listed in Table 15.6-1.

FIGURE 15.6-1 Strip chart showing the pair of femur length measurements (connected by a line segment) obtained from separate photographs of each of 25 walking sticks.

TABLE 15.6-1 Femur length, in centimeters, measured from separate photographs of 25 walking sticks. Two measurements were made per specimen to evaluate measurement error.

Specimen   Femur length (cm)
1          0.26, 0.26
2          0.23, 0.19
3          0.25, 0.23
4          0.26, 0.26
5          0.23, 0.22
6          0.23, 0.23
7          0.22, 0.23
8          0.21, 0.28
9          0.24, 0.26
10         0.24, 0.20
11         0.29, 0.25
12         0.23, 0.23
13         0.18, 0.19
14         0.27, 0.24
15         0.23, 0.29
16         0.23, 0.23
17         0.14, 0.15
18         0.19, 0.19
19         0.31, 0.27
20         0.23, 0.24
21         0.16, 0.15
22         0.22, 0.20
23         0.19, 0.18
24         0.21, 0.21
25         0.19, 0.20

ANOVA calculations

The ANOVA results, which are needed to calculate the variance components of the walking-stick data, are presented in Table 15.6-2. With only one explanatory variable, the calculations for random-effects ANOVA are the same as for fixed effects. The "groups" are the individual insects, and the replicates are the repeat measurements made of each insect. Because our goal here is to estimate the variance components, we refrain from hypothesis testing.

TABLE 15.6-2 ANOVA table with results of repeat femur-length measurements of 25 walking sticks.

Source of variation           Sum of squares   df   Mean squares
Groups (individual insects)   0.059132         24   0.002464
Error                         0.008900         25   0.000356
Total                         0.068032         49

Variance components

Single-factor ANOVA with random effects differs from fixed-effects ANOVA by having two levels of random variation11 in the response variable Y. The first level is variation within groups, and the second level is variation between groups. In the stick insect study (Example 15.6), the first level (variation within groups) is the variance among repeat measurements made on the same individual, which is exclusively due to measurement error in the study. ANOVA assumes that the "true" variance between measurements is the same in every group. In Example 15.6, this means that measurement error is the same for every individual insect, except by chance. We use the symbol σ2 to indicate the value of the variance within groups in the population. The single best estimate of σ2 is MSerror, the error mean square. In the walking sticks, MSerror = 0.000356 (Table 15.6-2).

To evaluate the second level of random variation (i.e., variation between groups), each group is assumed to have its own mean.
For the walking sticks, the mean femur length of an individual insect is its "true" femur length—the length we would obtain if we measured its femur a great many times and took the average measurement. Random-effects ANOVA assumes that the group means have a normal probability distribution in the population, with a grand mean µA (the mean of all the group means) and a variance σA2 (the variance among the group means in the population of groups). In our study of the walking sticks, we assume that true femur length varies between individual insects in the population according to a normal distribution. The parameter σA2 is the variance among insects in the population in their average femur length.

In random-effects ANOVA, the parameters σ2 and σA2 are called variance components. Together they describe all the variance in the response variable Y—that is, σ2 describes the variance within groups (e.g., the measurement error in the walking-stick study), and σA2 describes the variance among groups (e.g., the differences between the true femur lengths of individual insects).

The variance components in a random-effects ANOVA make up all the variance in the response variable: variance within groups (σ2) and variance among groups (σA2).

If sample size is the same in all groups (the design is balanced), the group mean square from the ANOVA results (Table 15.6-2) can be used to estimate σA2. Using the symbol sA2 to indicate the estimate of σA2,

$$s_A^2 = \frac{MS_{\text{groups}} - MS_{\text{error}}}{n},$$

where n is the number of measurements taken within each group.12 There were n = 2 measurements for each insect specimen in the data for Example 15.6, so

$$s_A^2 = \frac{0.002464 - 0.000356}{2} = 0.00105 \text{ cm}^2.$$

Thus, the best estimate of the variance in femur length among individuals in this population is 0.00105.

Repeatability

Repeatability is the fraction of the summed variance that is present among groups:

$$\text{Repeatability} = \frac{s_A^2}{s_A^2 + MS_{\text{error}}}.$$

The denominator of the repeatability equation (i.e., sA2 + MSerror) estimates the total variance of measurements in the population, summing the variance among groups and the variance within groups. Repeatability measures the overall similarity of repeat measurements made on the same group.13 A repeatability near zero, the lowest possible value, indicates that nearly all of the variance in the response variable results from differences between separate measurements made on the same group. In our example, this would mean that measurement error greatly dominates the variation found in the data. In contrast, a repeatability near one, the maximum value, indicates that repeated measurements on the same group give nearly the same answer every time. Calculated on the walking-stick data,

$$\text{Repeatability} = \frac{0.00105}{0.00105 + 0.000356} = 0.75.$$

In other words, an estimated 75% of the total variance in femur length measurements in the population is the result of differences in true femur length between individual insects. The remaining 25% is the result of measurement error. The repeatability of femur length is not one, because the walking sticks are small and their femurs are even smaller. Slight variation in the position of the insect when a photograph is taken can lead to different length measurements. It is important, therefore, to report repeatability estimates in your research papers. If measurement error is present, then it is best to take several measurements of each specimen and use their average.

Don't confuse repeatability with the R2 value (Section 15.1).
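The variance-component arithmetic above is simple to script. A minimal sketch (mean squares from Table 15.6-2, with n = 2 measurements per insect):

```python
# Estimate the among-group variance component and repeatability for the
# walking-stick femur data (random-effects ANOVA, balanced design).
ms_groups, ms_error, n = 0.002464, 0.000356, 2

s2_A = (ms_groups - ms_error) / n          # variance among true femur lengths
repeatability = s2_A / (s2_A + ms_error)
print(round(s2_A, 5), round(repeatability, 2))   # 0.00105 0.75
```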
Repeatability reflects the magnitudes of variance components, which estimate specific population parameters. It applies only to random effects. R2, on the other hand, isn't an estimate of any population parameter. R2 just measures the reduction in the amount of scatter around the group means compared with that around the grand mean. R2 is based on sums of squares rather than variances, and it can be applied to fixed or random effects.

Assumptions

Random-effects ANOVA makes all of the same assumptions as fixed-effects ANOVA, but adds two more. It assumes that groups are randomly sampled. Also, random-effects ANOVA assumes that the group means have a normal distribution in the population.

Summary

■ The analysis of variance (ANOVA) tests differences among the means of multiple groups.
■ ANOVA works by comparing the variance among subjects within groups (the error mean square, MSerror) with the variation among the sampled individuals belonging to different groups (represented by the group mean square, MSgroups).
■ The test statistic in ANOVA is the F-ratio, where F = MSgroups/MSerror.
■ Under H0, the F-ratio should be close to one except by chance. Under HA, the F-ratio is expected to exceed one. H0 is rejected if F is much larger than one. When F is significantly greater than one, it implies that there is real variance among group means, more than is expected from sampling error alone.
■ The quantities needed to test hypotheses or estimate variance components in single-factor ANOVA are summarized in an ANOVA table.
■ R2 measures the fraction of the variability explained by the explanatory (group) variable in analysis of variance. It measures the reduction in the amount of scatter of the measurements around their group means compared with that around the grand mean.
■ ANOVA assumes that the variable has a normal distribution in each population, with equal variance in all populations.
■ ANOVA is robust to departures from the assumption of normal populations, especially if the sample size is large. ANOVA is also robust to moderate departures from the assumption of equal standard deviations if the study design is balanced with large samples.
■ The Kruskal-Wallis test is an alternative, nonparametric method used to test the null hypothesis that the distributions of the variable in different groups are the same. It can be used if the normality assumption of ANOVA cannot be met.
■ When used to test the null hypothesis that population means or medians differ between groups, the Kruskal-Wallis test assumes that the frequency distributions of measurements in different groups have the same shape (i.e., the same assumption as the Mann-Whitney U-test).
■ Planned comparisons between means are few in number and represent only comparisons identified as crucial before the data are collected and analyzed. Unplanned comparisons are a more comprehensive set of comparisons done in search of interesting patterns. Unplanned comparisons require special methods to protect against high Type I error rates.
■ The Tukey-Kramer method to compare all pairs of means is the most commonly used method for unplanned comparisons.
■ In fixed-effects ANOVA, the groups are predetermined, repeatable categories of direct interest. The results of ANOVA apply only to those groups included in the study.
■ In random-effects ANOVA, groups are a random sample from a population of groups. The results of random-effects ANOVA can be generalized to the population of groups.
■ The repeatability estimates the fraction of the total variance that is present among groups in random-effects ANOVA. Repeatability is frequently used to evaluate the importance of measurement error.

Quick Formula Summary

Analysis of variance (ANOVA)

What is it for? Testing the difference among means of k groups simultaneously.
What does it assume? The variable is normally distributed with equal standard deviations (and therefore equal variances) in all k populations. Each sample is a random sample.
Test statistic: F
Sampling distribution under H0: F-distribution with k − 1 and N − k degrees of freedom. Use the right tail of the F-distribution in ANOVA.
Formulas:

Source of variation   Sum of squares                                            df      Mean squares                                                    F-ratio
Groups                $SS_{\text{groups}} = \sum_i n_i(\bar{Y}_i - \bar{Y})^2$   k − 1   $MS_{\text{groups}} = SS_{\text{groups}}/df_{\text{groups}}$   $MS_{\text{groups}}/MS_{\text{error}}$
Error                 $SS_{\text{error}} = \sum_i s_i^2(n_i - 1)$               N − k   $MS_{\text{error}} = SS_{\text{error}}/df_{\text{error}}$
Total                 SSgroups + SSerror                                         N − 1

In these formulas, ni is the sample size in group i, k is the number of groups, $\bar{Y} = \frac{\sum_i n_i \bar{Y}_i}{N}$ is the grand mean, and $N = \sum_i n_i$ is the total sample size.

R squared (R2)

What is it for? Measuring the fraction of the variation in Y that is "explained" by differences among groups.
Formula: $R^2 = \frac{SS_{\text{groups}}}{SS_{\text{total}}}$, where SSgroups is the sum of squares for groups and SStotal is the total sum of squares.

Kruskal-Wallis test

What is it for? Testing differences among k populations in the means or medians of their distributions.
What does it assume? Random samples. The frequency distributions of measurements in the different groups have the same shape.
Test statistic: H
Sampling distribution under H0: Approximately χ2 with k − 1 degrees of freedom.
Formula: $H = \frac{12}{N(N+1)} \sum_i \frac{R_i^2}{n_i} - 3(N+1)$, where N is the total sample size and Ri is the sum of the ranks for group i. Observations are ranked from small to large, as described in Section 13.5 for the Mann-Whitney U-test.

Planned confidence interval for the difference between two of k means

What is it for? Estimating the difference between means of two out of k populations when the comparison is planned before the experiment.
What does it assume? Random samples. The variable is normally distributed with equal variances in all k populations.
Estimate: Y¯i − Y¯j, where i and j are the two populations, with i ≠ j.
Parameter: µi − µj
Formula: $(\bar{Y}_i - \bar{Y}_j) - SE\,t_{0.05(2),N-k} < \mu_i - \mu_j < (\bar{Y}_i - \bar{Y}_j) + SE\,t_{0.05(2),N-k}$, where $SE = \sqrt{MS_{\text{error}}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$.

Planned test of the difference between two of k means

What is it for? Testing the difference between two out of k population means when the comparison is planned.
What does it assume? Random samples. The variable is normally distributed with equal variances in all k populations.
Test statistic: t
Distribution under H0: The t-distribution with N − k degrees of freedom.
Formula: $t = \frac{\bar{Y}_i - \bar{Y}_j}{SE}$, where Y¯i − Y¯j is the observed difference between the means of two groups i and j (with i ≠ j), and $SE = \sqrt{MS_{\text{error}}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$ is the standard error of the difference between the two means.

Tukey-Kramer test of all pairs of means

What is it for? Testing the differences of all pairs of k means. These tests are unplanned.
What does it assume? Random samples. The variable is normally distributed with equal variances in all k populations. An ANOVA has already rejected the null hypothesis of equal means for all k groups.
Test statistic: q
Sampling distribution under H0: The q-distribution with k means and N − k degrees of freedom.
Formula: $q = \frac{\bar{Y}_i - \bar{Y}_j}{SE}$, where Y¯i − Y¯j is the observed difference between the means of two groups i and j (with i ≠ j), $SE = \sqrt{MS_{\text{error}}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$, and ni and nj are the sample sizes for the two groups i and j, respectively.
Repeatability and variance components

What is it for? Repeatability is the fraction of the total variance that is among groups in a random-effects ANOVA.
What does it assume? Groups are a random sample from a population of groups. Repeat measurements within groups are randomly sampled and normally distributed, with an equal standard deviation (and variance) in all groups. Group means have a normal distribution in the population.
Formula: $\text{Repeatability} = \frac{s_A^2}{s_A^2 + MS_{\text{error}}}$, where $s_A^2 = \frac{MS_{\text{groups}} - MS_{\text{error}}}{n}$, and n is the sample size within each group (assumed to be equal for these formulas).

PRACTICE PROBLEMS

1. Calculation practice: Analysis of variance. Many humans like the effect of caffeine, but it occurs in plants as a deterrent to herbivory by animals. Caffeine is also found in flower nectar, and nectar is meant as a reward for pollinators, not a deterrent. How does caffeine in nectar affect visitation by pollinators? Singaravelan et al. (2005) set up feeding stations where bees were offered a choice between a control solution with 20% sucrose or a caffeinated solution with 20% sucrose plus some quantity of caffeine. Over the course of the experiment, four different concentrations of caffeine were provided: 50, 100, 150, and 200 ppm. The response variable was the difference between the amount of nectar consumed from the caffeine feeders and that removed from the control feeders at the same station (in grams). Here are the data and strip chart, including standard error bars:

50 ppm caffeine: −0.4, 0.34, 0.19, 0.05, −0.14
100 ppm caffeine: 0.01, −0.39, −0.08, −0.09, −0.31
150 ppm caffeine: 0.65, 0.53, 0.39, −0.15, 0.46
200 ppm caffeine: 0.24, 0.44, 0.13, 1.03, 0.05

Does the mean amount of nectar taken depend on the concentration of caffeine in the nectar? We will carry out an analysis of variance to find out.
a. State the null and alternative hypotheses appropriate for this question.
b. Calculate the following summary statistics for each group (i.e., for each caffeine treatment): ni, Y¯i, and si.
c. Set up an ANOVA table to keep track of your results. Add to this table through the remaining steps.
d. What is the error mean square, MSerror?
e. How many degrees of freedom are associated with error?
f. Calculate the estimate of the grand mean.
g. Calculate the group sum of squares.
h. Calculate the group degrees of freedom and the group mean square.
i. What is F for this example?
j. Use Statistical Table D (or a computer) to find the P-value for this test.

2. Calculation practice: Analyze results from a Tukey-Kramer test. Using the same data as in Practice Problem 1, use the results from a Tukey-Kramer test to illustrate which of the pairs of groups are significantly different in their means. The Tukey-Kramer test computations are given in the accompanying table (below, after part e).
a. What null hypotheses are being tested by the Tukey-Kramer procedure?
b. On a single line of text, write the sample means of the four treatment groups, from smallest to largest.
c. Begin from the largest sample mean on the right. Draw a line underneath this sample mean, and continue the line to the left until it underlines all sample means whose group means are not significantly different from the largest mean.
d. Now move to the next largest mean, second from right. Draw a line underneath this sample mean to the left until it underlines all sample means whose group means are not significantly different.
Keep this line if it extends farther to the left than the line from part (c); otherwise, discard it because it contains no new information.
e. Now move to the next largest mean, third from right. Repeat the procedure outlined in part (d). Continue the procedure in (c) and (d) until it has been carried out on all the means. When you reach the smallest mean, create a new underline if this group mean has not yet been underlined in a previous step.

TABLE FOR PROBLEM 2

Group i   Group j   Y¯i−Y¯j   SE      q      q0.05,4,12   Conclusion
200       100       0.550     0.190   2.89   2.86         Reject
200       50        0.370     0.190   1.94   2.86         Do not reject
200       150       0.002     0.190   0.01   2.86         Do not reject
150       100       0.548     0.190   2.88   2.86         Reject
150       50        0.368     0.190   1.93   2.86         Do not reject
50        100       0.180     0.190   0.95   2.86         Do not reject

f. Assign each unique line a symbol such as "a," "b," and so on. Under each of the four sample means, write the symbols for all the lines that underline it. Give a group more than one symbol (e.g., "a,b") if it is underlined by more than one unique underline.
g. Use the underlines and symbols from part (f) to summarize in words the overall results of the Tukey-Kramer test. That is, state which means group together on the basis of statistical significance.

3. Calculation practice: Repeatability. The following anonymized data show the midterm and final exam grades (%) for eight undergraduate students from a biostatistics class at a major university. The partial ANOVA table provides sums of squares and mean squares. What is the repeatability of grade performance?

Individual   Midterm grade   Final exam grade
1            78              81
2            84              65
3            94              75
4            82              62
5            58              60
6            62              86
7            81              92
8            80              89

Source       Sum of squares   df   Mean squares
Individual   1145.94          7    163.71
Error        956.50           8    119.56
Total        2102.44

a. Calculate sA2 from the mean squares and the sample size. What variance component does this quantity estimate?
b. Calculate the repeatability using sA2 and MSerror.
c. Interpret the repeatability you just calculated. What fraction of the variance among test scores is estimated to reflect true differences among students in performance, and what fraction is measurement variance from test to test within students?
d. What assumptions are you making when estimating the repeatability of test scores?

4. An important issue in conservation biology is how dispersal among populations influences the persistence of species in a fragmented landscape. Molofsky and Ferdy (2005) measured this in the annual plant Cardamine pensylvanica, a weed that produces explosively dispersed seeds. Four treatments were used to manipulate seed dispersal by changing the distance among experimental plant populations. These treatments were adjacent (continuous treatment), separated by 23.2 cm (medium), separated by 49.5 cm (long), or separated by partitions that blocked all seed dispersal among populations (isolated). Treatments were randomly assigned to plant populations. The data below are the number of generations that the populations persisted in four replicates of each treatment.

Treatment    Generations persisted
Isolated     13, 8, 8, 8
Medium       14, 12, 16, 16
Long         13, 9, 10, 11
Continuous   9, 13, 13, 16

a. What is the explanatory variable in this analysis? What is the response variable?
b. Is this an experimental study or an observational study? Explain.
c. Calculate the sample means for each group, and then calculate a confidence interval for the mean of each group. What assumptions have you made?
d. Display the data along with the means and confidence intervals in a graph.
5. Analysis of variance carried out on the Cardamine data of Practice Problem 4 yielded the following partial results:

Source of variation   Sum of squares   df   Mean squares   F-ratio   P
Groups                63.188
Error                 63.250
Total

a. What are the null and alternative hypotheses being tested?
b. Fill in the rest of the table.
c. What is the sampling distribution of the F-statistic under the null hypothesis?
d. What does P measure?
e. In words, explain what each "sum of squares" measures.
f. If you wanted to determine the fraction of the variation in generations persisted that is "explained" by treatment, what quantity would you use?
g. Calculate the quantity identified in part (f).

6. The Tukey-Kramer procedure was carried out on the results of Practice Problems 4 and 5, yielding the results in the accompanying table.

TABLE FOR PROBLEM 6

Group i      Group j      Y¯i−Y¯j   SE      q       q0.05,4,12   Conclusion
Medium       Isolated     5.25      1.623   3.234   2.97         Reject H0
Medium       Long         3.75      1.623   2.310   2.97         Do not reject H0
Medium       Continuous   1.75      1.623   1.078   2.97         Do not reject H0
Continuous   Isolated     3.50      1.623   2.156   2.97         Do not reject H0
Continuous   Long         2.00      1.623   1.232   2.97         Do not reject H0
Long         Isolated     1.50      1.623   0.924   2.97         Do not reject H0

a. What are the null and alternative hypotheses being tested?
b. Are these comparisons considered planned or unplanned? Why?
c. Only the largest pairwise difference between means, that between the "medium" and "isolated" treatments, is statistically significant. How is this possible, given that neither of these two means is significantly different from the means of the other two groups?
d. Using symbols, summarize the results of the Tukey-Kramer tests on the graph you created in part (d) of Practice Problem 4.
e. Explain in words why the critical value for each test (2.97) is larger than the critical value of the t-distribution having 12 degrees of freedom (2.18).

7. Imagine a hypothetical experiment with multiple treatments but relatively small sample sizes within treatments. The goal is to test whether the treatment means are equal. Calculations and graphical analysis indicate that the data differ markedly from a normal distribution.
a. What other two options are available to carry out a test?
b. Which of the other two options should be attempted first? Why?

8. Finding the causes of the major mental illnesses, schizophrenia and bipolar disorder, is a major activity in medical research. Tkachev et al. (2003) compared expression levels of several genes involved in the production and activity of myelin, a tissue important in nerve function, in the brains of 15 persons with schizophrenia, 15 with bipolar illness, and 15 control individuals. The results presented in the accompanying table summarize the expression levels of the proteolipid protein 1 gene (PLP1) in the 45 brains.

TABLE FOR PROBLEM 8

Group           Raw data (normalized units)                                        Mean     Standard deviation
Control         −0.02, −0.27, −0.11, 0.09, 0.25, −0.02, 0.48, −0.24, 0.06,        −0.004   0.218
                0.07, −0.30, −0.18, 0.04, −0.16, 0.25
Schizophrenia   −0.10, −0.31, −0.05, 0.11, −0.38, 0.23, −0.23, −0.28, −0.36,      −0.195   0.182
                −0.22, −0.40, −0.19, −0.34, −0.29, −0.12
Bipolar         −0.34, −0.39, −0.22, −0.32, −0.32, −0.05, −0.43, −0.33, −0.41,    −0.263   0.151
                −0.36, −0.25, −0.29, 0.06, −0.30, 0.01

a. The main objective of the study was to compare PLP1 gene expression in persons having schizophrenia with that of control individuals. Using a planned comparison approach, compute a 95% confidence interval for the difference between the means of these two groups.
b. Is a planned comparison appropriate in part (a)? Explain.
c. What are your assumptions in part (a)?

9. Use the data from Practice Problem 8 to solve this problem.
a. Test whether mean PLP1 gene expression differs among the schizophrenia, bipolar, and control groups.
b. What are your assumptions in part (a)?
c. Is the analysis in part (a) a random-effects or fixed-effects ANOVA? Explain.
d. What quantity would you use to describe the fraction of the variation in expression levels explained by group differences? Calculate this quantity for the data from Practice Problem 8.
e. What method would you use after ANOVA to determine which group means were different from each other?

10. The bright yellow head of the adult Egyptian vulture (see the photo at the beginning of the chapter) requires carotenoid pigments. These pigments cannot be synthesized by the vultures, so they must be obtained through their diet. Unfortunately, carotenoids are scarce in rotten flesh and bones, but they are readily available in the dung of ungulates. Perhaps for this reason, Egyptian vultures are frequently seen eating the droppings of cows, goats, and sheep in Spain, where they have been studied.14 Ungulates are common in some areas but not in others. Negro et al. (2002) measured plasma carotenoids in wild-caught vultures at four randomly chosen locations in Spain as part of a study to determine the causes of variation among sites in Spain in carotenoid availability.

Site   Mean concentration (µg/ml)   Standard deviation   n
1      1.86                         1.22                 22
2      5.75                         2.46                 72
3      6.44                         3.42                 77
4      11.37                        1.96                 11

a. Use the data provided in the table to test whether the mean plasma concentration of carotenoids in wild Egyptian vultures differs among sites.
b. What are the assumptions of your analysis in part (a)?

11. One way to assess whether a trait in males has a genetic basis is to determine how similar measurements of that trait are among a male's offspring born to different, randomly chosen females. In a lab experiment, Kotiaho et al. (2001) randomly sampled 12 male dung beetles, Onthophagus taurus, and mated each of them to three different virgin females. The average body-condition score of offspring born to each of the three females is listed for each male in the following table.

Male   Offspring body-condition scores
1      0.82, 0.44, 0.92
2      0.35, 0.19, 1.39
3      0.12, 0.84, 0.16
4      0.49, 0.59, −0.23
5      0.44, 0.33, 0.07
6      0.00, 0.29, 0.30
7      0.69, −0.49, −0.60
8      0.13, −0.43, 0.66
9      0.21, −0.34, −0.09
10     −0.35, −1.40, −0.45
11     −1.04, −0.34, −0.82
12     −1.27, −0.75, −0.86

ANOVA was used to test whether males differed in the mean condition of their offspring using the three measurements for each male. The results were SSgroups = 9.940 and SSerror = 4.682.
a. Is this a random-effects or a fixed-effects ANOVA? Explain the reason behind your answer.
b. What is the repeatability of the offspring condition of males mated to different females?15

12. Imagine that you are a statistical consultant and a researcher comes to you with a problem. The researcher has a random sample of data from each of six groups, and she wants to test whether the means of the six groups differ. In each of the following situations, recommend the best procedure.
a. The data are approximately normally distributed with equal standard deviations in all groups. Sample size is small.
b. The data are not normally distributed and do not have equal standard deviations in all groups. Sample size is small.
c. The data are not normally distributed. Groups have nearly equal standard deviations.
Sample size is large and equal among groups.
d. The null hypothesis of no difference among group means was rejected. Now the researcher wants to find out which means are different. The data are approximately normally distributed with equal standard deviations in all groups, and the sample size is small.

13. Pea aphids, Acyrthosiphon pisum, can be red or green. Weirdly, red aphids make carotenoids (red pigments) with genes that jumped from a fungus into the aphid genome some time during recent evolutionary history. What’s more, some red aphids start out red and then change to green later in life. Observation suggested that color changers were infected with a bacterium, Rickettsiella. To test whether Rickettsiella was the cause of color change, Tsuchida et al. (2010) experimentally injected the bacterium into a sample of red aphids. The data below are color measurements of genetically identical and bacteria-free red aphids that were either uninjected (original), injected successfully with Rickettsiella (infected), or injected but the bacterium failed to establish (uninfected). Color was measured as hue angle in degrees; for these data, small angles indicate red, whereas larger angles represent green.

Original: 30.7, 25.4, 26.2, 23.0, 20.9, 20.7, 15.8, 17.4, 17.6, 17.0, 16.5, 15.3
Uninfected: 25.2, 22.3, 18.5, 15.4, 15.3, 17.0, 16.6, 18.6, 19.0
Infected: 43.3, 42.3, 40.7, 41.2, 39.6, 39.5, 36.2, 36.2, 34.4, 30.7, 31.9

a. Show the data in a graph. What trend is suggested?
b. By eye, describe how the data might depart from the assumptions of ANOVA.
c. The data were analyzed using a Kruskal-Wallis test. What are the null and alternative hypotheses for this test?
d. The results of the test were as follows: H = 21.1. What is the conclusion?
e. H is not calculated directly from the original measurements. What does it use instead?
f. Under what assumption are we able to use the results of the Kruskal-Wallis test to conclude that the means differ among the three groups? Is this assumption met here?

14. Tsetse flies are the vectors of human sleeping sickness and animal trypanosomiasis in Africa. The tsetse fly species Glossina palpalis feeds on the blood of a variety of animals, including humans, and an important question is whether the feeding preferences of individuals can be affected by learning. To investigate this, Bouyer et al. (2007) provided cohorts of male tsetse flies with a first blood meal of either cows or lizards. After two days, the flies were offered a second blood meal of cows only. The data below measure the proportion of flies in each cohort that took a meal from the cows (the remaining individuals chose not to feed).

Treatment: first blood meal   Proportion of flies taking second meal from cow
Lizard                        0.66, 0.58, 0.52, 0.37, 0.35, 0.34, 0.29
Cow                           1.00, 0.98, 0.97, 0.96, 0.87, 0.83

a. Display the results of the study in a graph.
b. What assumptions must be met before using ANOVA to test for differences between the two treatment groups in the mean proportion of flies taking the second blood meal from cows? In view of your results in part (a), do you see a problem meeting these assumptions?
c. Consider using a transformation to fix the problems identified in part (b). Given the data, what transformation should be attempted first? Try this transformation on the data. Did it fix the problems?
d. Using the transformed data, test whether the means of the two blood-meal groups are different.
15. In a field experiment designed to investigate the role of genetic diversity in ecosystems, Reusch et al. (2005) planted eelgrass in plots in a shallow estuary in the Baltic Sea. Eighteen eelgrass shoots were planted in every plot. Some plots (randomly chosen) were planted with only one eelgrass genotype, others were planted with three genotypes, and others were planted with six different eelgrass genotypes. At the end of the experiment, the total number of shoots was counted in each plot. The results from 32 plots are given in the following table:

Treatment (number of genotypes)   Number of shoots at end of experiment
1                                 11, 14, 21, 27, 28, 30, 32, 36, 38, 49, 61, 71
3                                 20, 35, 36, 41, 46, 47, 52, 53, 58, 67
6                                 31, 45, 45, 47, 48, 62, 64, 69, 84, 86

a. What are the hypotheses for the test?
b. Carry out the test using ANOVA. Summarize your results in an ANOVA table.
c. What assumptions have you made?
d. Is this a fixed-effects or random-effects ANOVA? Explain your reasoning.

ASSIGNMENT PROBLEMS

16. Tukey-Kramer tests carried out on the results in Practice Problem 15 yielded the accompanying table of results. Groups refer to the number of genotypes in the corresponding treatment.
a. Fill in the conclusions in the table.
b. Write the sample means of the three treatment groups, from smallest to largest, and use symbols to indicate the groups to which the means belong. Summarize the result in words.

Group i   Group j   Ȳi − Ȳj   SE     q      q0.05,k,N−k   Conclusion
6         1         23.26     7.13   3.26   2.47
6         3         12.60     7.45   1.69   2.47
3         1         10.67     7.13   1.50   2.47

c. Are these planned or unplanned comparisons? Explain.
d. Why not use a series of two-sample t-tests instead of the Tukey-Kramer method?
e. In the preceding analysis, what is the probability of making at least one Type I error during the course of carrying out all of the pairwise tests of differences between means?

17. Dormant eggs of the zooplankton Daphnia survive in lake sediments for decades, making it possible to measure their physiological traits in past years. Hairston et al. (1999) extracted Daphnia eggs from sediment cores of Lake Constance in Europe to examine trends in resistance to dietary cyanobacteria, a toxic food type that has increased in density since 1960 in response to increased nutrients in the lake. The data and accompanying histogram give the resistance level of 32 Daphnia clones, each initiated from single eggs extracted from deposits laid down during years of low, medium, and high cyanobacteria density between 1962 and 1997. Resistance is the average growth rate of individuals fed cyanobacteria divided by the growth rate when individuals from the same clone are fed a high-quality algal food instead. We wish to test whether resistance differs among Daphnia clones from the three cyanobacteria density groups.

Cyanobacterium density   Measurements of resistance
Low                      0.56, 0.57, 0.58, 0.62, 0.64, 0.65, 0.67, 0.68, 0.74, 0.78, 0.85, 0.86
Medium                   0.70, 0.74, 0.75, 0.76, 0.78, 0.79, 0.80, 0.82, 0.83, 0.86
High                     0.65, 0.73, 0.74, 0.76, 0.81, 0.82, 0.85, 0.86, 0.88, 0.90

a. Examine the histograms of the data. Give the two main reasons why caution is warranted before using ANOVA to test for differences among group means.
b. The data were analyzed using a Kruskal-Wallis test. What are the null and alternative hypotheses for this test?
c. The results of the test were as follows: H = 8.20. What is the conclusion?
d. H is not calculated directly from the original measurements. What does it use instead?
e. Under what assumption would we be able to use the results of the test to draw a conclusion about whether the means or medians are the same in the three groups? Is this assumption met here?

18. When using analysis of variance, what are the main advantages of the following factors?
a. Large sample size
b. Balanced design

19. Examine Figure 15.4-1 (p. 476), which shows means and standard errors for three treatment groups. If this graph was to be the only graph to display the results in a scientific report, what additional feature(s) would you recommend to ensure it follows the principles of good graph design?

20. As the “baby boom” generation ages, interest in finding treatments that extend life span has surged. Experimental research mainly uses nonhuman animals. In a recent experiment, Evason et al. (2005) tested the influence of the anticonvulsant medication trimethadione on the life span of the nematode worm, Caenorhabditis elegans. The study compared the effects of three trimethadione treatments (provided at the larval stage, at the adult stage, and at both stages) and a water treatment (the control). The resulting life spans are shown in the following histograms for 50 worms in each treatment. Assume that each worm was treated independently. The data are available at whitlockschluter.zoology.ubc.ca.
a. ANOVA would be the preferred method to compare the means of each group. What problem or problems do you see in applying this method to the data shown in the preceding histograms?
b. The data were analyzed with a Kruskal-Wallis test. What are the null and alternative hypotheses?
c. The Kruskal-Wallis test yielded the following rank sums.

Life stage of trimethadione treatment
None (water)   Larval stage   Adult stage   Both stages
4201           3672           6003.5        6223.5

H = 29.27. Can we conclude from this result that the means of the treatment groups are unequal? Explain.

21. An observational study gathered data on the rate of progression of multiple sclerosis in patients diagnosed with the disease at different ages. Differences in the mean rate of progression were tested among several groups that differed by age-of-diagnosis using ANOVA. The results gave P = 0.12. From the following list, choose all of the correct conclusions that follow from this result (Borenstein 1997). Explain the basis of your answers.
a. The mean rate of progression does not differ among age groups.
b. The study has failed to show a difference among means of age groups, but the existence of a difference cannot be ruled out.
c. If a difference among age groups exists, then it is probably small.
d. If the study had included a larger sample size, it probably would have detected a significant difference among age groups.

22. Head width was measured twice, in cm, on the random sample of 25 walking-stick insects described in Example 15.6. The data are as follows:

Specimen   Head width (cm)
1          0.15, 0.15
2          0.18, 0.18
3          0.17, 0.18
4          0.21, 0.21
5          0.15, 0.16
6          0.19, 0.19
7          0.18, 0.17
8          0.18, 0.23
9          0.18, 0.17
10         0.19, 0.21
11         0.17, 0.20
12         0.19, 0.21
13         0.16, 0.18
14         0.15, 0.18
15         0.19, 0.21
16         0.18, 0.19
17         0.15, 0.15
18         0.17, 0.17
19         0.20, 0.21
20         0.18, 0.18
21         0.18, 0.16
22         0.17, 0.18
23         0.16, 0.13
24         0.18, 0.16
25         0.17, 0.17

a. Use ANOVA calculations to estimate the variance within groups for head width.
b. Calculate the estimate of the variance among groups.
c. What is the repeatability of the head-width measurements?
d. Compare your result in part (c) with that for femur length analyzed in Example 15.6. Which trait has higher repeatability?
Which trait is more affected by measurement error?

23. The accompanying table presents mean cone size (mass) of lodgepole pine in 16 study sites in three types of environments in western North America (Edelaar and Benkman 2006). The three environments were islands of lodgepole pines in which pine squirrels were absent (an “island” here refers to a patch of lodgepole pine surrounded by other habitat and separated from the large tracts of contiguous lodgepole pine forests), islands with squirrels present, and sites within the large areas of extensive lodgepole pines (“mainland”) that all have squirrels.

Habitat type                  Raw data (g)                     Mean   SD
Island, squirrels absent      9.6, 9.4, 8.9, 8.8, 8.5, 8.2     8.90   0.53
Island, squirrels present     6.8, 6.6, 6.0, 5.7, 5.3          6.08   0.62
Mainland, squirrels present   6.7, 6.4, 6.2, 5.7, 5.6          6.12   0.47

The main comparison of interest in this study, identified before the data were gathered, was the comparison between islands with and without squirrels, because this comparison controls for any effects of forest isolation on the mass of lodgepole pine cones.
a. What do we label this type of comparison?
b. Taking into account the type of comparison identified in part (a), calculate a 95% confidence interval for the difference in cone mass between islands with and without squirrels. Assume that sites were randomly sampled.
c. Using these data, carry out a test of the differences among the means of all three groups.

24. People with an autoimmune disease like lupus produce antibodies that react to their own tissues. Research has shown that lupus-prone strains of mice have B cells with reduced expression levels of the receptor gene FcγRIIB, suggesting that low expression of this gene might contribute to the autoimmune reaction. To test this, McGaha et al. (2005) experimentally enhanced expression of the FcγRIIB gene in bone marrow taken from a lupus-prone mouse strain and transplanted it back into irradiated mice of the same strain. Other mice of the same strain were subjected to the same procedures but received bone marrow that was not enhanced for FcγRIIB expression (i.e., the sham treatment). Mice in a third group were left untreated. Autoimmune reactivity was measured six months later. The following is a frequency table indicating the highest dilution of blood serum, in a fixed series of dilutions, at which reactivity could be detected (a high dilution reflects a high autoimmune reactivity).

Number of mice
Dilution   Enhanced   Sham-treated   Untreated
100        3          0              0
200        4          0              0
400        2          3              2
800        0          2              4
Total      9          5              6

a. Is this a balanced design? Explain.
b. What distinct purposes do the sham-treated and untreated groups serve in this experiment?
c. Calculate the mean and standard deviation of dilution measurements in each group.
d. We would like to test whether the mean dilutions of the three groups are different. Based on your answers to parts (a) and (c), why should we be cautious about employing ANOVA?
e. Choose a transformation that overcomes the main difficulty in part (d). Display your resulting sample means and standard deviations.
f. Using the transformed data, test whether there is a difference among treatment groups in the mean of dilution measurements.
g. What method would we use next to help decide which group means differed from the others?

25. Huey and Dunham (1987) measured the running speed of fence lizards, Sceloporus merriami, in Big Bend National Park in Texas. Individual lizards were captured and placed in a 2.3-meter raceway, where their running speeds were measured.
Lizards were then tagged and released. The researchers returned to the park the following year, captured many of the same lizards again, and measured their sprint speed in the same way as previously. The pair of measurements for 34 individual lizards is as follows:

Lizard   Sprint speed (m/s)
1        1.43, 1.37
2        1.56, 1.30
3        1.64, 1.36
4        2.13, 1.54
5        1.96, 1.82
6        1.89, 1.79
7        1.72, 1.72
8        1.80, 1.80
9        1.87, 1.87
10       1.61, 1.88
11       1.60, 1.98
12       1.71, 2.08
13       1.83, 2.16
14       1.92, 2.08
15       1.90, 2.01
16       2.06, 2.03
17       2.06, 1.97
18       2.28, 2.05
19       2.44, 1.92
20       2.23, 2.12
21       2.53, 2.11
22       2.20, 2.22
23       2.16, 2.27
24       2.25, 2.39
25       2.42, 2.33
26       2.61, 2.33
27       2.62, 2.39
28       3.09, 2.17
29       2.13, 2.54
30       2.44, 2.63
31       2.76, 2.69
32       2.96, 2.64
33       3.13, 2.81
34       3.27, 2.88

a. With these data, calculate the repeatability of running speed.
b. What does repeatability measure?

26. Mosquitoes contribute to more human deaths than any other organism, because they transmit diseases such as malaria, dengue fever, and yellow fever. Some of these diseases develop or grow inside the mosquito—a process that can take some time. Therefore, one possible strategy to reduce transmission of disease is to cause mosquitoes to die slightly sooner, leaving insufficient time for the disease to develop. Fang et al. (2011) tested the idea by infecting mosquitoes with a fungus (Metarhizium anisopliae) that reduces the life span of the insect. In addition, they developed a transgenic strain of fungus that carries a gene for scorpine, a protein from scorpion venom known to inhibit the gamete stages of malaria. They compared three groups of mosquitoes: a “control” group that was not treated with fungus, a “wild type” group that was infected with unmodified fungus, and a “scorpine” group that was infected with the transgenic fungus. Each mosquito was infected with malaria. The response variable was the log number of sporozoites (infectious cells of malaria) in the salivary glands of the mosquitoes. Here are the data:

Control: 7.2, 7.4, 7.4, 7.7, 7.9, 7.9, 8.0, 8.2, 8.3, 8.4, 8.4, 8.5, 9.1, 9.2, 9.2
Wild type: 5.6, 6.5, 6.7, 7.0, 7.5, 7.9, 7.9, 8.0, 8.0, 8.2, 8.4, 9.0, 9.1, 9.0, 9.1
Scorpine: 0.0, 4.4, 5.3, 5.6, 4.1, 5.3, 5.9, 6.0, 6.0, 6.1, 6.2, 7.0, 7.5

a. Show the data in a graph. What pattern is suggested?
b. Examine the frequency distributions of the data. What statistical approach would be the most appropriate to determine whether these treatments vary in their number of sporozoites? Why?

27. Does adding math to a scientific paper make readers think that it has more value? Eriksson (2012) sent two abstracts of scientific papers to 200 people with postgraduate degrees. For each participant, one of the abstracts was randomly chosen and had a meaningless sentence inserted describing an unrelated mathematical model, while the other had no mathematical addition. The sentence had no conceptual connection to the subject matter of the abstract; it was just meaningless mathematics in that context. Participants were asked to rate the quality of the research in each abstract on a scale from 1 to 100, and the differences between the scores of their two abstracts—score of the abstract with math minus score of abstract without math—were recorded. Participants were also asked for the subject matter of their postgraduate degree: math, science, technology (MST); medicine (M); humanities, social science (HS); or other (O).
A box plot of the data and summaries of the results for each group are given below; the full data set can be found at whitlockschluter.zoology.ubc.ca.

Degree subject   Mean score difference   SD score difference   n
MST              −1.28                   19.24                 69
M                3.06                    15.99                 16
HS               6.60                    21.15                 84
O                13.90                   23.31                 31

a. Examine the graph and judge by eye how well the data likely fit the assumptions of ANOVA.
b. Test whether the subject background of the participants affected how much the added math changed their views of the abstracts on average.
c. Is the relationship between degree subject and score difference strong? Answer using R2.

28. The parasitoid wasp, Leptopilina heterotoma, injects eggs into young larvae of fruit flies, Drosophila melanogaster. One reaction by the flies is to self-medicate by consuming alcohol (ethanol), which is naturally present in the decaying fruits where they live. The ethanol reduces oviposition by wasps, and it increases death rates of wasp larvae within parasitized flies. Kacsoh et al. (2013) investigated whether the presence of the wasp influences where female fruit flies prefer to lay their eggs. They presented female flies in cages with two dishes of fly food, one having 6% ethanol and the other with 0% ethanol. They recorded the proportion of eggs laid in the 6% ethanol dish when females were placed with female wasps, with male wasps, or with no wasps. The data below give the proportion of eggs laid in the ethanol dish for multiple replicates of each wasp treatment.

No wasp: 0.25, 0.40, 0.46, 0.44
Male wasp: 0.42, 0.47, 0.31, 0.52
Female wasp: 0.89, 0.83, 0.92, 0.93

a. Proportion data often show differences in standard deviations between groups that differ in the mean proportion (tending to be smaller in groups whose means are close to 0 or 1). Do these data show such a trend? To answer, make a table of the means and standard deviations of groups (include sample sizes).
b. Carry out a transformation suitable for proportion data and then make a new table. Does the transformation reduce heterogeneity among groups in the standard deviation?
c. Test the null hypothesis of no differences among group means using the transformed data.
d. What fraction of the variation in the response variable is explained by treatment?

29. Refer to Assignment Problem 28.
a. Illustrate the (transformed) data in a graph. Add means and error bars showing standard errors of means.
b. The accompanying table (Table for Problem 29, below) shows partial results of Tukey-Kramer multiple comparisons of means. Complete the table by adding the test conclusions.
c. Use symbols to illustrate the results of the Tukey-Kramer test. Add the symbols to your graph in part (a) to show which means group together on the basis of statistical significance.

30. Fiddler crabs are so called because males have a greatly enlarged “major” claw, which is used to attract females and to defend a burrow. Darnell and Munguia (2011) recently suggested that this appendage might also act as a heat sink, keeping males cooler while out of the burrow on hot days. To test this, they placed four groups of crabs into separate plastic cups and supplied a source of radiant heat (60-watt light bulb) from above. The four groups were intact male crabs; male crabs with the major claw removed; male crabs with the other (minor) claw removed (control); and intact female fiddler crabs. They measured body temperature of crabs every 10 minutes for 1.5 hours. These measurements were used to calculate a rate of heat gain for every individual crab in degrees C/log minute.
Rates of heat gain for all crabs are provided below.

Female: 1.9, 1.6, 1.4, 1.1, 1.6, 1.8, 1.9, 1.7, 1.5, 1.8, 1.7, 1.7, 1.8, 1.7, 1.8, 2.0, 1.8, 1.7, 1.6, 1.6, 1.5
Intact male: 1.9, 1.2, 1.0, 0.9, 1.4, 1.0, 1.3, 1.4, 1.1, 1.0, 1.4, 1.2, 1.4, 1.4, 1.5, 1.5, 1.1, 1.4, 1.3, 1.3, 1.3
Male minor removed: 1.2, 1.0, 0.9, 0.8, 1.2, 0.9, 1.1, 1.1, 1.3, 1.3, 1.3, 1.1, 1.4, 1.5, 1.4, 1.4, 1.2, 1.4, 1.3, 1.2, 1.4
Male major removed: 1.2, 0.9, 1.4, 1.2, 1.2, 1.6, 1.9, 1.4, 1.4, 1.4, 1.6, 1.4, 1.7, 1.3, 1.5, 1.2, 1.3, 1.6, 1.5, 1.5, 1.5

a. Show these data in a graph. What trends are suggested?
b. Use ANOVA to test whether mean rate of heat gain differs among groups.

31. Refer to Assignment Problem 30 on fiddler crab claws.
a. The main comparison of interest, which was identified before carrying out the experiment, was to test for a difference between the two male groups “Major removed” and “Minor removed.” What test method is justified in this case?
b. The accompanying table (Table for Problem 31, below) shows partial results of Tukey-Kramer multiple comparisons of means. In what way does this method differ from the method identified in part (a)?
c. Complete the table by adding the test conclusions.
d. Use symbols to illustrate the results of the Tukey-Kramer test. Describe in words which population means are grouped together based on statistical significance.

TABLE FOR PROBLEM 29
Group i   Group j   Ȳi − Ȳj   SE       q      q0.05,k,N−k   Conclusion
Female    No        0.572     0.0626   9.13   2.79
Female    Male      0.527     0.0626   8.42   2.79
Male      No        0.044     0.0626   0.71   2.79

TABLE FOR PROBLEM 31
Group i       Group j       Ȳi − Ȳj   SE       q       q0.05,k,N−k   Conclusion
Female        Minor rem.    0.4667    0.0642   7.263   2.62
Female        Intact male   0.3905    0.0642   6.077   2.62
Female        Major rem.    0.2619    0.0642   4.076   2.62
Major rem.    Minor rem.    0.2048    0.0642   3.187   2.62
Major rem.    Intact male   0.1286    0.0642   2.001   2.62
Intact male   Minor rem.    0.0762    0.0642   1.186   2.62

32. The graphs at the right are from a study investigating hippocampal volume loss in 107 patients with drug-resistant epilepsy (Cook et al. 1993). The graphs depict the association between hippocampal volume loss (measured using MRI as the volume of the smaller half of the hippocampus divided by the volume of the larger half, expressed as a percentage) and patient history. Patients were grouped on the basis of whether they had a record of childhood febrile seizures (CFS), childhood non-febrile seizures (no CFS), and no childhood seizures.
a. Which accompanying graph, the box plot (top) or the bar graph (bottom; indicating means and SEs), best depicts the patterns in the data? Why?
b. Which statistical method would you recommend to test whether groups differed in hippocampal volume loss? Why?

INTERLEAF 9
Experimental and statistical mistakes

Science is mostly done by intelligent people who want to get it right. There are cases of outright fraud, but these are somewhat rare (Panel on Scientific Responsibility and the Conduct of Research, 1992; Fanelli 2009). On average, scientists are hard-working and careful, and they pay attention to details. But sometimes things don’t go as planned. Sometimes replicates get lost, vials get swapped, labels get blurred, tired hands write down the wrong numbers, the wrong button is pushed, or the wrong statistical test is applied. Many of these mistakes are caught by the researchers or by the peer review process, but not all. Sometimes, as Richard Nixon once said, “Mistakes were made.” “Experimental mistakes” are errors made during the process of an experiment, in which the protocol actually followed was not the one intended.
No one knows how frequently important mistakes are made in research, but we do know that they happen. For example, a study of the dopamine neurotoxicity of MDMA (“ecstasy”) found significant toxic effects in the brains of 15 monkeys (Ricaurte et al. 2002). Before this study, this sort of brain damage produced by MDMA was thought to be unlikely, whereas brain damage was a well-known side effect of using methamphetamine. The Ricaurte et al. (2002) paper was published in Science, a high-profile scientific weekly. But as it turned out, the lab that did the study had received its shipment of MDMA at the same time that it had received some methamphetamine as well, and the labels on the two bottles were swapped! Unbeknownst to them, the entire experiment had been done with methamphetamine rather than MDMA (Ricaurte 2003). The mistake was caught by careful follow-up work by this lab, but most mistakes of this type would not have been caught. Fundamental experimental errors sometimes make their way into the literature, even in the most prominent journals.

Life is like a sewer. What you get out of it depends on what you put into it.
—Tom Lehrer

It is difficult, if not impossible, to know how often experimental mistakes are made or how important they are. It is a little easier to investigate how often statistical mistakes are made when analyzing the data. A few studies have looked at the rates of statistical errors in published papers. The number of mistakes that are found is sobering. Surveys of papers in medical journals routinely find that one-third to one-half of papers that use statistics make at least one minor mistake (Gore et al. 1977; Kanter and Taylor 1994; McGuigan 1995). This is likely to be an underestimate, since not all statistical mistakes made are detectable in the actual published paper. More importantly, though, about 8% of medical papers make statistical mistakes important enough to alter the conclusions of the paper (Gore et al. 1977). These statistical errors are not limited to the medical literature. Even in a field like ecology, which prides itself on its sophisticated use of statistics, a survey found statistical mistakes in about half of the papers (Hurlbert and White 1993). These mistakes are sometimes jokingly referred to as “Type III errors.”

It is interesting to compare the 8% rate at which conclusions are changed by statistical mistakes to the 5% rate of Type I error that we normally tolerate. If we demand this level of confidence against errors of chance, we should also keep in mind the relatively high rates of other errors that might affect the result. It is one more reason why all studies should be repeated.

The moral is, be careful when doing science and when reading about science. Trust the authors to have done a good job, but don’t expect their work to be perfect. Watch for mistakes. Very often, great scientific advances come from spotting, and fixing, the mistakes of our predecessors.

16
Correlation between numerical variables

When two numerical variables are associated, we say that they are correlated. For example, brain size and body size are positively correlated across mammal species, as the graph1 on the next page demonstrates. Large-bodied species tend to have large brains, and small-bodied species tend to have small brains.

The correlation coefficient is a quantity that describes the strength and direction of an association between two numerical variables measured on a sample of subjects or units.
Correlation reflects the amount of “scatter” in a scatter plot of the two variables. Unlike linear regression (Chapter 17), correlation fits no line to the data. It does not measure how steeply one variable changes when the other is varied. In this chapter, we show how to measure a correlation, put confidence bounds on estimates of a correlation coefficient, and test hypotheses about correlation.

Estimating a linear correlation coefficient

The linear correlation coefficient2 measures the tendency of two numerical variables (call them X and Y) to co-vary—that is, to change together along a line. We use the lowercase Greek letter ρ (rho, pronounced “row”3) to represent the correlation between X and Y in the population. We use r to represent the correlation between X and Y in a sample taken from the population.

The correlation coefficient measures the strength and direction of the association between two numerical variables.

The correlation coefficient

The formula for the sample correlation coefficient r has three parts, two of which may look familiar and one of which is new:

$$r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2}\,\sqrt{\sum_i (Y_i - \bar{Y})^2}}.$$

The term in the numerator of the formula is called the sum of products,4 and it measures how deviations in X and Y vary together. A deviation is the difference between an observation and its mean (Figure 16.1-1). If an observation i is above the mean for both X and Y (upper right corner of the figure), then its deviations are both positive and so the product of its deviations $(X_i - \bar{X})(Y_i - \bar{Y})$ is a positive number. If an observation is below the mean in both X and Y (lower left corner of the figure), then both its deviations are negative, and so the product of the deviations $(X_i - \bar{X})(Y_i - \bar{Y})$ is again positive. Observations lying in the other two corners of the plane have a positive deviation for one variable and a negative deviation for the other, so they have a negative product of deviations. The sum of products adds all the products of deviations. This sum will be positive if most of the observations are in the lower left and upper right corners of the plane, like those shown in Figure 16.1-1. The sum will be negative if most observations lie in the upper left and lower right corners. If the scatter of observations fills all four corners of the plane, then the positive and negative values cancel in the sum, yielding a sum of products close to zero.

FIGURE 16.1-1 The position of X and Y observations in relation to the means, X̄ and Ȳ (indicated by an open square at the intersection of the two dashed lines). Observations lying in the upper right corner are above both X̄ and Ȳ, and so have positive deviations in X and Y. Those lying in the lower left corner fall below both X̄ and Ȳ, and so have negative deviations in X and Y. In the other two corners, observations have a positive deviation for one trait and a negative deviation for the other.

The denominator in the formula for r includes the sums of squares for X and for Y (under square-root signs). You will have calculated quantities like these in Section 3.1 as part of calculating a standard deviation.5 The population correlation coefficient, ρ, is calculated using the same formula as r except it is measured on all individuals in the population. The correlation coefficient ρ and its sample estimate r lie between −1 and 1. The correlation coefficient has no units, which means it is readily interpretable whatever the variables (provided they are numerical).
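The three parts of the formula map directly onto code. Here is a minimal sketch—Python with NumPy is our choice for the sketches in this chapter, and the data below are invented purely for illustration—that computes the sum of products, the two sums of squares, and then r:

```python
import numpy as np

def linear_correlation(x, y):
    """Sample linear correlation coefficient r, built from its three parts."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dev_x = x - x.mean()                     # deviations of X from the sample mean
    dev_y = y - y.mean()                     # deviations of Y from the sample mean
    sum_of_products = np.sum(dev_x * dev_y)  # numerator: the sum of products
    ss_x = np.sum(dev_x ** 2)                # sum of squares for X
    ss_y = np.sum(dev_y ** 2)                # sum of squares for Y
    return sum_of_products / (np.sqrt(ss_x) * np.sqrt(ss_y))

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 1.9, 3.6, 3.2, 4.8]
print(linear_correlation(x, y))              # same value as np.corrcoef(x, y)[0, 1]
```

In practice, np.corrcoef(x, y)[0, 1] returns the same value in one call; spelling out the pieces simply mirrors the formula.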
A negative correlation means that one variable decreases as the other increases, whereas a positive correlation means that both variables increase and decrease together (Figure 16.1-2).

FIGURE 16.1-2 Scatter plots illustrating correlations between two numerical variables. In each case, n = 25.

The strongest correlations (i.e., r = 1.0 or r = −1.0) occur when all points lie along a straight line (e.g., the left panel in Figure 16.1-3), which is why we refer to the correlation as linear. Two variables might be strongly associated yet have no correlation (i.e., r = 0) if the relationship between them is nonlinear (the right panel in Figure 16.1-3). Example 16.1 illustrates the use of the correlation coefficient.

FIGURE 16.1-3 Correlation between two variables that are strongly associated. On the left, measurements of X and Y lie along a straight line, producing the maximum correlation possible (r = 1). On the right, the relationship is nonlinear and exhibits no correlation (r = 0).

EXAMPLE 16.1 Flipping the bird

Adults who mistreat children were often the target of maltreatment themselves when they were young. Does a similar association occur in nonhuman animals, where the causes might be more readily studied? Müller et al. (2011) investigated this possibility in the Nazca booby (Sula granti), a colonial nesting seabird of the Galápagos Islands. Unattended chicks in nests frequently receive visits from unrelated adults, who behave mainly aggressively toward them. The researchers counted the number of such visits to nests of 24 booby chicks. These chicks were given unique numbered rings on their legs, which allowed the researchers to observe their behavior years later when they had become adults. The first variable in Table 16.1 gives the number of non-parent adult visits experienced by the 24 focal birds while they were growing up in the nest. The second variable measures the number of visits to nests of unrelated chicks by these same birds when they were adults. The second variable has been corrected for other variables measured by the researchers and so is not on the same scale as the first variable.

TABLE 16.1 Number of non-parent adult visits experienced by boobies as nestlings compared to the number of similar behaviors performed by the same birds when an adult. n = 24.

Number of visits   Future aggressive behavior
1                  −0.80
7                  −0.92
15                 −0.80
4                  −0.46
11                 −0.47
14                 −0.46
23                 −0.23
14                 −0.16
9                  −0.23
5                  −0.23
4                  −0.16
10                 −0.10
13                 −0.10
13                 0.04
14                 0.13
12                 0.19
13                 0.25
9                  0.23
8                  0.15
18                 0.23
22                 0.31
22                 0.18
23                 0.17
31                 0.39

A scatter plot of the data, shown in Figure 16.1-4, suggests that previous experience of nestlings at the hands of adult boobies is positively associated with the behavior of the same birds when they become adults themselves, although the association does not seem strong. The larger the number of visits received as nestlings, the more the birds perform such events toward unrelated nestlings when they are adults. Correlation is appropriate to measure the strength of this association, since the data consist of two numerical variables measured on a sample of individuals. By itself this association does not imply that one variable is the cause of the other, however—correlation is not causation (Interleaf 4). Further studies including experiments would be needed to establish causation.

FIGURE 16.1-4 Scatter plot of the relationship between the number of visits experienced by nestling boobies and the future behavior of the same individuals as adults.
n = 24.

How strong is the association between past experience and future booby behavior? The correlation coefficient r quantifies the strength of the association between the two variables. Let’s use X to refer to the variable “number of visits” by the boobies, and Y to refer to their future aggressive behavior. To calculate r, we obtained the following three quantities:

$$\sum_i (X_i - \bar{X})(Y_i - \bar{Y}) = 33.086, \quad \sum_i (X_i - \bar{X})^2 = 1194.625, \quad \sum_i (Y_i - \bar{Y})^2 = 3.217,$$

yielding

$$r = \frac{33.086}{\sqrt{1194.625}\,\sqrt{3.217}} = 0.534.$$

The sample correlation coefficient r between the two variables is 0.534.

Standard error

The data are merely a sample taken to estimate the correlation between the same two variables in a population, ρ. The standard error of r—that is, the standard deviation of its sampling distribution—is one way to assess how close our estimate is likely to be to the population parameter ρ. The standard error is

$$\mathrm{SE}_r = \sqrt{\frac{1 - r^2}{n - 2}}.$$

For our example data set, the standard error is

$$\mathrm{SE}_r = \sqrt{\frac{1 - (0.534)^2}{24 - 2}} = 0.180.$$

It is not ideal to use this quantity to calculate a confidence interval, because the sampling distribution of r is not normal. However, we will use SEr later in hypothesis testing. Calculating a confidence interval requires the modified method shown next.

Approximate confidence interval

The 95% confidence interval for ρ puts bounds on our estimate of the population correlation, identifying the range of values that are compatible with the data. Fisher discovered an approximate confidence interval for ρ. The approximation is best when sample size is large. With this method, we convert r to a new quantity called z that approximately follows a normal sampling distribution. The Fisher’s z-transformation is

$$z = 0.5 \ln\left(\frac{1 + r}{1 - r}\right),$$

where ln is the natural logarithm. The z-transform of ρ is symbolized as ζ (the lowercase Greek letter zeta), so z represents the value in a sample, and ζ is the true value in the population. The standard error of the sampling distribution for z is approximately

$$\sigma_z = \sqrt{\frac{1}{n - 3}}.$$

For the booby aggression data (Example 16.1),

$$z = 0.5 \ln\left(\frac{1 + 0.534}{1 - 0.534}\right) = 0.595,$$

and

$$\sigma_z = \sqrt{\frac{1}{24 - 3}} = 0.218.$$

The sampling distribution of the statistic z is approximately normal, so we can use the standard normal distribution to generate the 95% confidence interval for ζ:

$$z - 1.96\,\sigma_z < \zeta < z + 1.96\,\sigma_z.$$

The quantity 1.96 is the value of Zcrit, the two-tailed critical value of the standard normal distribution corresponding to α = 0.05 (Statistical Table B; Pr[Z > 1.96] = 0.025). More generally, to obtain a 1 − α confidence interval, find the value of Zcrit such that Pr[Z > Zcrit] = α/2. For the booby aggression data, the 95% confidence interval for ζ is

$$0.595 - 1.96(0.218) < \zeta < 0.595 + 1.96(0.218)$$
$$0.168 < \zeta < 1.023.$$

To complete the analysis, we convert the lower and upper bounds of this confidence interval back to the original correlation scale using the inverse of Fisher’s transformation,

$$r = \frac{e^{2z} - 1}{e^{2z} + 1},$$

where e is the base of the natural logarithm. For our example, this yields

$$\frac{e^{2(0.168)} - 1}{e^{2(0.168)} + 1} < \rho < \frac{e^{2(1.023)} - 1}{e^{2(1.023)} + 1}$$

or

$$0.166 < \rho < 0.771,$$

which, when rounded to two digits, is 0.17 < ρ < 0.77. This calculation shows that the data are consistent with a fairly broad range of values for the population correlation between the number of visits experienced as a nestling and future aggressive behavior. At the same time, we can be reasonably confident that ρ is greater than zero and is well below one. The formula for σz gives only an approximation to the standard deviation of the sampling distribution for z, so the 95% confidence interval for ζ (and therefore ρ) is also an approximation.
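These steps are easy to script. The following sketch (Python again; SciPy is assumed only for the normal critical value) reproduces the booby calculation from r = 0.534 and n = 24:

```python
import numpy as np
from scipy.stats import norm

def fisher_ci(r, n, alpha=0.05):
    """Approximate 1 - alpha confidence interval for rho via Fisher's z."""
    z = 0.5 * np.log((1 + r) / (1 - r))   # Fisher's z-transformation of r
    sigma_z = np.sqrt(1 / (n - 3))        # approximate standard error of z
    z_crit = norm.ppf(1 - alpha / 2)      # 1.96 when alpha = 0.05
    lower_z = z - z_crit * sigma_z
    upper_z = z + z_crit * sigma_z
    # Back-transform the limits to the correlation scale
    back = lambda v: (np.exp(2 * v) - 1) / (np.exp(2 * v) + 1)
    return back(lower_z), back(upper_z)

print(fisher_ci(0.534, 24))               # approximately (0.17, 0.77)
```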
Testing the null hypothesis of zero correlation

The most common use of hypothesis testing in correlation analysis is to test the null hypothesis that the population correlation ρ is exactly zero:6

H0: ρ = 0.
HA: ρ ≠ 0.

Example 16.2 shows how the method works.

EXAMPLE 16.2 What big inbreeding coefficients you have

By 1970, the wolf (Canis lupus) had been wiped out in Norway and Sweden. Around 1980, two wolves immigrated from farther east and founded a new population. By 2002, the new population totaled approximately 100 wolves. A new population started by so few individuals, however, might be expected to suffer problems caused by inbreeding. Liberg et al. (2005) compiled observations on reproduction between 1983 and 2002 and constructed the pedigree of the wolves in the small population. The data listed in Table 16.2-1 show the inbreeding coefficients of litters produced by mated pairs and the number of pups of each litter surviving their first winter. An inbreeding coefficient is zero if parents of the litter were unrelated, 0.25 if parents were brother and sister whose own parents were unrelated, and greater than 0.25 if inbreeding had continued for more generations.

TABLE 16.2-1 Inbreeding coefficients of litters of mated wolf pairs and the number of pups surviving their first winter. n = 24 litters.

Inbreeding coefficient   Number of pups
0.00                     6
0.00                     6
0.13                     7
0.13                     5
0.13                     4
0.19                     8
0.19                     7
0.19                     4
0.22                     4
0.24                     3
0.24                     3
0.24                     3
0.24                     3
0.24                     2
0.24                     2
0.25                     6
0.27                     3
0.30                     5
0.30                     3
0.30                     2
0.30                     1
0.36                     3
0.37                     2
0.40                     3

Are inbreeding coefficients of the litters associated with the number of pups surviving their first winter? A scatter plot of the data, shown in Figure 16.2-1, suggests a negative association between inbreeding coefficient and number of surviving pups.

FIGURE 16.2-1 The number of surviving wolf pups in a litter and inbreeding coefficient. Overlapping points have been offset slightly to render them visible. The total number of litters, n = 24.

The correlation coefficient between inbreeding coefficient and number of pups can be calculated from the following quantities:

$$\sum_i (X_i - \bar{X})(Y_i - \bar{Y}) = -2.612, \quad \sum_i (X_i - \bar{X})^2 = 0.228, \quad \sum_i (Y_i - \bar{Y})^2 = 80.958.$$

Putting these into the formula for r gives

$$r = \frac{-2.612}{\sqrt{0.228}\,\sqrt{80.958}} = -0.608.$$

The observed correlation coefficient is less than zero, but we want to test whether this correlation is sufficiently strong to warrant rejection of the null hypothesis that the population correlation ρ is zero. Our two hypotheses are as follows:

H0: There is no relationship between the inbreeding coefficient and the number of pups (ρ = 0).
HA: The inbreeding coefficient and the number of pups are correlated (ρ ≠ 0).

To test the hypotheses, we calculate the t-statistic,

$$t = \frac{r}{\mathrm{SE}_r},$$

where the standard error (SEr) is calculated as

$$\mathrm{SE}_r = \sqrt{\frac{1 - r^2}{n - 2}}.$$

Under the null hypothesis of zero correlation, the sampling distribution of the t-statistic is a Student’s t-distribution with n − 2 degrees of freedom.7 For the wolf data,

$$\mathrm{SE}_r = \sqrt{\frac{1 - (-0.608)^2}{24 - 2}} = 0.169,$$

and so

$$t = \frac{-0.608}{0.169} = -3.60.$$

The P-value for this t-statistic is P = 0.002, obtained using a computer. Using Statistical Table C instead, the critical value for the t-distribution having 22 degrees of freedom is t0.05(2), 22 = 2.075 with α = 0.05. Since t = −3.60 is less than −2.075 and farther from the value expected by the null hypothesis, P must be less than 0.05. With P = 0.002, we can be confident that inbreeding coefficient and number of pups are negatively related, and we reject the null hypothesis of zero correlation.
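In code, the test needs only r and n. A sketch (Python/SciPy assumed), using the wolf values r = −0.608 and n = 24:

```python
import numpy as np
from scipy import stats

def correlation_t_test(r, n):
    """Two-tailed t-test of H0: rho = 0."""
    se_r = np.sqrt((1 - r ** 2) / (n - 2))   # standard error of r
    t = r / se_r
    p = 2 * stats.t.sf(abs(t), df=n - 2)     # two-tailed P from the t-distribution
    return t, p

t, p = correlation_t_test(-0.608, 24)        # wolf data from Example 16.2
print(t, p)                                  # t close to -3.60, P close to 0.002
```

When the raw data are available, scipy.stats.pearsonr(x, y) carries out the same t-test in a single call.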
Assumptions

The methods used to estimate and test a population correlation assume that the sample of individuals is a random sample from the population. In addition, correlation analysis assumes that the measurements have a bivariate normal distribution in the population. A bivariate normal distribution is a bell-shaped probability distribution in two dimensions rather than one (Figure 16.3-1).

FIGURE 16.3-1 The left panel shows a bivariate normal distribution with a correlation of 0.7 between X and Y. Height above the plane represents the probability density of each pair of values of X and Y. The right panel shows a random sample of 100 observations from the bivariate normal distribution shown on the left.

A bivariate normal distribution has the following features:
■ The relationship between X and Y is linear.
■ The cloud of points in a scatter plot of X and Y has a circular or elliptical shape.
■ The frequency distributions of X and Y separately are normal.

This is not a complete list of the features of the bivariate normal distribution, but it is the set that matters most and is easiest to evaluate with data. Inspecting the scatter plot of the data is probably the best way to check the assumption of bivariate normality. The right panel of Figure 16.3-1, a random sample from a bivariate normal distribution having a correlation of ρ = 0.70, shows an example of what a scatter of points should look like when the assumption of bivariate normality is met. All of the examples shown in the scatter plots of Figure 16.1-2 are also random samples from bivariate normal distributions.

Figure 16.3-2 shows the most frequent types of departures from bivariate normality. These are also the types of departures that most seriously affect correlation analysis:

FIGURE 16.3-2 Data from three distributions that differ from bivariate normality. Scatter plots show a funnel shape (left), an outlier (middle), and a nonlinear relationship between X and Y (right).

■ A cloud of points that is funnel shaped (i.e., wider at one end than at the other)
■ The presence of outliers
■ A relationship between X and Y that is not linear

Histograms depicting the frequency distributions of X and Y separately are also helpful. If either X or Y has a decidedly skewed distribution, then the frequency distribution of the two variables is not bivariate normal.

What do we do if the assumption of bivariate normality is not met? Two strategies are available—namely, transforming the data and nonparametric methods. It is best to try first to transform X, Y, or both variables to see if the assumptions are better met on a new scale. Here are the usual transformations, first described in Chapter 13:
■ The log transformation [an all-purpose transformation, as long as the data are not negative; use log(X + 1) or log(Y + 1) if there are zeros]
■ The square-root transformation (which often works for data that are counts)
■ The arcsine transformation (for data that are proportions)

Log transformations are good to try if the relationship between the two variables is nonlinear or if the variance in one variable seems to increase with the value of the other variable. If transforming the data is unsuccessful, use a nonparametric method instead, such as Spearman’s rank correlation. This method is explained in Section 16.5.
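Each of these transformations is a one-liner in code. A minimal sketch (Python; the variable names and numbers are invented for illustration, not from the text):

```python
import numpy as np

counts = np.array([0, 1, 3, 8, 2, 15, 4, 0, 27, 6])              # hypothetical counts
props = np.array([0.02, 0.10, 0.25, 0.40, 0.55, 0.83, 0.97])     # hypothetical proportions

log_counts = np.log(counts + 1)             # log(X + 1), because zeros are present
sqrt_counts = np.sqrt(counts)               # square root, often works for counts
arcsine_props = np.arcsin(np.sqrt(props))   # arcsine transformation for proportions
```

After transforming, re-examine the scatter plot and the histograms; if the data still look far from bivariate normal, move on to the rank-based method.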
The correlation coefficient depends on the range

The correlation between two variables X and Y depends on the range of values included. For this reason, we must be cautious when comparing correlations between studies that use a different range. For example, the top panel of Figure 16.4-1 shows that the population density of different species of stream invertebrates (Y) is strongly correlated with their body mass (X) when both variables are log-transformed (Schmid et al. 2000). Now imagine a second study of the same two variables in streams whose invertebrates have a smaller range of values of body mass. To see what effect this has, we’ve taken the points falling between the two dashed lines in the top panel and replotted them in the lower panel. The correlation is weaker, because the amount of scatter relative to total variation is increased.

FIGURE 16.4-1 The correlation between two variables depends on the range of X-values included. The bottom graph plots those data points lying between the two dashed lines in the top graph. The data drawn from a smaller range of X have a reduced correlation coefficient. The data are log (base 10) of population density (individuals/m²) and body mass (µg) of different species of stream invertebrates (Schmid et al. 2000).

This effect means that the correlation coefficient between the same two variables is not comparable between separate studies unless they include a comparable range of values.

Spearman’s rank correlation

Many situations require a test of zero correlation between variables that do not meet the assumption of bivariate normality, even after data transformation. For these cases, the nonparametric Spearman’s rank correlation is used. Spearman’s rank correlation uses the ranks of both the X and Y variables to calculate a measure of correlation. It does not make assumptions about the distribution of the variables, but it still assumes that the individuals are randomly chosen from the population. Example 16.5 illustrates the method.

The Spearman’s rank correlation measures the strength and direction of the linear association between the ranks of two variables.

EXAMPLE 16.5 The miracles of memory

How reliable are people’s recollections of having witnessed “miracles”? One way to investigate is to compare different accounts of extraordinary magic tricks. For example, of the many illusions performed by magicians, none is more renowned than the Indian rope trick. In the most sensational version of the trick, a magician tosses into the air one end of a rope, which forms into a rigid pole. A boy climbs up the rope and disappears at the top. The magician scolds the boy to return but gets no reply, whereupon he grabs a knife, climbs the rope, and also disappears. The boy’s body then falls in pieces from the sky into a basket on the ground. The magician descends the rope and retrieves the boy from the basket, revealing him to be unharmed and in one piece.

Wiseman and Lamont (1996) tracked down 21 firsthand, written accounts of the Indian rope trick. They gave a score to each description according to how impressive it was. For example, a score of 1 was given if the observer saw only that “boy climbs up rope, then climbs down again.” The most impressive accounts in the sample, “boy climbs rope, vanishes at the top, reappears in basket in full view of audience,” were given a score of 5, the highest possible. For each account, the researchers also recorded the number of years that had elapsed between the date that the trick was witnessed and the date the memory of it was written down.
The measurements of impressiveness score and number of years elapsed are shown in the scatter plot in Figure 16.5-1. Is there an association between the impressiveness of eyewitness accounts and the time elapsed until the writing of the description? If so, then it might indicate a tendency of human memories to become more exaggerated and less accurate with time.

FIGURE 16.5-1 Impressiveness of written accounts of the Indian rope trick by firsthand observers and the number of years elapsed between witnessing the event and writing the account. n = 21.

A test of the null hypothesis of zero correlation is what we would like to carry out. But the assumption of bivariate normality is clearly violated because the impressiveness score is a discrete ordered score. A test of zero correlation is nevertheless possible using a nonparametric method, because the different categories for the impressiveness score can be ranked as shown in Table 16.5-1. The Spearman’s rank correlation measures the association between the ranks of the two variables. Spearman’s rank correlation is measured by the parameter ρS, which is estimated by rS.

TABLE 16.5-1 Raw data from Example 16.5, and their ranks. Each variable is ranked separately. Midranks are assigned when there are ties. n = 21.

Years elapsed   Rank years   Impressiveness score   Rank impressiveness
2               1            1                      2
5               3.5          1                      2
5               3.5          1                      2
4               2            2                      5
17              5.5          2                      5
17              5.5          2                      5
31              13           3                      7
20              7            4                      12.5
22              8            4                      12.5
25              9            4                      12.5
28              10.5         4                      12.5
29              12           4                      12.5
34              14.5         4                      12.5
43              17           4                      12.5
44              18           4                      12.5
46              19           4                      12.5
34              14.5         4                      12.5
28              10.5         5                      19.5
39              16           5                      19.5
50              20.5         5                      19.5
50              20.5         5                      19.5

The Spearman’s correlation coefficient is the linear correlation coefficient computed on the ranks of the data. The two variables must be ranked separately, from low to high. The data from Figure 16.5-1 are listed in Table 16.5-1. The table also includes the separate rankings of each variable. As first discussed in Section 13.5, we assign midranks when there are ties. The midrank is the average of the ranks associated with a set of tied observations. For example, the three measurements with the lowest impressiveness score (1) were all assigned the midrank 2, which is the average of the three ranks associated with the three lowest values: 1, 2, and 3. The 10 values having impressiveness score 4 are associated with the ranks 8 through 17, and so were assigned the midrank 12.5.

In the following calculations, R refers to the rank of years elapsed, and S refers to the rank of the impressiveness score:

$$\sum_i (R_i - \bar{R})(S_i - \bar{S}) = 566, \quad \sum_i (R_i - \bar{R})^2 = 767.5, \quad \sum_i (S_i - \bar{S})^2 = 678.5,$$

yielding

$$r_S = \frac{566}{\sqrt{767.5}\,\sqrt{678.5}} = 0.784.$$

The hypotheses for the test are

H0: ρS = 0.
HA: ρS ≠ 0,

where ρS refers to the Spearman’s correlation in the population. To determine the P-value for the test, compare rS with the critical value8 given in Statistical Table G corresponding to a sample size of 21: rS(0.05,21) = 0.435. Since rS = 0.784 is greater than 0.435, P < 0.05, and so we reject the null hypothesis (a computer program gave P = 0.0003). We conclude that there is a positive correlation between the impressiveness score of the eyewitness accounts of the Indian rope trick and the number of years elapsed between viewing the trick and the retelling of it in writing. The likely explanation for these findings is that eyewitness accounts of the Indian rope trick are exaggerated,9 becoming more so with time.
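In code, the whole procedure is a single call to scipy.stats.spearmanr, which assigns midranks to ties and computes the linear correlation of the ranks. Applied to the 21 observations in Table 16.5-1:

```python
from scipy import stats

# Data from Table 16.5-1 (years elapsed and impressiveness score)
years = [2, 5, 5, 4, 17, 17, 31, 20, 22, 25, 28,
         29, 34, 43, 44, 46, 34, 28, 39, 50, 50]
score = [1, 1, 1, 2, 2, 2, 3, 4, 4, 4, 4,
         4, 4, 4, 4, 4, 4, 5, 5, 5, 5]

r_s, p = stats.spearmanr(years, score)
print(r_s)   # about 0.78, matching the hand calculation above
print(p)     # a large-sample approximation; it can differ from table-based P-values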
Procedure for large n

Statistical Table G provides critical values for the Spearman’s rank correlation for sample sizes up to n = 100. For larger n, use the procedure for the linear correlation coefficient, but applied to the ranks. Calculate the t-statistic,

$$t = \frac{r_S}{\mathrm{SE}[r_S]},$$

where

$$\mathrm{SE}[r_S] = \sqrt{\frac{1 - r_S^2}{n - 2}}.$$

Under the null hypothesis of no Spearman’s rank correlation in the population (ρS = 0), t is approximately t-distributed with n − 2 degrees of freedom. Reject the null hypothesis of zero rank correlation if $t \ge t_{0.05(2),\,n-2}$ or $t \le -t_{0.05(2),\,n-2}$.

Assumptions of Spearman’s correlation

The Spearman’s rank correlation assumes that observations are a random sample from the population. It also assumes that the relationship between the two numerical variables is monotonic. That is, as one variable increases, the other variable either (1) increases or does not change; or (2) decreases or does not change. Essentially, this means that the relationship between the ranks of the two numerical variables is linear.

The effects of measurement error on correlation

When a variable is not measured perfectly, we say that there is measurement error. Measurement error is difficult to avoid. Some biological traits are extremely challenging to measure, and measurement error can sometimes be an important component of variation (Chapter 15). For example, behavioral traits are notorious for having low repeatability: a behavior measured on one individual might be quite different the next time it is measured on the same individual.

Measurement error is the difference between the true value of a variable for an individual and its measured value.

Measurement error in either X or Y tends to weaken the observed correlation between the variables. The same thing happens if there is measurement error in both X and Y, if the errors in X and Y are uncorrelated. With measurement error, r will tend to underestimate the magnitude of ρ (it will tend to be closer to zero on average than the true correlation), a bias called attenuation (Figure 16.6-1).

FIGURE 16.6-1 Attenuation. In the left panel, X and Y are measured without error and are highly correlated. In the middle panel, Y is measured with error. In the right panel, both X and Y are measured with error, and these errors are uncorrelated. In all cases, the true correlation is very strong, but the correlation appears weaker when the variables are measured with error.

In the Quick Formula Summary (Section 16.8), we include an equation that corrects the estimate of correlation for the effects of measurement error. The method requires that repeated measurements have been made on the same individuals. This corrected correlation, r*, can’t be used in place of r in confidence intervals for ρ or in hypothesis testing. However, it might be useful for comparison with the uncorrected correlation to evaluate how measurement error is affecting the observed correlation between two variables. In general, measurement error can be reduced by taking precise measurements. If this is not possible, then it is best to measure each individual multiple times and use the average measurement in subsequent analysis.
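Attenuation is easy to see by simulation. In the sketch below (Python; all settings are arbitrary choices for the demonstration, not from the text), two variables with a strong true correlation are each measured with independent error, and the observed correlation shrinks toward zero:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
true_x = rng.normal(size=n)
noise = rng.normal(scale=np.sqrt(1 - 0.9 ** 2), size=n)
true_y = 0.9 * true_x + noise                     # true correlation is about 0.9

# Add independent measurement error to each variable
obs_x = true_x + rng.normal(scale=0.5, size=n)
obs_y = true_y + rng.normal(scale=0.5, size=n)

print(np.corrcoef(true_x, true_y)[0, 1])          # near 0.9
print(np.corrcoef(obs_x, obs_y)[0, 1])            # noticeably closer to zero
```

With these settings the observed correlation drops from about 0.9 to roughly 0.9/1.25 ≈ 0.72, which is exactly the kind of bias the corrected correlation r* is meant to undo.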
Summary
■ The correlation coefficient (r) measures the strength and direction of the association between two numerical variables.
■ Correlation implies association, not causation. It is appropriate when two variables are measured on a sample of individuals whether or not there is a causal connection between the variables.
■ The correlation coefficient ranges from −1 (the maximum negative correlation) to 0 (no correlation) to 1 (the maximum positive correlation).
■ Analysis using the correlation assumes that the two numerical variables have a bivariate normal distribution and that the individuals are randomly sampled.
■ With a bivariate normal distribution, the relationship between X and Y is linear, the cloud of points in a scatter plot of X and Y is circular or elliptical in shape, and the frequency distributions of X and Y separately are normal. (This is just a partial list of its features.)
■ The scatter plot is a useful tool for examining the assumption of bivariate normality. Histograms of X and Y should both appear normal.
■ The Spearman’s rank correlation measures the linear correlation between the ranks of two variables, where each variable is ranked separately from low to high.
■ The correlation between two variables is expected to be weaker when only a narrow range of X values is represented.
■ Measurement error biases the estimate of a correlation coefficient toward zero.

Quick Formula Summary

Shortcuts
Sum of products: $\sum_i (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_i X_i Y_i - \frac{\left(\sum_i X_i\right)\left(\sum_i Y_i\right)}{n}$
Sums of squares: $\sum_i (X_i - \bar{X})^2 = \sum_i X_i^2 - \frac{\left(\sum_i X_i\right)^2}{n}$ and $\sum_i (Y_i - \bar{Y})^2 = \sum_i Y_i^2 - \frac{\left(\sum_i Y_i\right)^2}{n}$

Covariance
What is it for? Measuring the strength of an association between two numerical variables.
Estimate: Covariance(X, Y)
Formula: $\mathrm{Covariance}(X, Y) = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$

Correlation coefficient
What is it for? Measuring the strength of a linear association between two numerical variables.
What does it assume? Bivariate normality and random sampling.
Parameter: ρ
Estimate: r
Formula: $r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2}\,\sqrt{\sum_i (Y_i - \bar{Y})^2}}$
Standard error: $\mathrm{SE}_r = \sqrt{\frac{1 - r^2}{n - 2}}$
Degrees of freedom: n − 2
Alternate formula: $r = \frac{\mathrm{Covariance}(X, Y)}{s_X s_Y}$, where Covariance(X, Y) is the covariance between X and Y and where sX and sY are the sample standard deviations of X and Y, respectively.

Confidence interval (approximate) for a population correlation
What does it assume? The sample is a random sample. The numerical variables X and Y have a bivariate normal distribution in the population. Sample size is not too small for the approximation.
Parameter: ρ
Estimate: r
Formula: $z - Z_{\mathrm{crit}}\,\sigma_z < \zeta < z + Z_{\mathrm{crit}}\,\sigma_z$, where $z = 0.5 \ln\left(\frac{1 + r}{1 - r}\right)$ is Fisher’s z-transformation of r, $\sigma_z = \sqrt{\frac{1}{n - 3}}$ is the approximate standard error of z, and Zcrit is the critical value of the standard normal distribution for which Pr[Z > Zcrit] = α/2. To obtain the confidence interval for ρ, back-transform the limits of the resulting confidence interval using the inverse of Fisher’s transformation, $r = \frac{e^{2z} - 1}{e^{2z} + 1}$.

The t-test of zero linear correlation
What is it for? To test the null hypothesis that the population parameter (ρ) is zero.
What does it assume? Bivariate normality and random sampling.
Test statistic: t
Distribution under H0: t-distributed with n − 2 degrees of freedom.
Formula: $t = \frac{r}{\mathrm{SE}_r}$, where $\mathrm{SE}_r = \sqrt{\frac{1 - r^2}{n - 2}}$ is the standard error of r.

Spearman’s rank correlation
What is it for? To measure correlation between the two variables, when the variables do not meet the assumptions of correlation.
What does it assume? A linear relation between the ranks of X and Y and random sampling.
Parameter: ρS
Estimate: rS
Formula: Same as for linear correlation but calculated on ranks.

Spearman’s rank correlation test
What is it for? To test the null hypothesis that the rank correlation in the population (ρS) is zero.
What does it assume? A linear relation between the ranks of X and Y, and random sampling.
Test statistic when n ≤ 100: rS
Distribution under H0: Distribution of the Spearman’s rank correlation (Statistical Table G).
Spearman's rank correlation
What is it for? To measure correlation between two variables when the variables do not meet the assumptions of correlation.
What does it assume? A linear relation between the ranks of X and Y, and random sampling.
Parameter: ρS
Estimate: rS
Formula: Same as for linear correlation but calculated on ranks.

Spearman's rank correlation test
What is it for? To test the null hypothesis that the rank correlation in the population (ρS) is zero.
What does it assume? A linear relation between the ranks of X and Y, and random sampling.
Test statistic when n ≤ 100: rS
Distribution under H0: Distribution of the Spearman's rank correlation (Statistical Table G).
Test statistic when n > 100: t
Distribution under H0: t-distributed with n − 2 degrees of freedom.
Formula: $t = \frac{r_S}{\mathrm{SE}[r_S]}$, where $\mathrm{SE}[r_S] = \sqrt{\frac{1 - r_S^2}{n - 2}}$ is the standard error of rS.

Correlation corrected for measurement error
What does it assume? That X and Y have been measured two or more times independently on all individuals, and that measurement error in X is uncorrelated with measurement error in Y. The following formula assumes that the correlation r between X and Y is measured using the average of the repeat measurements for every individual.
Parameter: ρ
Estimate: r*
Formula: $r^* = \frac{r}{\sqrt{R_X R_Y}}$, where r is calculated from the average of the repeat measurements for every individual (Adolph and Hardin 2007). $R_X$ and $R_Y$ are the repeatabilities of X and Y, respectively. Repeatability here is similar to that described in Chapter 15, except that here the repeatability is for the average values of repeat measurements made on each individual rather than single measurements. The formula for $R_X$ is $R_X = \frac{s_A^2}{s_A^2 + (\mathrm{MS}_{\mathrm{error}}/m)}$, where m is the number of repeat measurements made on each individual and $s_A^2 = \frac{\mathrm{MS}_{\mathrm{groups}} - \mathrm{MS}_{\mathrm{error}}}{m}$. $\mathrm{MS}_{\mathrm{groups}}$ and $\mathrm{MS}_{\mathrm{error}}$ are calculated from a random-effects ANOVA on the repeat measurements for X, as explained in Section 15.6. $R_Y$ is calculated in the same way using measurements of the Y-variable. Confidence intervals for the corrected correlation coefficient are discussed in Charles (2005).

PRACTICE PROBLEMS

1. Calculation practice: Estimate a correlation coefficient. In their study of hyena laughter, or "giggling" (see Chapter 12, Assignment Problem 26), Mathevon et al. (2010) asked whether sound spectral properties of hyenas' giggles are associated with age. The accompanying figure and data show the giggle fundamental frequency (in hertz) and the age (in years) of 16 hyenas. What is the correlation coefficient in the data, and what is the most-plausible range of values for the population correlation?

Age (years)                  2    2    2    6    9   10   13   10   14   14   12    7   11   11   14   20
Fundamental frequency (Hz)  840  670  580  470  540  660  510  520  500  480  400  650  460  500  580  500

a. What type of graph is this? Does it suggest a positive or negative association?
b. Calculate the sum of squares for age.
c. Calculate the sum of squares for fundamental frequency.
d. Calculate the sum of products between age and frequency.
e. Compute the correlation coefficient, r.
f. Compute the Fisher's z-transformation of the correlation coefficient.
g. Calculate the approximate standard error of the z-transformed correlation.
h. What is Zcrit, the two-tailed critical value of the standard normal distribution corresponding to α = 0.05?
i. Calculate the lower and upper bounds of the 95% confidence interval for ζ, the z-transformed population correlation.
j. Transform the lower and upper bounds of the confidence interval for ζ, yielding the 95% confidence interval for ρ.

2. Calculation practice: Standard error and hypothesis testing for a correlation. Refer to Practice Problem 1. Test whether there is a correlation in the population between giggle fundamental frequency and age.
a. State the null and alternative hypotheses.
b. Calculate the standard error of the sample correlation r.
c. Calculate the t-statistic.
d. What is the sample size? What are the degrees of freedom?
e. Obtain the critical value for t corresponding to α = 0.05.
f. What is the conclusion from the test?
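The hand calculations in Practice Problems 1 and 2 can be verified by computer. For example, in Python, scipy.stats.pearsonr returns r together with the two-tailed P-value of the t-test of zero correlation (a minimal sketch; the variable names are ours):

```python
from scipy import stats

age  = [2, 2, 2, 6, 9, 10, 13, 10, 14, 14, 12, 7, 11, 11, 14, 20]
freq = [840, 670, 580, 470, 540, 660, 510, 520, 500, 480,
        400, 650, 460, 500, 580, 500]   # giggle fundamental frequency (Hz)

r, p = stats.pearsonr(age, freq)   # r and two-tailed P for H0: rho = 0
print(r, p)
```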
3. Calculation practice: Spearman rank correlation. As human populations became more urban from prehistory to the present, disease transmission between people likely increased. Over time, this might have led to the evolution of enhanced resistance to certain diseases in settled human populations. For example, a mutation in the SLC11A1 gene in humans causes resistance to tuberculosis. Barnes et al. (2011) examined the frequency of the resistant allele in different towns and villages in Europe and Asia and compared it to how long humans had been settled in the site ("duration of settlement"). If settlement led to the evolution of greater resistance to tuberculosis, there should be a positive association between the frequency of the resistant allele and the duration of settlement. The data are below (durations based on dates BC have been rounded). The relationship appears curvilinear, so we will use the Spearman's correlation to test an association.

Settlement          Date       Duration (years)   Allele frequency
Catal Höyük         6000 BC    8010               0.990
Susa, other         3250 BC    5260               1.000
Harappa             2725 BC    4735               0.948
Sanxingdui          2000 BC    4010               0.885
Knossos, other      1700 BC    3710               0.947
Carthage            800 BC     2810               0.854
Tarquinia, other    720 BC     2730               1.000
Angkor Borei        300 BC     2310               0.769
Tong'gorou          100 BC     2110               0.956
Colchester          55 AD      1955               0.979
Aksum               100 AD     1910               0.865
Nara                710 AD     1300               0.922
Yakutsk             1632 AD    378                0.821
Bathurst            1816 AD    194                0.842
Blantyre            1880 AD    130                0.734
Kiruna              1900 AD    110                0.766
Juba                1919 AD    91                 0.772

a. Rank duration from low to high.
b. Rank allele frequency from low to high, assigning midranks to ties.
c. Calculate the sum of squares for the ranks of duration, the sum of squares for the ranks of allele frequency, and the sum of products.
d. Compute the Spearman's correlation coefficient.
e. What is the sample size?
f. State the null and alternative hypotheses.
g. Obtain the critical value for the Spearman's correlation corresponding to α = 0.05.
h. Draw the appropriate conclusion.
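As a check on Problem 3, scipy.stats.spearmanr ranks each variable (assigning midranks to ties) and computes the rank correlation. Note that its P-value is based on a large-sample approximation, whereas for n = 17 the procedure in this chapter compares rS with the critical values in Statistical Table G:

```python
from scipy import stats

duration = [8010, 5260, 4735, 4010, 3710, 2810, 2730, 2310, 2110,
            1955, 1910, 1300, 378, 194, 130, 110, 91]
freq = [0.990, 1.000, 0.948, 0.885, 0.947, 0.854, 1.000, 0.769, 0.956,
        0.979, 0.865, 0.922, 0.821, 0.842, 0.734, 0.766, 0.772]

rs, p = stats.spearmanr(duration, freq)   # midranks are used for ties
print(rs, p)
```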
4. Visually estimate the value of the correlation coefficient in each of the four following scatter plots.

5. Birds of many species retain the same breeding partner year after year. In some of these species, male and female partners migrate separately and spend the winter in different places, often thousands of kilometers apart. Yet they manage to find one another again each spring. In a field study of individually banded pairs of black-tailed godwits, Gunnarsson et al. (2004) recorded spring arrival dates of males and females on the breeding grounds in the year after they were observed breeding together. The data for 10 pairs are provided in the accompanying table. Arrival date is measured as the number of days since March 31.

Female arrival date   Male arrival date
24                    22
36                    35
35                    38
50                    35
35                    44
46                    50
55                    55
56                    57
69                    56
56                    59

a. Display the relationship between arrival dates of males and females in a graph. What type of graph did you use?
b. Describe the pattern in part (a) briefly. Is there a relationship? Is it positive or negative? Is it linear or nonlinear? Is it weak or strong?
c. Calculate the correlation coefficient between arrival dates of male and female godwits. Include a standard error for your estimate.
d. What does the standard error in part (c) refer to?
e. Calculate an approximate 95% confidence interval for ρ.

6. Answer the following questions using the data for the godwits in Practice Problem 5.
a. Adding 30 to each of the observations for males converts arrival dates to "days since March 1" rather than March 31. How is the correlation coefficient between arrival dates of males and females affected? What can you conclude about the effects on the correlation coefficient of adding a constant to one or both of the variables?
b. Dividing female arrival dates by seven converts their arrival dates to "weeks since March 31" rather than days. How does this affect the correlation between male and female arrival dates? What can you conclude about the effects on the correlation coefficient of multiplying one or both of the variables by a constant?

7. Use the godwit data in Practice Problem 5 to test whether the mean arrival dates of male and female partners differ significantly. What assumptions are required?

8. When measuring a correlation between two variables, under what circumstances would it be best to make and average several repeat measurements of each subject for a given variable, rather than measure the variable only once on each subject?

9. In large wolf populations, most inbreeding coefficients of litters are close to zero, and very few are as high as or higher than 0.25. What effect is this narrower range of inbreeding coefficients expected to have on the correlation between the number of pups surviving and inbreeding coefficient, compared with that measured in the Scandinavian population of Example 16.2? Explain.

10. Large males of the European earwig, Forficula auricularia, develop abdominal forceps, which are used in fighting and courtship. Smaller males do not develop the forceps. Tomkins and Brown (2004) compared the proportion of males having forceps on islands in the North Sea with the population density of earwigs, measured as the number caught per trap. Their data are listed in the following table.

Island   Earwig density (number per trap)   Proportion of males with forceps
1        0.3                                0.04
2        5.2                                0.02
3        2.5                                0.07
4        25.6                               0.06
5        8.1                                0.13
6        0.3                                0.15
7        4.7                                0.20
8        3.3                                0.24
9        7.8                                0.25
10       20.0                               0.19
11       31.0                               0.22
12       25.0                               0.32
13       43.3                               0.30
14       33.9                               0.44
15       33.9                               0.52
16       32.7                               0.55
17       33.8                               0.62
18       12.7                               0.66
19       57.0                               0.46
20       52.5                               0.38
21       64.0                               0.38
22       70.4                               0.46

a. The distribution of the two variables is not bivariate normal, and transforming the data does not improve matters. Choosing the most appropriate method, test whether the two variables are correlated.
b. What are your assumptions in part (a)?

11. Earwig density on an island and the proportion of males with forceps are estimates, so the measurements of both variables include sampling error. In light of this fact, would the true correlation between the two variables tend to be larger, smaller, or the same as the measured correlation?

12. According to the immunocompetence handicap hypothesis, males of a species evolve high reproductive effort to the point that they divert resources away from immune function. To test this, Simmons and Roberts (2005) measured sperm viability and immune function in lab-raised male crickets to test whether male reproductive effort affects immune function. The data in the following graph show male sperm viability and lysozyme activity, an important defense against bacterial infection. Each point is the average of the males in a single family of crickets. Lysozyme activity is measured as the area of the clear region around an inoculation of 2 ml of hemolymph onto an agar plate containing bacteria. The total sample size n is 41.
a. What assumption of linear correlation analysis is violated by these data? Explain.
b. Assuming that transforming the data doesn't help, what is the most appropriate method to test the null hypothesis of no correlation between male sperm viability and male lysozyme activity?
c. When we applied the most appropriate method to these data, we obtained the following numbers based on the ranks of sperm viability and lysozyme activity:
Sum of products: −1744.5
Sum of squares (sperm viability): 5726.0
Sum of squares (lysozyme activity): 5729.5
Using these figures, test the hypothesis that sperm viability and male lysozyme activity are correlated.
d. What assumptions have you made in part (c)?

13. Left-handed people have an advantage in many sports, and it has been suggested that left-handedness might have been advantageous in hand-to-hand fights in early societies. (Left-handed people can get a lot of practice against right-handed opponents, whereas right-handers are less experienced against lefties.) To explore this potential advantage, Faurie and Raymond (2005) compared the frequency of left-handed individuals in traditional societies with measures of the level of violence in those societies. The following table lists the data for one index of violence, the rate of homicide, measured in number/1000 people/year.

Society    Percent left-handed   Homicide rate
Dioula     3.5                   0.01
Ntumu      8.2                   0.02
Kreyol     6.6                   0.03
Inuit      6.4                   0.17
Baka       10.2                  0.50
Jimi       13.0                  5.37
Eipo       20.4                  3.02
Yanomamo   22.7                  3.98

a. Use a graph to illustrate the association between the two variables.
b. What assumption of linear correlation analysis is violated by these data? Explain.
c. Before resorting to a nonparametric method, what strategy is available to test for a correlation between percent left-handedness and homicide rate?
d. Carry out this strategy, using a scatter plot to assess your success in meeting assumptions.
e. Using your results from part (d), test whether the two variables are correlated.

14. Does stress age you? As part of an investigation, Epel et al. (2004) measured telomere length in blood mononuclear cells of healthy premenopausal women, each of whom was the biological mother and caregiver of a chronically ill child. Telomeres are complexes of DNA and protein that cap chromosomal ends. They tend to shorten with cell divisions and with advancing age. A scatter plot of the data is shown below. Telomere length is measured as a ratio compared to a standard. Chronicity is the number of years since the child's diagnosis. The data in the scatter plot can be summarized as follows:
Sum of products: −8.636
Sum of squares for chronicity: 327.436
Sum of squares for telomere length: 1.228
Total sample size (n): 38
a. Describe the pattern in the scatter plot briefly in words. Is there a relationship? Is it positive or negative? Is it linear or nonlinear? Is it weak or strong?
b. Calculate the linear correlation between telomere length and the chronicity of caregiving.
c. Calculate a 95% confidence interval for the population correlation.
d. Provide an interpretation for the interval in part (c). What does the interval represent?
e. What are your assumptions in part (c)? Does the scatter plot support these assumptions? Explain.
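When a problem supplies only summary quantities, as in Problem 14, r and its standard error follow directly from the sum of products and the sums of squares. A minimal sketch (variable names ours):

```python
import math

# Quantities given in Practice Problem 14.
sum_products = -8.636
ss_chronicity = 327.436
ss_telomere = 1.228
n = 38

r = sum_products / math.sqrt(ss_chronicity * ss_telomere)
se_r = math.sqrt((1 - r**2) / (n - 2))   # standard error of r
print(r, se_r)                           # r is about -0.43
```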
ASSIGNMENT PROBLEMS

15. Does learning a second language change brain structure? Mechelli et al. (2004) tested 22 native Italian speakers who had learned English as a second language. Proficiencies in reading, writing, and speech were assessed using a number of tests whose results were summarized by a proficiency score. Gray-matter density was measured in the left inferior parietal region of the brain using a neuroimaging technique, as mm³ of gray matter per voxel. (A voxel is a picture element, or "pixel," in three dimensions.) The data are listed in the accompanying table.
a. Display the association between the two variables in a scatter plot.
b. Calculate the correlation between second-language proficiency and gray-matter density.
c. Test the null hypothesis of zero correlation.
d. What are your assumptions in part (c)?
e. Does the scatter plot support these assumptions? Explain.
f. Do the results demonstrate that second-language proficiency affects gray-matter density in the brain? Why or why not?

Proficiency score for second language   Gray-matter density (mm³/voxel)
0.26                                    −0.070
0.44                                    −0.080
0.89                                    −0.008
1.26                                    −0.009
1.69                                    −0.023
1.97                                    −0.009
1.98                                    −0.036
2.24                                    −0.029
2.24                                    −0.008
2.58                                    −0.023
2.50                                    −0.006
2.75                                    −0.008
3.25                                    −0.006
3.85                                    0.022
3.04                                    0.018
2.55                                    0.023
2.50                                    0.022
3.11                                    0.036
3.18                                    0.059
3.52                                    0.062
3.59                                    0.049
3.40                                    0.033

16. In an increasingly urban world, are there psychological benefits to biodiversity? Fuller et al. (2007) measured the number of plant, bird, and butterfly species in 15 urban green spaces of varying size in Sheffield, England, a city of more than a half-million people. They also interviewed 312 green-space users and asked a series of questions related to the degree of psychological well-being obtained from green-space use. From the answers, the researchers obtained a measure of user "attachment" to green spaces (strength of emotional ties). Their results are in the following table.

DATA TABLE FOR PROBLEM 16

Site   Attachment   Area (ha)   Number of butterfly species   Number of bird species   ln(number of plant species)
A      4.4          23.8        6                             12                       5.1
B      4.5          16.0        14                            18                       5.5
C      4.7          6.9         8                             8                        6.4
D      4.5          2.3         10                            17                       4.7
E      4.3          5.7         6                             7                        5.3
F      3.8          1.2         5                             4                        4.6
G      4.4          1.4         5                             8                        4.5
H      4.6          15.0        7                             22                       5.5
I      4.1          3.1         9                             7                        5.2
J      4.2          3.8         5                             4                        4.6
K      4.6          7.6         10                            11                       4.5
L      4.2          12.9        9                             11                       5.0
M      4.3          4.0         12                            13                       5.0
N      4.4          5.6         11                            16                       5.6
O      4.2          4.9         7                             7                        5.4

a. Which of the three measures of green-space biodiversity (number of butterfly species, number of bird species, and ln number of plant species) is most strongly correlated with the "attachment" variable? Provide a standard error with each of your correlations.
b. Provide an approximate 95% confidence interval for each of your correlations in part (a).

17. Use the data in Assignment Problem 16 to calculate a 95% confidence interval for the correlation between attachment and green-space area.
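For Assignment Problems 16 and 17, the correlations and their standard errors can be checked by computer from the data table above; one possible Python sketch (variable names ours):

```python
import numpy as np

# Data from the table for Assignment Problem 16 (15 green spaces).
attachment  = np.array([4.4, 4.5, 4.7, 4.5, 4.3, 3.8, 4.4, 4.6, 4.1, 4.2, 4.6, 4.2, 4.3, 4.4, 4.2])
area        = np.array([23.8, 16.0, 6.9, 2.3, 5.7, 1.2, 1.4, 15.0, 3.1, 3.8, 7.6, 12.9, 4.0, 5.6, 4.9])
butterflies = np.array([6, 14, 8, 10, 6, 5, 5, 7, 9, 5, 10, 9, 12, 11, 7])
birds       = np.array([12, 18, 8, 17, 7, 4, 8, 22, 7, 4, 11, 11, 13, 16, 7])
ln_plants   = np.array([5.1, 5.5, 6.4, 4.7, 5.3, 4.6, 4.5, 5.5, 5.2, 4.6, 4.5, 5.0, 5.0, 5.6, 5.4])

n = len(attachment)
for name, var in [("butterflies", butterflies), ("birds", birds),
                  ("ln plants", ln_plants), ("area", area)]:
    r = np.corrcoef(attachment, var)[0, 1]
    se_r = np.sqrt((1 - r**2) / (n - 2))   # standard error of r (Section 16.8)
    print(f"attachment vs {name}: r = {r:.3f}, SE = {se_r:.3f}")
```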
18. Pacific salmon return from the ocean to streams to spawn and die, bringing with them a lot of nutrients from one ecosystem to another. Bears kill and bring onto land up to half of the salmon present in the river, where the fish remains decompose, fertilizing the forest at the stream edge. Hocking and Reynolds (2011) measured the association between the density of salmon in streams (kg/m) and the abundance of the aptly named salmonberry, Rubus spectabilis (measured as the proportion of herb cover that is salmonberry). The graph on the next page shows the relationship between the square-root-transformed salmon density ($Y' = \sqrt{Y + 1/2}$) and the arcsine-square-root-transformed salmonberry abundance. The data are available at http://whitlockschluter.zoology.ubc.ca. Here are some intermediate computations for the square root of salmon density (X) and the arcsine square root of salmonberry density (Y):

$n = 50$, $\sum_{i=1}^{n} X_i = 113.86$, $\sum_{i=1}^{n} X_i^2 = 353.82$, $\sum_{i=1}^{n} Y_i = 15.14$, $\sum_{i=1}^{n} Y_i^2 = 7.65$, $\sum_{i=1}^{n} X_i Y_i = 45.42$

a. Why is it a good idea to transform these data?
b. What other transformations might have been attempted instead? Try one of them (using the data on whitlockschluter.zoology.ubc.ca) and see if the result is as effective. Describe your actions.
c. Before calculating the correlation coefficient, predict from the figure whether the correlation coefficient will be positive or negative.
d. Calculate the correlation coefficient between salmon density and salmonberry density, transformed as in the figure. Include the standard error for the coefficient.
e. Test the null hypothesis that this correlation is zero.
f. In what way would these data support a hypothesis that nutrients from salmon are good for salmonberries?

19. The following data are from a laboratory experiment by Smallwood et al. (1998) in which liver preparations from five rats were used to measure the relationship between the administered concentration of taurocholate (a salt normally occurring in liver bile) and the unbound fraction of taurocholate in the liver.

Rat   Concentration (µM)   Unbound fraction
1     3                    0.63
2     6                    0.44
3     12                   0.31
4     24                   0.19
5     48                   0.13

a. Calculate the correlation coefficient between the taurocholate unbound fraction and the concentration.
b. Plot the relationship between the two variables in a graph.
c. Examine the plot in part (b). The relationship appears to be maximally strong, yet the correlation coefficient you calculated in part (a) is not near the maximum possible value. Why not?
d. What steps would you take with these data to meet the assumptions of correlation analysis?

20. If you are having trouble solving homework problems, should you sleep on it and try again in the morning? Huber et al. (2004) asked 10 participants to perform a complex spatial learning task on the computer just before going to sleep. EEG recordings were then taken of the electrical activity of brain cells during their sleep. The magnitude of the increase in their "slow-wave" sleep after learning the complex task, compared to baseline amounts, is listed in the following table for all 10 participants. Also provided is the increase in performance recorded when the participants were challenged with the same task upon waking.

Increase in slow-wave sleep (%)   Improvement in task performance (%)
8                                 8
14                                3
13                                0
15                                0
17                                8
18                                15
31                                14
32                                10
44                                27
54                                26

a. Calculate the correlation coefficient between the magnitude of the increase in slow-wave sleep and the magnitude of the improvement in performance upon waking.
b. What is the standard error for your estimate in part (a)?
c. Provide an interpretation of the quantity you calculated in part (b). What does it measure?
d. Test the hypothesis that the two variables are correlated in the population.
e. Is this an observational or an experimental study? Explain.

21. Both of the variables in Assignment Problem 20 are measurements that include some measurement error.
a. How would this measurement error affect the correlation between the two variables?
b. What steps could be taken in the design of the study to minimize the effect of measurement error?
c. For a given variable, what quantity is used to estimate the proportion of the variance among subjects not attributable to measurement error?
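A quick computational check for Assignment Problem 20, assuming the pairing of values shown in the table above (a sketch, not a worked solution):

```python
import numpy as np
from scipy import stats

sws  = [8, 14, 13, 15, 17, 18, 31, 32, 44, 54]   # % increase in slow-wave sleep
perf = [8, 3, 0, 0, 8, 15, 14, 10, 27, 26]       # % improvement in performance

r, p = stats.pearsonr(sws, perf)                 # r and two-tailed P-value
se_r = np.sqrt((1 - r**2) / (len(sws) - 2))      # standard error of r
print(r, se_r, p)
```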
22. Logging in western North America impacts populations of western trillium, a long-lived perennial that inhabits conifer forests (Trillium ovatum; see the photo at the beginning of the chapter). Jules and Rathcke (1999) measured attributes of eight local populations of western trillium, confined to forest patches of varying size created by logging in southwestern Oregon. Their data, presented in the following table, compare estimates of recruitment (the density of new plants produced in each population per year) at each site with the distance from the site to the edge of the forest fragment.

Local population   Distance to clear-cut edge (m)   Recruitment
1                  67                               0.0053
2                  65                               0.0021
3                  61                               0.0069
4                  30                               0.0006
5                  84                               0.0124
6                  97                               0.0045
7                  16                               0.0028
8                  332                              0.0182

a. Display these data in an appropriate graph. Examine the graph and describe the shape of the distribution. What departures from the assumptions of correlation analysis do you detect?
b. Choose a transformation and transform one or both of the two variables. Plot the results. Did the transformation solve the problem? If not, try a different transformation.
c. Using the transformed data, estimate the correlation coefficient between the two variables. Provide a standard error with your estimate.
d. Calculate an approximate 95% confidence interval for the correlation coefficient.

23. Cocaine is thought to affect the brain by blocking the dopamine transporter, increasing the amount of dopamine in the nerve synapse. To investigate this idea, Volkow et al. (1997) administered intravenous doses of 0.3 to 0.6 mg/kg of cocaine to volunteers. They used PET scans to compare the magnitude of the perceived "high" of regular cocaine users with the percentage of dopamine receptors blocked. The results for 34 subjects are illustrated below. Full data are available at http://whitlockschluter.zoology.ubc.ca.
a. Using the following quantities, calculated from these data, estimate the correlation between the percentage of dopamine receptors blocked and subjects' ratings of the cocaine high. Provide a standard error with your estimate.
Sum of products: 957.5
Sum of squares (receptors blocked): 8145.441
Sum of squares (rating of high): 372.5
b. Calculate a 99% confidence interval for the correlation in the population.
c. What are your assumptions in part (b)?
d. Imagine the following scenario: A second team of researchers carried out a similar study using the same population and sample size. They used a narrower range of intravenous doses of cocaine in their experiment, which led to a smaller range of values than in the first study for the percentage of dopamine receptors blocked. When they analyzed their results, they found only a low correlation between the percentage of dopamine receptors blocked and the perceived high. In their published report, they concluded that the true correlation between these variables is much lower than estimated in the Volkow et al. study. Who is right? Explain.

24. In a study of the relationship between personality and humor appreciation, Mobbs et al. (2005) measured two dimensions of personality, neuroticism and extroversion, in 17 healthy volunteers. Scores along the personality dimensions were based on a 60-item, self-report questionnaire. What is the association between these two personality dimensions?

Subject   Extroversion   Neuroticism
1         43             49
2         46             53
3         48             67
4         48             57
5         48             56
6         50             48
7         51             60
8         51             41
9         53             51
10        58             47
11        62             41
12        63             51
13        63             30
14        63             28
15        67             55
16        67             47
17        67             39

a. Plot the data in a graph.
b. What is the correlation coefficient in the data, and what is the most-plausible range of values for the population correlation? Use the 95% confidence interval.

25. Refer to Problem 24. Carry out a formal test of the null hypothesis of zero correlation in the population between the two behavioral dimensions.
26. There is evidence that higher consumption of foods containing chemicals called flavonols—including cocoa, red wine, green tea, and some fruits—increases brain function in several ways. Messerli (2012) asked whether chocolate consumption in a country is correlated with the number of Nobel Prizes for the country over all time. The data are below. Both chocolate consumption and number of Nobel Prizes are scaled to the number of people in each country.

Country          Chocolate consumption (kg/person/year)   Nobel Prizes (per 100 million)
Australia        4.5                                      5.5
Austria          8.5                                      24.4
Belgium          4.4                                      8.6
Brazil           2.9                                      0
Canada           3.9                                      6
China            0.7                                      0
Denmark          8.5                                      25.3
Finland          7.3                                      7.6
France           6.3                                      9
Germany          11.6                                     12.7
Greece           2.5                                      1.9
Ireland          8.8                                      12.8
Italy            3.7                                      3.2
Japan            1.8                                      1.4
Netherlands      4.5                                      11.5
Norway           9.4                                      23.4
Poland           3.5                                      3.1
Portugal         1.9                                      2.2
Spain            3.6                                      1.7
Sweden           6.4                                      31.9
Switzerland      11.9                                     32.8
United Kingdom   9.7                                      18.8
United States    5.3                                      10.6

a. Plot and examine these data. What challenges do you anticipate if your goal is to test whether chocolate consumption and number of Nobel Prizes are correlated? Describe any issue you identify.
b. Without transforming the data, test for an association between the two variables using an appropriate method.
c. Interpret the findings of the study appropriately. Does chocolate consumption increase the probability of winning a Nobel Prize? Should it be recommended as a national priority?

27. Biopsy is often used to distinguish cancerous from harmless tumors before resorting to surgery. Ridgway et al. (2004) investigated the ability of MIB-1 monoclonal antibodies, which detect rapidly proliferating cells with staining, to distinguish known breast tumor types from biopsies on a postoperative sample. The following measurements were taken to determine whether the MIB-1 index measured on biopsy is associated with whole tumor size. The MIB-1 index was measured double-blind on histological sections of tumor tissue by the number of stained cells counted at a particular microscope magnification.
a. Examine the association in a graph. What is the trend? Do the data look bivariate normal?
b. Using an appropriate method, and without transforming the data, test whether there is an association between MIB-1 index and tumor size.

Tumor size (mm) MIB-1 index
10 13 15 20 20 20 21 23 25 25 26 30 30 35 35 35 40 45 47 1 39 7 154 141 26 41 1 7 24 67 1 27 1 19 42 37 2 1 70 130 130 23 93 32 10

INTERLEAF
Publication bias

When we read an article in a journal we respect, we tend to believe what we read. The paper has been carefully vetted by expert referees, and the decision to publish it was made by an editor who is likely among the best scientists in the field. We might look at the authors of the paper and at the methods section, and we often find that both are above reproach. Very often the authors will have used statistical analysis, and usually it seems to be done correctly. We see that P is less than 0.05, or even less than 0.01, and we interpret that to mean that there was a low probability of getting results as extreme as those observed if the null hypothesis were true. When we've finished reading, we've just learned something. Haven't we?

It turns out that the papers that are actually published, especially those published in the "better quality," more widely read journals, are not a random sample of all studies done. This is true in at least two ways. First, the papers that get published are, on average, reporting on science that is done better than those submitted to journals but rejected by the editors.
This is as it should be, for far too much science is done poorly and journal space is limited. Second, papers that are more "interesting" get published more often than boring papers. Again, there is nothing intrinsically wrong with this, but it can raise a serious problem: papers that do not reject the null hypothesis of no effect are usually thought, by most editors and even most authors, to be less interesting than those that do reject the null hypothesis. Thus, papers that reject the null hypothesis are more likely to be published than those that do not. Moreover, papers that describe large effects of an experiment are more interesting than those showing smaller effects, so published papers tend to show larger effects than unpublished studies.

The odds of publishing a study whose main outcome was a P-value less than 0.05 are about 2.5 times higher than those of studies obtaining P > 0.10.

As a result, the science that gets published is a biased selection of all science done. The difference between the true effect and the average effect published in journals is called publication bias. Publication bias can seriously skew our perception of nature. It is difficult to quantify, though, because we don't know much about the research that isn't published, such as how much there is and what results were obtained. After all, publication is how we normally find out about research.

One way to detect publication bias takes advantage of the fact that scientists must file for permits from an ethics review board to do research on humans in research hospitals. Records of these reviews allow people studying publication bias to know how many studies were carried out, so that they can follow up on the results and publication outcome of each study. Such reviews consistently find that the odds of publishing a study whose main outcome obtained a P-value less than 0.05 are about 2.5 times higher than those of studies obtaining P > 0.10 (Easterbrook et al. 1991; Dickersin et al. 1992). Moreover, researchers are slower to publish nonsignificant results (P > 0.05) when they do publish them (Stern and Simes 1997). Studies with P < 0.05 are more likely to get into more widely read journals (Ioannidis et al. 1997), and subsequently, they are more likely to be cited by other papers (Gøtzsche 1987). These findings are disturbing—the papers that we read are not necessarily representative of the truth.

Another, perhaps more troubling, source of evidence about the possibility of publication bias comes from statistical analyses of drug trials according to their funding source. Most analyses of this sort have found that studies funded by drug-manufacturing companies are about 3.5 times more likely to yield a result favorable to the company than are publicly funded studies (Melander et al. 2003; Leopold et al. 2003; Bekelman et al. 2003). The implication is that research is unlikely to be published if it reflects poorly on the interests of the company funding it.

A funnel plot showing the results of 140 studies that measured the association between left-right asymmetry and male mating success. The effect size is the correlation coefficient between asymmetry and mating success. Data from Palmer (1999).

Small studies finding minor or nonsignificant effects are more likely to be left unpublished than are large studies with similarly weak results. Perhaps scientists who have done a large study are more determined to publish whatever the result, to get some payback for all their work.
On the other hand, a researcher who has carried out a small study and gets inconclusive results is quite likely to assume that the study had insufficient power and leave it unpublished. If publication bias is present, therefore, there should be a relationship between publication, the size of the effect, and the sample size, as depicted in a funnel plot. A funnel plot is a scatter plot of the magnitude of the effect detected in published studies and their corresponding sample sizes.

The figure in this interleaf is a funnel plot showing the results of 140 published studies, each of which examined the relationship between the mating success of individual males in a study species and the degree of left-right asymmetry in a male trait (Palmer 1999). In studies of mating success in flies, for example, asymmetry might measure the absolute value of the difference between the lengths of the left and right wings. In human studies, asymmetry might measure the difference in proportions of the two sides of the face. The 140 studies devoted solely to one male feature might seem excessive, but the causes of romantic success in nature are of great interest to biologists. Most of the 140 studies followed on the heels of claims that the symmetry of traits may be even more important than the traits themselves in explaining why, in nature and in human societies, some males get more than their fair share of mates and others get less. But does asymmetry really matter?

Each point in the figure is from a different study. The x-axis gives the sample size of each study, whereas the y-axis gives the "effect size," which in this case is the estimated correlation coefficient between asymmetry and mating success. A negative effect size means that males with greater asymmetry had lower mating success than males with less asymmetry. The plot combines studies of many types of animals carried out by many researchers in many jungles, shopping malls, and laboratories. The solid horizontal line in the funnel plot represents the null hypothesis tested in every study—that the true correlation coefficient between male asymmetry and mating success is zero. The gold curves mark the critical values for tests of the null hypothesis. Points falling outside these bounds are statistically significant at α = 0.05. The dashed horizontal line marks the average of the observed effect sizes of all 140 published studies.

This funnel plot is highly revealing. For example, note that the range of published correlation coefficients is broad when the sample size is small and narrow when the sample size is large. This is expected, though, because larger sample sizes should yield more precise estimates (see Chapter 4). This expectation gives the funnel plot its name. Other features of the funnel plot are unexpected and are cause for concern. In the first place, very few small studies yielded estimates close to the average effect size. Instead, most small studies found very large effects, in contrast to the larger studies, which tended to find smaller effects. Even more disturbing, many results for the smaller studies are statistically significant, clumping outside the lower critical value for significance (indicated in the figure by the lower gold curve). Again, this finding contrasts with the largest studies, most of which found no statistically significant effect. What is behind these unexpected patterns? The probable answer is that most small studies—those finding weak and nonsignificant effects—were never published.
As a result, the published papers are not representative of all studies done. There must be many studies with weaker, nonsignificant effects still sitting in file drawers and on hard drives in universities around the world, never to see the light of day. Another implication is that the average effect size of all of the published studies is overestimated, so reading just the published papers gives a biased view of the strength of the relationship between asymmetry and male mating success. Symmetry apparently gives the hopeful male at most a slight edge in romance.

As mentioned previously, the more interesting a result is, the more likely it is to be published in a high-profile journal. One of the things that makes a result interesting is an effect that is stronger than previously believed. This can occur because our previous assumptions were wrong, but it can also occur because the effect was overestimated. As a consequence, dramatic claims in published papers often turn out later to be exaggerated or even false. This is not always or even usually the result of bad science, but rather is due to publication bias. The lesson to take home about publication bias is that flashy new results of published studies should always be repeated, preferably by different researchers working with larger sample sizes and bent upon publishing no matter what the outcome. In this way, the excesses of publication bias can be detected and corrected, yielding more accurate views of the patterns in nature.

17
Regression

Regression is the method used to predict values of one numerical variable from values of another. For example, the scatter plot on the following page shows how genetic diversity in a local contemporary human population is predicted by its dispersal distance from East Africa by fitting a straight line to the data points.1 Modern humans emerged from Africa around 60,000 years ago, and our ancestors lost some genetic variation at each step as they spread to new lands.

Regression is a method that predicts values of one numerical variable from values of another numerical variable.

The line fitted to the data is the regression line. The line can be used to predict the genetic diversity of a local human population (the response variable), even for a locale not included in this study, based on its dispersal distance from East Africa (the explanatory variable). The slope or steepness of the regression line indicates the rate of change of genetic diversity with distance. It shows that humans lose 0.076 units of genetic diversity, about 10% of the maximum, with every 10,000 km of distance from East Africa. Both features of the relationship are captured in the equation for the line. In this chapter, we show how to estimate the regression line, how to put bounds on its predictions, and how to test hypotheses about the slope. Our focus is linear regression, but we also introduce some general principles of nonlinear regression.

Like correlation (Chapter 16), linear regression measures aspects of the linear relationship between two numerical variables. However, there are important differences. Regression fits a line through the data to predict one variable from another and to measure how steeply one variable changes with changes in the other. Correlation does none of these things. It measures the strength of association between two variables, reflecting the amount of scatter in the data. Regression is used on data from either of two study designs. In the first, individuals are randomly sampled from a population.
Two variables are measured on the sample, and one of them (deemed the explanatory variable) is used to predict the other (response) variable. In the second design, the researcher fixes or chooses values of the explanatory variable, which represent treatments or doses. The response variable is then measured on one or more individuals assigned to each treatment. The calculations are the same in both cases. Examples 17.1 and 17.3 in this chapter illustrate these two designs.

Linear regression

The most common type of regression is linear regression, which draws a straight line through the data to predict the response variable (Y, shown on the vertical axis) from the explanatory variable (X, shown on the horizontal axis). One important assumption of the linear regression method is that the relationship between the two variables really is linear. Example 17.1 shows how to use linear regression to predict the value of a response variable.

EXAMPLE 17.1 The lion's nose

Managing the trophy hunting of African lions is an important part of maintaining viable lion populations. Knowing the ages of the male lions helps, because removing males older than six years has little impact on lion social structure, whereas taking younger males is more disruptive. Whitman et al. (2004) showed that the amount of black pigmentation on the nose of male lions increases as they get older and so might be used to estimate the age of unknown lions. The relationship between age and the proportion of black pigmentation on the noses of 32 male lions of known age in Tanzania is shown in the scatter plot in Figure 17.1-1. The raw data are listed in Table 17.1-1. We can use these data to predict a lion's age from the proportion of black on his nose.

FIGURE 17.1-1 Scatter plot of the known ages of 32 male lions (Y, vertical axis) and the proportion of black on their noses (X, horizontal axis).

TABLE 17.1-1 The proportion of black on the noses of 32 male lions of known age.

Proportion black   Age (years)
0.21               1.1
0.14               1.5
0.11               1.9
0.13               2.2
0.12               2.6
0.13               3.2
0.12               3.2
0.18               2.9
0.23               2.4
0.22               2.1
0.20               1.9
0.17               1.9
0.15               1.9
0.27               1.9
0.26               2.8
0.21               3.6
0.30               4.3
0.42               3.8
0.43               4.2
0.59               5.4
0.60               5.8
0.72               6.0
0.29               3.4
0.10               4.0
0.48               7.3
0.44               7.3
0.34               7.8
0.37               7.1
0.34               7.1
0.74               13.1
0.79               8.8
0.51               5.4

The scatter plot in Figure 17.1-1 puts age as the response variable (the vertical axis) and the proportion of black on the nose as the explanatory variable (the horizontal axis), rather than the reverse, because we want to predict age from the proportion of black, not the other way around.

The method of least squares

Many straight lines can be drawn through a scatter of points, so how do we find the "best" one? Ideally, we would find a line that leads to the most accurate predictions of Y from X. This is the line that has the smallest possible deviations in Y (the vertical axis) between the data points and the regression line (Figure 17.1-2).

FIGURE 17.1-2 Illustration of the deviations between the data and several possible regression lines (the heavy black lines) drawn through the data points originally plotted in Figure 17.1-1. Vertical lines are the deviations in Y between each point and the regression line. The line in the right panel is the least-squares regression line.

The least-squares regression line is the line for which the sum of all the squared deviations in Y is smallest.
We square the deviations from the regression line for the same reason that we square deviations from the mean when calculating an ordinary variance—to overcome the fact that some deviations are positive (the points above the regression line) and others are negative (the points below the regression line), which would cancel each other out in a simple average.

Formula for the line

The regression line through a scatter of points is described mathematically by the following equation:

$Y = a + bX$.

The symbol Y is the response variable (displayed on the vertical axis in a scatter plot), and X is the explanatory variable (the horizontal axis). The formula has two coefficients, a and b. The coefficient a is the Y-intercept, or just the intercept. Mathematically, a is the value of Y when X is zero (hence, it is the Y-value where the regression line "intercepts" the y-axis). Its units are the same as the units of Y. The right panel in Figure 17.1-3 shows two regression lines that have different intercepts. The coefficient b is the slope of the regression line. It measures how much Y changes per unit change in X. Its units are the ratio of the units of Y and X. If b is positive, then larger values of X predict larger values of Y. If b is negative, then larger values of X predict smaller values of Y. The first three panels of Figure 17.1-3 show the slope of the line when b is positive, negative, and equal to zero.

The slope of a linear regression is the rate of change in Y per unit of X.

FIGURE 17.1-3 Comparing the slope of a line when b is positive (far left), negative (left), and zero (right); comparing a line with a high intercept a and one with a low intercept a (far right).

Calculating the slope and intercept

Typically, you would use a computer to calculate the regression line, but we provide the formulas here for use with a calculator. The slope of the least-squares regression line is computed as2

$b = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}$,

where $\bar{X}$ and $\bar{Y}$ are the sample means of the two variables, and $X_i$ and $Y_i$ refer to the X and Y measurements of individual i. The top of this formula is the sum of products, something first encountered in Section 16.1. The bottom is the sum of squares for X. Shortcut formulas for these sums are given in the Quick Formula Summary (Section 17.10).

Once we have the slope b, getting the intercept is relatively straightforward, because the least-squares regression line always goes through the point $(\bar{X}, \bar{Y})$. As a result, we know that $\bar{Y} = a + b\bar{X}$. So, we find a by simple algebra:

$a = \bar{Y} - b\bar{X}$.

We can now use these formulas to calculate the coefficients of the least-squares regression line for the lion data in Example 17.1. First, though, we need the following quantities calculated from the data in Table 17.1-1:

$\bar{X} = 0.3222$, $\bar{Y} = 4.3094$, $\sum_i (X_i - \bar{X})^2 = 1.2221$, $\sum_i (Y_i - \bar{Y})^2 = 222.0872$, $\sum_i (X_i - \bar{X})(Y_i - \bar{Y}) = 13.0123$.

The slope is then

$b = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2} = \frac{13.0123}{1.2221} = 10.647$.

The slope b measures the change in age of male lions per unit increase in the proportion of black on the nose. Its units are years per unit proportion black. The intercept, in years, is

$a = \bar{Y} - b\bar{X} = 4.3094 - 10.647(0.3222) = 0.879$.

The formula for the line that predicts age from the proportion of black pigmentation on the nose in these lions can be written by putting all of this together, with appropriate rounding:

$Y = 0.88 + 10.65X$.

This equation could also be written as

Age = 0.88 + 10.65(proportion black).

Figure 17.1-4 shows what this line looks like when it is plotted on the scatter plot3 shown originally in Figure 17.1-1.
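The same slope and intercept can be computed directly from the raw data in Table 17.1-1; a short Python sketch (assuming NumPy is available):

```python
import numpy as np

# Lion data from Table 17.1-1: proportion of black on the nose (X) and age (Y).
prop_black = np.array([0.21, 0.14, 0.11, 0.13, 0.12, 0.13, 0.12, 0.18,
                       0.23, 0.22, 0.20, 0.17, 0.15, 0.27, 0.26, 0.21,
                       0.30, 0.42, 0.43, 0.59, 0.60, 0.72, 0.29, 0.10,
                       0.48, 0.44, 0.34, 0.37, 0.34, 0.74, 0.79, 0.51])
age = np.array([1.1, 1.5, 1.9, 2.2, 2.6, 3.2, 3.2, 2.9,
                2.4, 2.1, 1.9, 1.9, 1.9, 1.9, 2.8, 3.6,
                4.3, 3.8, 4.2, 5.4, 5.8, 6.0, 3.4, 4.0,
                7.3, 7.3, 7.8, 7.1, 7.1, 13.1, 8.8, 5.4])

sp = np.sum((prop_black - prop_black.mean()) * (age - age.mean()))  # sum of products
ss_x = np.sum((prop_black - prop_black.mean()) ** 2)                # sum of squares for X
b = sp / ss_x                              # slope, about 10.65
a = age.mean() - b * prop_black.mean()     # intercept, about 0.88
print(a, b)
```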
The slope of the line indicates that on average, lion age increases by 10.65 years per unit change in the proportion of the nose that is black. We can say, equivalently, that age goes up by 1.065 years for each 0.1 increase of black on the nose.

FIGURE 17.1-4 The regression line for the lion data from Example 17.1.

Populations and samples

The regression line is not just calculated for the sake of the data. It is typically used to estimate the true regression of Y on X in the population from which the data are a sample. The regression equation for the population is

$Y = \alpha + \beta X$,

where β is the slope in the population, and α is the intercept. The quantities α and β are population parameters, whereas a and b are their sample estimates. Under one sampling scenario for linear regression, we have a random sample of (X, Y) pairs of measurements from a population. The lion data correspond to this scenario. Or, under a second scenario, the researcher fixes or chooses values of X to include in the study, and Y is then measured on a sample of one or more individuals for each X-value included in the study. In either case, regression assumes that there is a population of possible Y-values for every possible value of X. The mean Y-value for each value of X lies on the true regression line. For example, one of the lions in the data for Example 17.1 has a value of 0.6 for the proportion of black on its nose. We assume that there is a population of lions having the value X = 0.6 for the proportion of black on their noses, even though the data include just one lion with that value. The mean age of all lions in the population having X = 0.6 is assumed to lie on the true regression line.

Predicted values

Now that we have the regression line, we can use it to determine points on the line that correspond to specified values of X. These points on the regression line are called predictions. We will symbolize predictions as $\hat{Y}$ ("Y-hat") to distinguish them from values of Y (i.e., actual data points), which lie above or below the line but not usually on it. The predicted value of Y for a given value of X estimates the mean of Y for the whole population of individuals having that value of X. For example, to predict the age of a male lion corresponding to a proportion of 0.50 black on the nose, plug the value X = 0.50 into the regression formula:

$\hat{Y} = a + b(0.50) = 0.88 + 10.65(0.50) = 6.2$.

In other words, the regression line predicts that lions with a proportion of black X = 0.50 will be 6.2 years old on average. If we observed a lion with 0.5 proportion black on its nose, we could predict its age, even though we had never seen a lion exactly like that before. According to Table 17.1-1, the value X = 0.50 was not represented in the sample, although it falls within the range of observed X-values (i.e., 0.10 to 0.79). For reasons that are explained in Section 17.2, we can reliably make predictions only by using values of X that lie within the range of values in the sample.

The predicted value of Y from a regression line estimates the mean value of Y for all individuals having a given value of X.
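In code, prediction is just evaluating the fitted line; a minimal sketch using the coefficients estimated above:

```python
a, b = 0.879, 10.647   # intercept and slope estimated from the lion data

def predicted_age(proportion_black):
    """Predicted mean age (in years) of lions with this proportion of black."""
    return a + b * proportion_black

print(predicted_age(0.50))   # about 6.2 years
```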
Residuals

Residuals measure the scatter of points above and below the least-squares regression line. They are crucial for evaluating the fit of the line to the data. Each observation in the data has a corresponding residual, measuring the vertical deviation from the least-squares regression line (see the right panel in Figure 17.1-2). The point on the regression line used to calculate the residual for individual i is $\hat{Y}_i$, the value predicted when its corresponding value for $X_i$ is plugged into the regression formula: $\hat{Y}_i = a + bX_i$. For example, the 31st lion in the sample (i = 31) has a proportion $X_{31} = 0.79$ of black on its nose (see Table 17.1-1). The corresponding age $\hat{Y}_{31}$ predicted for a lion with this much black on the nose is

$\hat{Y}_{31} = 0.88 + 10.65(0.79) = 9.29$.

The actual age of the lion was 8.8 years, which is below the predicted value. The residual is the observed value minus the predicted value:

$\mathrm{residual}_{31} = Y_{31} - \hat{Y}_{31} = 8.8 - 9.29 = -0.49$ years.

The variance of the residuals, symbolized as $\mathrm{MS}_{\mathrm{residual}}$, quantifies the spread of the scatter of points above and below the line. In regression jargon, this variance is called the "residual mean square":

$\mathrm{MS}_{\mathrm{residual}} = \frac{\sum_i (Y_i - \hat{Y}_i)^2}{n - 2}$.

The $\mathrm{MS}_{\mathrm{residual}}$ is like an ordinary variance, but it has n − 2 degrees of freedom4 rather than n − 1. It is analogous to the error mean square in the analysis of variance (Section 15.1). The following alternate formula is easier to use, though, because you don't need to calculate each $\hat{Y}_i$:

$\mathrm{MS}_{\mathrm{residual}} = \frac{\sum_i (Y_i - \bar{Y})^2 - b \sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{n - 2}$.

Shortcuts for calculating the sums of squares and products are provided in the Quick Formula Summary (Section 17.10). All of the quantities needed to determine $\mathrm{MS}_{\mathrm{residual}}$ for the lion data have been calculated previously on page 544. Inserting these values into the equation for $\mathrm{MS}_{\mathrm{residual}}$ yields

$\mathrm{MS}_{\mathrm{residual}} = \frac{222.0872 - 10.647(13.0123)}{32 - 2} = 2.785$.

Standard error of slope

Like any other estimate, there is uncertainty associated with the sample estimate b of the population slope β. Uncertainty is measured by the standard error, the standard deviation of the sampling distribution of b. The smaller the standard error, the higher the precision and the lower the uncertainty of the estimate of the slope. If the assumptions of linear regression are met (Section 17.5), then the sampling distribution of b is a normal distribution having a mean equal to β and a standard error estimated from the data as

$\mathrm{SE}_b = \sqrt{\frac{\mathrm{MS}_{\mathrm{residual}}}{\sum_i (X_i - \bar{X})^2}}$.

The quantity on top of the fraction under the square-root sign is the residual mean square, and the quantity on the bottom is the sum of squares for X. The standard error of b for the lion data is

$\mathrm{SE}_b = \sqrt{\frac{2.785}{1.2221}} = 1.510$.

The standard error of the slope has the same units as the slope itself (i.e., years per unit of proportion black for the lion data in Example 17.1).

Confidence interval for the slope

A confidence interval for the parameter β is given by

$b - t_{\alpha(2),\,df}\,\mathrm{SE}_b < \beta < b + t_{\alpha(2),\,df}\,\mathrm{SE}_b$,

where $t_{\alpha(2),\,df}$ is the two-tailed critical value of the t-distribution having df = n − 2 degrees of freedom. For a 95% confidence interval, α = 0.05, and for a 99% confidence interval, α = 0.01. For the lion data, $t_{0.05(2),30} = 2.042$ (Statistical Table C), so the 95% confidence interval for the slope is

$10.647 - 2.042(1.510) < \beta < 10.647 + 2.042(1.510)$
$7.56 < \beta < 13.73$.

This is a modest range of most-plausible values for the slope. The mean age of lions increases by as little as 7.6 years per unit proportion of black on the nose, or by as much as 13.7 years.
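The residual mean square, the standard error of the slope, and the confidence interval can all be computed from the same summary quantities; a sketch (assuming SciPy for the critical value):

```python
import math
from scipy import stats

# Summary quantities for the lion data, calculated earlier.
n = 32
ss_x = 1.2221      # sum of squares for X
ss_y = 222.0872    # sum of squares for Y
sp = 13.0123       # sum of products
b = sp / ss_x      # slope, about 10.647

ms_residual = (ss_y - b * sp) / (n - 2)        # about 2.785
se_b = math.sqrt(ms_residual / ss_x)           # about 1.510
t_crit = stats.t.ppf(0.975, df=n - 2)          # about 2.042
print(b - t_crit * se_b, b + t_crit * se_b)    # about 7.56 to 13.73
```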
Confidence in predictions

The regression line calculated from data predicts the mean value of Y for any specified value of X lying between the smallest and largest X in the data. This line is calculated with error, however, which affects how precise the predictions are. Here in Section 17.2 we quantify the precision of predictions. We also discuss the hazards of extrapolating—making predictions when the values of X lie beyond the range of X-values in the data.

Confidence intervals for predictions

Two subtly different types of predictions can be made using the regression line. The first predicts the mean Y for a given X. What, for example, is the mean age of all male lions in the population whose noses are 60% black (i.e., X = 0.60)? The second type predicts a single Y for a given X. (For example, how old is that lion over there, given that 60% of its nose is black?) Usually we just want to predict the mean Y for each X (i.e., the first prediction) because we are interested in the overall trend. In special situations, though, we also want to predict an individual Y-value (i.e., the second prediction). This is especially true in the lion study (Example 17.1). A hunter who encounters a male lion would want to know the age of that specific lion if he or she wishes to avoid shooting a young male.

Both types of predictions generate the same value for $\hat{Y}$. They differ in the precision of the predictions. In the case of lions with 60% black on their noses,

$\hat{Y} = a + bX = 0.88 + 10.65(0.60) = 7.27$ years.

Regardless of the prediction goals, this is the best prediction of age. The precision of the prediction is lower, however, if the goal is to predict the age of an individual lion rather than the mean age of lions having the specified proportion of black on their noses. This is because the prediction for a single Y-value includes uncertainty stemming from variation in Y among the individuals in the population having the same value of X (i.e., not all male lions having 60% black noses are the same age). The two graphs in Figure 17.2-1 illustrate these differences in precision of predictions using confidence intervals.

FIGURE 17.2-1 Left: 95% confidence bands for the predicted mean age of male lions at every value of proportion of black on their noses. Right: 95% prediction intervals for the predicted age of single lions. n = 32.

The left panel of Figure 17.2-1 shows the 95% confidence intervals for the predicted mean lion age at every X. The upper curve connects the upper bounds of all of the 95% confidence intervals for the predicted mean Y-values, one for every X between the smallest and largest X in the data. The lower curve connects the lower bounds of these same confidence intervals. Together the upper and lower curves showing the confidence intervals for the mean Y are called the 95% confidence bands. These bands are narrowest in the vicinity of $\bar{X}$, the mean value for proportion of black on the nose, and they flare outward toward the extremes of the range of data. The uncertainty of predictions always increases the farther the X-value is from the mean X in the data. In 95% of samples, the confidence bands will bracket the true regression line in the population.

The right panel of Figure 17.2-1 shows the 95% prediction intervals. The upper and lower curves connect the upper and lower limits of the 95% prediction intervals for a single Y over the range of X-values in the data. These are much wider than the confidence bands because predicting an individual lion's age from the color of its nose is more uncertain than predicting the mean age of all lions having the same proportion of black on their noses. Prediction intervals bracket most of the individual data points in the sample, because they incorporate the variability in Y from individual to individual at a given X.

Confidence bands measure the precision of the predicted mean Y for each value of X. Prediction intervals measure the precision of the predicted single Y-values for each X.

Most statistical packages on the computer will calculate and display confidence bands and prediction intervals. We haven't given calculation details, but we provide the formulas in the Quick Formula Summary (Section 17.10).
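For reference, the standard textbook formulas for these two kinds of intervals (which should match those in the Quick Formula Summary) can be sketched as follows. At a given value X₀, the prediction interval simply adds the residual variance of individuals to the uncertainty of the estimated mean:

```python
import numpy as np
from scipy import stats

# Lion-data quantities from Section 17.1.
n, xbar = 32, 0.3222
ss_x, ms_residual = 1.2221, 2.785
a, b = 0.879, 10.647
t_crit = stats.t.ppf(0.975, df=n - 2)   # two-tailed critical value, alpha = 0.05

def intervals(x0):
    """95% confidence interval for the mean Y, and 95% prediction
    interval for a single Y, at the value x0 of X."""
    y_hat = a + b * x0
    se_mean = np.sqrt(ms_residual * (1/n + (x0 - xbar)**2 / ss_x))
    se_single = np.sqrt(ms_residual * (1 + 1/n + (x0 - xbar)**2 / ss_x))
    return ((y_hat - t_crit * se_mean, y_hat + t_crit * se_mean),
            (y_hat - t_crit * se_single, y_hat + t_crit * se_single))

print(intervals(0.60))   # the prediction interval is much wider
```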
Extrapolation

We've stressed that regression can be used to predict Y for any value of X lying between the smallest and largest values of X in the data set. Regression cannot be used to predict the value of the response variable when an X-value lies well outside the range of the data. This is because there is no way to ensure that the relationship between X and Y continues to be linear beyond the range of the data. Predicting Y for X-values beyond the range of the data is called extrapolation. The graph in Figure 17.2-2 illustrates the problem.

Extrapolation is the prediction of the value of a response variable outside the range of X-values in the data.

FIGURE 17.2-2 Ear lengths of 206 adults 30 years old or more as a function of their ages. Modified from Heathcote (1995).

The data are measurements of ear length taken on a sample of adults at least 30 years old (Heathcote 1995). The linear regression equation calculated from these data (in millimeters) is

ear length = 55.9 + 0.22(age).

The results suggest that our ears grow longer by about 0.22 mm per year on average as we age. The intercept of this equation, which predicts the ear length at birth (i.e., when age is zero), is 56 mm. This makes no sense, though. To quote the authors of the study (Altman and Bland 1998), "A baby with ears 5.6 cm long would look like Dumbo." The relationship between ear length and age is not linear from birth, but we wouldn't know this unless we took measurements over the complete range of ages.

17.3 Testing hypotheses about a slope

Hypothesis testing in regression is used to evaluate whether the population slope equals a null hypothesized value, β0, which is typically (but not always) zero. The test statistic t is

$t = \frac{b - \beta_0}{\mathrm{SE}_b}$,

where b is the estimate of the slope in the sample and $\mathrm{SE}_b$ is the standard error of b. Under the null hypothesis, this test statistic has a t-distribution with n − 2 degrees of freedom. Example 17.3 shows how to use this test.

EXAMPLE 17.3 Prairie Home Campion

Human activity is reducing species numbers in many of the world's ecosystems. Does this decrease affect basic ecosystem properties? Or are different plant species largely substitutable, with lost species compensated by those species remaining? To find out, Tilman et al. (2006) seeded 161 plots, each measuring 9 × 9 meters, at the Cedar Creek Reserve in Minnesota. They used a varying number of prairie plant species and measured plant biomass production over 10 subsequent years. Treatments of either 1, 2, 4, 8, or 16 plant species (randomly chosen from a set of 18 perennials) were randomly assigned to plots. After 10 years of measurement, the researchers measured the "stability" of plant biomass production in every plot as the mean biomass divided by the standard deviation in biomass over the 10 years (the reciprocal of the coefficient of variation; Section 3.1). Results are plotted in Figure 17.3-1. Stability has been log-transformed to reduce skew. The data are available at whitlockschluter.zoology.ubc.ca.

FIGURE 17.3-1 Stability of plant biomass production over 10 years in 161 plots and the initial number of plant species assigned to plots.
Unlike the previous example, involving lions, the data in Example 17.3 are from an experiment in which the values of the explanatory variable were fixed treatments. In contrast to correlation, regression does not require the explanatory variable to follow a normal distribution.

The t-test of regression slope

The null hypothesis is that the measure of ecosystem stability cannot be predicted from the species number treatment—that is, the slope of the linear regression of ecosystem stability on number of species is zero. The alternative hypothesis is that stability either increases or decreases with increasing number of species. This is a two-tailed test.

H0: The slope of the regression of log ecosystem stability on species number is zero (β = 0).
HA: The slope of the regression of log ecosystem stability on species number is not zero (β ≠ 0).

The following quantities calculated from the data are needed for the t-test of zero slope:

$\bar{X} = 6.3168$, $\bar{Y} = 1.4063$, $\sum_i (X_i - \bar{X})^2 = 5088.8447$, $\sum_i (Y_i - \bar{Y})^2 = 24.8149$, $\sum_i (X_i - \bar{X})(Y_i - \bar{Y}) = 167.5548$, $n = 161$.

The best estimate of the slope is

$b = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2} = \frac{167.5548}{5088.8447} = 0.03293.$

Calculating the intercept as before, and rounding, we get the least-squares regression line as Y = 1.20 + 0.033X, which can also be written as

Log stability = 1.20 + 0.033(number of species).

This line has a positive slope, as shown in Figure 17.3-1. The estimate of slope (b = 0.033) indicates that log stability of biomass production rises by the amount 0.033 for every species added to plots. To calculate the standard error of the slope, we need the mean square residual:

$\mathrm{MS}_{\text{residual}} = \frac{\sum_i (Y_i - \bar{Y})^2 - b \sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{n - 2} = \frac{24.8149 - 0.03293(167.5548)}{161 - 2} = 0.12137.$

Thus, the standard error of b is

$\mathrm{SE}_b = \sqrt{\frac{\mathrm{MS}_{\text{residual}}}{\sum_i (X_i - \bar{X})^2}} = \sqrt{\frac{0.12137}{5088.8447}} = 0.004884.$

We now have all of the elements needed to calculate the t-statistic:

$t = \frac{b - \beta_0}{\mathrm{SE}_b} = \frac{0.03293 - 0}{0.004884} = 6.74.$

We must compare this t-statistic with the t-distribution having df = n − 2 = 161 − 2 = 159 degrees of freedom. Using a computer, we find that t = 6.74 corresponds to P = 2.7 × 10⁻¹⁰, so we reject the null hypothesis. We reach the same conclusion if we use the critical value for the t-distribution with df = 159 (Statistical Table C): $t_{0.05(2),159} = 1.97$. Since t = 6.74 is greater than 1.97, P < 0.05, and we reject H0. In other words, increasing the number of plant species in plots increases the stability of plant biomass production of the ecosystem. A 95% confidence interval for the population slope, calculated from the formula in Section 17.1, is

0.0233 < β < 0.0426,

indicating that the estimate of slope has fairly tight bounds.
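For readers who want to check these numbers by computer, the following Python sketch (one option among many statistical packages) recomputes the slope, its standard error, and the t-test directly from the summary quantities given above.

```python
# Sketch: the slope t-test of Example 17.3, recomputed from the summary
# quantities reported in the text.
import numpy as np
from scipy import stats

n = 161
ss_x = 5088.8447        # sum of (X - Xbar)^2
ss_y = 24.8149          # sum of (Y - Ybar)^2
sp_xy = 167.5548        # sum of (X - Xbar)(Y - Ybar)

b = sp_xy / ss_x                             # slope: 0.03293
ms_residual = (ss_y - b * sp_xy) / (n - 2)   # 0.12137
se_b = np.sqrt(ms_residual / ss_x)           # 0.004884
t = (b - 0) / se_b                           # 6.74
p = 2 * stats.t.sf(abs(t), df=n - 2)         # two-tailed P, about 2.7e-10
print(b, se_b, t, p)
```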
The ANOVA approach

In the literature, and in the output of regression analyses conducted on the computer, you will encounter tests of regression slopes that use an F-test rather than a t-test. Just as ANOVA can be used to compare two population means in place of the two-sample t-test, ANOVA can be used to test for a significant slope in place of the t-test of slope. The resulting P-values are identical. Table 17.3-2 shows the ANOVA table for the ecosystem stability data. Formulas for the quantities are given in the Quick Formula Summary (Section 17.11).

TABLE 17.3-2 ANOVA table testing the effect of plant species number on the stability of biomass production.

Source of variation   Sum of squares    df    Mean squares    F-ratio    P
Regression                 5.5169         1        5.5169       45.45    2.73 × 10⁻¹⁰
Residual                  19.2980       159        0.1214
Total                     24.8149       160

The basic idea behind the ANOVA approach in regression is similar to that when testing differences among means of multiple groups (Chapter 15). If the null hypothesis is true, then the population regression line is flat with a slope of 0. In this case, the amount of variation in Y among individual data points having the same value for X (represented by the residual mean square) is expected to equal the amount of variation among data points having different X-values (represented by the regression mean square), except by chance. If the null hypothesis is false, we expect the regression mean square to exceed the residual mean square. The comparison of mean squares is done with an F-ratio.

The first step to estimating these two sources of variation in the data is to take the deviation between each Y-measurement Yi and the grand mean $\bar{Y}$ and break it into two parts. The residual part is the deviation between Yi and its predicted value on the regression line (i.e., $Y_i - \hat{Y}_i$, analogous to the “error” component in ANOVA). The regression part, on the other hand, is the difference between the predicted value for each point and $\bar{Y}$ (i.e., $\hat{Y}_i - \bar{Y}$, analogous to the “groups” component in ANOVA). The sums of squared deviations corresponding to each of these two sources of variation and their total are computed and used to calculate the mean squares. The test statistic is an F-ratio of the two mean squares (the mean square regression divided by the mean square residuals). If the null hypothesis is true, and the slope of the population regression β is zero, then the F-ratio is expected to be 1 (except by chance). If the slope of the regression is not zero, however, then F is expected to be greater than 1. The ANOVA approach can be used when the test is two-sided and the null hypothesized slope is zero.

Using R² to measure the fit of the line to data

We can measure the fraction of variation in Y that is “explained” by X in the estimated linear regression with the quantity R²:

$R^2 = \frac{SS_{\text{regression}}}{SS_{\text{total}}}.$

R² is calculated from the sums of squares in the ANOVA table, and it is analogous to the R² in analysis of variance (Section 15.1), which measures the fraction of variation in the sample of Y-values accounted for by differences between groups. If R² is close to one (i.e., its maximum possible value), then X predicts most of the variation in the values of Y. In this case, the Y-observations will be clustered tightly around the regression line with little scatter. If R² is close to zero (i.e., its minimum value), then X does not predict much of the variation in Y, and the data points will be widely scattered above and below the regression line. For the ecosystem stability study, R² can be calculated from the quantities in the ANOVA table, Table 17.3-2:

$R^2 = \frac{5.5169}{24.8149} = 0.222.$

Thus, number of plant species explained 22% of the variation in log-transformed ecosystem stability, a moderate percentage.
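The F-ratio and R² can be recovered from the sums of squares alone. A minimal sketch, using the values in Table 17.3-2:

```python
# Sketch: F-ratio and R^2 from the sums of squares in Table 17.3-2.
ss_regression = 5.5169
ss_residual = 19.2980
ss_total = ss_regression + ss_residual                 # 24.8149

f_ratio = (ss_regression / 1) / (ss_residual / 159)    # about 45.45 (= 6.74 squared)
r_squared = ss_regression / ss_total                   # about 0.222
print(f_ratio, r_squared)
```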
17.4 Regression toward the mean

Suppose a study measured the cholesterol levels of 100 men randomly chosen from a population. After their initial measurement, the men were put on a new drug therapy, designed to reduce cholesterol levels. After one year, the cholesterol level of each man was measured again and compared with the first measurement. Figure 17.4-1 shows a scatter plot of the results. The researchers were delighted to find that cholesterol levels had dropped on average in the men who had previously had the highest levels. Their excitement dimmed, though, when they realized that cholesterol levels had increased on average in the men who had previously had low levels of cholesterol. What happened? Had they discovered a drug with complex effects?

In this hypothetical example, there was no effect at all due to the drug—the trend resulted entirely from a general phenomenon known as regression toward the mean. If two variables measured on a sample of individuals, such as consecutive measures of cholesterol, have a correlation less than one, then individuals that are far from the mean on the first measure will on average lie closer to the mean for the second measure. Even without an effect of the drug treatment, average cholesterol levels of the men with the highest levels on the first measure were expected to drop by the second measurement, and average levels of the men who originally had low levels were expected to rise.

Regression toward the mean results when two variables measured on a sample of individuals have a correlation less than one. Individuals that are far from the mean for one of the measurements will, on average, lie closer to the mean for the other measurement.

FIGURE 17.4-1 Regression toward the mean. These hypothetical data are two cholesterol measurements taken on the same 100 men. The dashed line is the one-to-one line with a slope of one. The solid line is the regression line predicting the second measure from the first. It has a slope less than one, as indicated by the blue arrows.

Regression toward the mean is a tricky concept, which is perhaps why it is often overlooked. Think of it this way: each of the men in the study has a “true,” underlying cholesterol value, but his “measured” cholesterol value varies randomly with time and circumstance around the true value. The subset of men who scored highest on the first measurement therefore likely included a disproportionate number of men whose cholesterol measurement was higher than its true value the first time. The second measurement made on each of these men is expected to be closer to its true value on average, bringing down the average for the subset of men as a whole. Similarly, the subset of men who initially scored lowest likely included a disproportionate number of men whose measured values were lower than their true values, so on the second measurement they would seem to improve.

Regression toward the mean is potentially a large problem in any study that tends to focus on individuals in one tail of the distribution. In many medical studies, for example, only sick people are included in the research, as indicated by their initial assessment before the study. Because of regression toward the mean, many of these people will appear to improve even if the treatment has no effect. Interpreting this improvement as if it were a response to the treatment, instead of a mathematical fact of regression, is called the regression fallacy. It is one of the reasons that experiments should always include a control group for comparison. Regression toward the mean is an issue only in observational studies, not in randomized experiments, where the value of the explanatory variable (X) is set by the experimenter. Kelly and Price (2005) discuss ways to disentangle biologically meaningful trends from the effect of regression toward the mean.
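A short simulation makes the phenomenon concrete. In this hypothetical Python sketch there is no treatment at all, only a fixed true value per individual plus independent measurement noise, yet the men highest on the first measurement appear to “improve” on the second.

```python
# Sketch: simulating regression toward the mean with no treatment effect.
import numpy as np

rng = np.random.default_rng(1)
true = rng.normal(200, 20, size=100)           # each man's true cholesterol
first = true + rng.normal(0, 15, size=100)     # measurement 1 = true + noise
second = true + rng.normal(0, 15, size=100)    # measurement 2 = true + noise

top = first > np.quantile(first, 0.9)          # men highest on the first measure
# Their second measurement is, on average, closer to the population mean (200).
print(first[top].mean(), second[top].mean())
```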
17.5 Assumptions of regression

When using linear regression, the following assumptions must be met for confidence intervals and hypothesis tests to be accurate:
■ At each value of X, there is a population of possible Y-values whose mean lies on the true regression line (this is the assumption that the relationship must be linear).
■ At each value of X, the distribution of possible Y-values is normal.
■ The variance of Y-values is the same at all values of X.
■ At each value of X, the Y-measurements represent a random sample from the population of possible Y-values.

Figure 17.5-1 illustrates the first three of these assumptions. In the next few sections, we explore some of the ways to examine deviations from these assumptions. We also discuss methods to try when the assumptions are not supported by the data.

FIGURE 17.5-1 Illustration of the assumptions of linear regression. At each value of X, there is a normally distributed population of Y-values with the mean on the true regression line. The variance of the Y-values is assumed to be the same for every value of X.

Unlike correlation analysis, no assumptions are made about the distribution of X when using regression. In regression, for example, it is not necessary that the distribution of X-values is normal. It is not even necessary that X-values are randomly sampled—they might be fixed by the experimenter, instead, as they were in the ecosystem stability study (Example 17.3).

Outliers

Besides creating a non-normal distribution of Y-values at the corresponding value of X, and violating the assumption of equal variance in Y, outliers disproportionately affect estimates of the regression slope and intercept. If an outlier is present, biologists usually examine and report its influence on the results by comparing the regression line produced with and without the outlier. For example, Figure 17.5-2 shows how the average amount of white in the tails of dark-eyed juncos varies with latitude. One outlier is present, however, indicated in red (a population that formed in 1983 on the campus of the University of California, San Diego). Without the outlier, the estimate of slope is b = −0.37 (black line in Figure 17.5-2), and the null hypothesis of zero slope is rejected (t = −2.66, P = 0.024). Including the outlier changes the slope substantially (red line; b = −0.18), and the null hypothesis of zero slope is not rejected (t = −0.81, P = 0.43).

FIGURE 17.5-2 Graph showing the effect of an outlier on an estimate of the regression line. The data are the percentage of white in the tail feathers of the dark-eyed junco at sites at different latitudes in California (Yeh 2004). The black regression line was calculated after excluding the red point on the lower left, whereas the red regression line included it.

FIGURE 17.5-3 A scatter plot showing the relationship between the number of fish species and the surface area of 20 desert pools (Kodric-Brown and Brown 1993). The left panel fits a linear regression to the data to highlight how poorly a straight line matches the data. The right panel adds a “smoothed” fit to the same data (see Section 17.8).

Outliers are especially likely to be influential if they occur at or beyond the range of X-values in the rest of the data. If the outlier has a large effect, then alternative approaches might also be sought. One approach is to transform X or Y, such as by taking logarithms, to see if this brings the outlier closer to the rest of the distribution.
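The fit-with-and-without comparison described above is easy to automate. A sketch with made-up data (not the junco measurements):

```python
# Sketch: reporting an outlier's influence by refitting the line without it.
import numpy as np
from scipy import stats

x = np.array([33.0, 34.5, 36.0, 37.5, 39.0, 40.5, 42.0, 32.7])
y = np.array([48.0, 45.0, 43.5, 40.0, 38.5, 36.0, 34.0, 15.0])  # last point is an outlier

with_outlier = stats.linregress(x, y)
without_outlier = stats.linregress(x[:-1], y[:-1])
print("with outlier:    b =", round(with_outlier.slope, 2), " P =", round(with_outlier.pvalue, 3))
print("without outlier: b =", round(without_outlier.slope, 2), " P =", round(without_outlier.pvalue, 4))
```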
Further solutions include robust regression methods (Rousseeuw and Leroy 2003) and permutation testing (Chapter 13).

Detecting nonlinearity

Visually inspecting the scatter plot is a useful method for detecting departures from the assumption of a linear relationship between Y and X. Often, this approach is enough to conclude that the relationship between X and Y is not linear. Forcing a linear regression through the scatter plot can sometimes make the nonlinearity even more obvious (see the left panel of Figure 17.5-3). Scatter-plot “smoothing,” a method discussed in greater detail in Section 17.8, can also aid the eye in detecting a nonlinear relationship (see the right panel of Figure 17.5-3). Most statistics packages on the computer are able to carry out scatter-plot smoothing.

Detecting non-normality and unequal variance

It is often difficult to decide whether the assumptions of normally distributed residuals and equal variance of residuals are met. Visual inspection of a residual plot can help. In a residual plot, the residual for every data point ($Y_i - \hat{Y}_i$) is plotted against Xi, the corresponding value of the explanatory variable. This plot is best made with the aid of a computer. Two examples of residual plots are shown in Figure 17.5-4.

A residual plot is a scatter plot of the residuals ($Y_i - \hat{Y}_i$) against Xi, the values of the explanatory variable.

FIGURE 17.5-4 Two examples of residual plots. Data on the left are from a linear regression of the cap color of offspring on that of their parents in the blue tit, a British bird (Hadfield et al. 2006). Those on the right are from a linear regression of firing rates of cockroach neurons on temperature (Murphy and Heath 1983).

If the assumptions of normality and equal variance of residuals are met, then the residual plot should have all of the following features:
■ A roughly symmetric cloud of points above and below the horizontal line at zero, with a higher density of points close to the line than away from the line
■ Little noticeable curvature as we move from left to right along the x-axis
■ Approximately equal variance of points above and below the line at all values of X

The blue tit data in the left panel of Figure 17.5-4 fit these requirements reasonably well. The density of observations peaks near the horizontal line and spreads outward above and below in a fairly symmetrical fashion. The spread of points above and below the line is similar across the range of X-values. (The spread may seem low at the extreme left end, but we can’t tell because there are only two data points.) The cockroach data in the right panel of Figure 17.5-4 do not fit these requirements as well. The spread of points above and below the horizontal line is considerably higher at low values of X than at high values of X. Normal quantile plots (Section 13.1) and histograms of the residuals are yet other ways to evaluate the assumption that the residuals are normally distributed.
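A residual plot takes only a few lines to produce. A sketch, assuming matplotlib is available and using simulated data:

```python
# Sketch: making a residual plot to check normality and equal variance.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 50)

fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)   # Y_i minus Y-hat_i

plt.scatter(x, residuals)
plt.axhline(0)      # look for symmetry, no curvature, and equal spread
plt.xlabel("X")
plt.ylabel("Residual")
plt.show()
```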
17.6 Transformations

Some (but not all) nonlinear relationships can be made linear with a suitable transformation. The most versatile transformation in biology is the log transformation (Section 13.3). The power and exponential relationships, which are commonly encountered in biology, are two nonlinear relationships that can be made linear with a log transformation, as shown in Figure 17.6-1.

FIGURE 17.6-1 The power function (upper left) and the exponential function (lower left) are two types of nonlinear relationships that can be made linear using the log transformation. Plot the log of Y against the log of X if the two variables are described by a power function (upper right). Plot the log of Y against X if the relationship between these two variables is described by an exponential function (lower right).

For example, the relationship between the number of fish species in 20 desert pools and the surface area of pools (Figure 17.5-3) looks like it might fit a power curve or an exponential curve. We tried a log-transformation of both the number of species and the surface area, and we obtained the graph shown in Figure 17.6-2. The straight line fits the transformed data much better than the untransformed data. We can now proceed to estimate parameters and test hypotheses about this relationship using the methods of linear regression, setting Y to be the log of the number of fish species and X to be the log of the surface area of the pools.

FIGURE 17.6-2 A scatter plot of the log-transformed number of fish species and surface area of desert pools. This relationship is more linear than the one in Figure 17.5-3, which is based on the untransformed data.

Transformation can also be used to help meet other assumptions of linear regression. For example, if a residual plot reveals that the variance of Y increases with increasing X, then transforming Y can often improve matters (it may be necessary, though, to transform X as well to keep the relationship linear). For example, the number of pollen grains received by flowers of the iris Lapeirousia anceps, which is pollinated by long-proboscid flies, increases with increasing flower tube length (Pauw et al. 2009). In a residual plot from a regression using the untransformed data (see the left panel of Figure 17.6-3), the variance of residuals increases from the smallest X-values to larger X-values. This problem goes away when Y is square-root transformed (see the right panel of Figure 17.6-3).

FIGURE 17.6-3 The effect of a square-root transformation on the residuals from a linear regression of number of pollen grains received on floral tube length of an iris species (Pauw et al. 2009). Residuals from a linear regression calculated on the original data (left panel) do not fit the equal-variance assumptions of linear regression, but residuals from a regression using the square root of the number of pollen grains (right panel) have more equal variances.

The square-root transformation, originally described in Section 13.3, often resolves unequal variance problems when the data are counts, as in Figure 17.6-3. The log transformation can also be effective when the variance in Y increases with increasing X. Arcsine transformation is often effective when Y is a proportion.

When analyzing data that violate the assumptions of regression, try simple transformations of X and/or Y to see if they help to meet the assumptions of linear regression.
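For a suspected power relationship, the transformation amounts to regressing log Y on log X. A sketch with hypothetical species-area numbers:

```python
# Sketch: linear regression after a log-log transformation, appropriate when
# a power function Y = a * X**b is suspected.
import numpy as np
from scipy import stats

area = np.array([2.0, 5.0, 12.0, 30.0, 80.0, 200.0, 500.0, 1500.0])  # hypothetical
species = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 9.0, 13.0, 20.0])       # hypothetical

fit = stats.linregress(np.log(area), np.log(species))
# Back on the original scale: species = exp(intercept) * area**slope
print("estimated power-law exponent:", fit.slope)
```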
17.7 The effects of measurement error on regression

Recall from Section 16.6 that measurement error occurs when a variable is not measured with complete accuracy. Many biological traits, such as behavior or aspects of physiology, can be difficult to measure accurately, so measurement error can be an important component of variation (Section 15.6). The effects of measurement error on regression differ from the effects on correlation (Section 16.6). The effect depends on the variable. Measurement error in Y increases the variance of the residuals, as shown in Figure 17.7-1 when you compare the scatter in the middle panel (measurement error in Y) with that in the left panel (no measurement error). This increases the sampling error of the estimate of the slope and of the predictions but has no effect on the expected slope.

FIGURE 17.7-1 The effects of measurement error on the estimate of regression slope. X and Y are measured without error in the left panel. Y is measured with error in the middle panel, which has little effect on expected slope but increases the variability in the residuals. X is measured with error in the right panel, which causes the expected estimate of the slope to decline (solid line) compared with the slope in the absence of measurement error (dashed line). The variability of the residuals also increases.

Measurement error in X (the right panel in Figure 17.7-1) also increases the variance of the residuals, and in addition it causes bias in the expected estimate of the slope. With measurement error in X, b will tend to lie closer to zero on average than the population quantity β. On average, the largest values of X in the data will include disproportionately many measurements that were erroneously overestimated. Since the true X-values of these points are smaller than their measured values, they predict Y-values that on average lie closer to the mean than would the same X-values in the absence of measurement error. Conversely, the smallest values of X will tend to include disproportionately many measurements that were erroneously underestimated. These underestimated X-values will be associated with Y-values that lie closer to the mean on average than in the absence of measurement error.
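This bias toward zero (often called attenuation) is easy to demonstrate by simulation. In the following hypothetical sketch the true slope is 1.0, and adding noise to X with the same variance as X itself roughly halves the expected estimate:

```python
# Sketch: measurement error in X biases the estimated slope toward zero.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_x = rng.normal(0, 1, 5000)
y = 1.0 * true_x + rng.normal(0, 0.5, 5000)   # true slope is 1.0

noisy_x = true_x + rng.normal(0, 1, 5000)     # X recorded with measurement error
print(stats.linregress(true_x, y).slope)      # close to 1.0
print(stats.linregress(noisy_x, y).slope)     # biased toward zero, near 0.5
```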
17.8 Nonlinear regression

Transformations won’t always successfully convert a nonlinear relationship into one that can be analyzed using linear regression. Nonlinear regression methods, however, are readily available in most statistics packages on the computer. Here in Section 17.8, we outline some basic principles for nonlinear regression. The assumptions of nonlinear regression are almost the same as those of linear regression (Section 17.5), except here we usually assume that the true relationship between X and Y has a specific nonlinear form. The immediate problem when turning to nonlinear regression is the nearly unlimited number of options. There are so many mathematical functions to choose from that it can be difficult to know where to begin. The appropriate choice depends on the data, but a few guidelines can help you make a sensible choice.

A curve with an asymptote

The best advice we can offer is to keep things simple, unless the data suggest otherwise. Figure 17.8-1 illustrates this principle. The left panel shows a nonlinear regression model fitted to data on the population growth rate of a species of phytoplankton and the concentration of iron, a limiting nutrient. The mathematical function fit to the data passes through every data point. The resulting MSresidual is zero.

FIGURE 17.8-1 Population growth rate of a species of phytoplankton in culture in relation to the concentration of iron in the medium (data from Sunda and Huntsman 1997). The curve in the left panel is an arbitrarily complex function that passes through all of the data points. The curve in the right panel is a Michaelis-Menten curve that fits the data more simply.

This is hardly the best possible outcome, however, even though each data point fits precisely. The problem with the curve in the left panel of Figure 17.8-1 is that it would probably do a terrible job of predicting any new observations obtained from the same population, because the curve does not describe the general trend. Such a complicated curve is also difficult to justify biologically—is there good reason to think that all the peaks and dips in this curve truly reflect the effects of iron on the growth of phytoplankton? Greater simplicity, as demonstrated by the fitted curve in the right panel of Figure 17.8-1, solves both of these problems. The data are the same as in the left panel, but this time we’ve fit the much simpler function,

$Y = \frac{aX}{b + X}.$

This is the Michaelis-Menten equation used frequently in biochemistry. The curve rises from a Y-intercept at zero and increases at a declining rate with increasing X, eventually reaching a saturation point, or asymptote. The asymptote is represented in the formula by the constant a, whereas b determines how fast the curve rises to the asymptote. Reminiscent of linear regression, the Michaelis-Menten equation has only two parameters to estimate (the true asymptote and the true rate parameter). These parameters are very different, however, from those of linear regression. We obtained the curve on the right of Figure 17.8-1 with a statistics package on the computer that used least squares to find the best fit. The virtue of linear regression is simplicity. We should strive to retain this property as we look at the wide range of nonlinear functions available.

Quadratic curves

The quadratic curve is often used in biology to fit a hump-shaped curve to data, such as the relationship shown in Figure 17.8-2 between the number of plant species present in ponds (Y) and pond productivity (X). The curve is a symmetric parabola described by the quadratic (second-degree polynomial) equation,

$Y = a + bX + cX^2.$

FIGURE 17.8-2 A quadratic curve fit to the relationship between the number of plant species present in ponds and pond productivity (Chase and Leibold 2002).

This equation is similar to the formula for a straight line except that one more term has been added for the square of X, and another regression coefficient c must be computed. When c is negative, the curve is humped, as in Figure 17.8-2. When c is positive, the parabola curves upward in a U-shape. Asymptotic and quadratic curves are just two of several nonlinear functions commonly used in biology. The choice between them must depend on the data: Is the relationship asymptotic or humped? If humped, is the hump symmetric or does it fall more steeply on one side than the other? If it falls more steeply on one side, then we must search for another function altogether. A good guide to a variety of curves used in biology can be found in Motulsky (1999).
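Least-squares fitting of a function like the Michaelis-Menten curve is brief in most environments. A sketch using scipy's curve_fit and made-up dose and growth-rate values:

```python
# Sketch: fitting the Michaelis-Menten curve Y = aX / (b + X) by least squares.
import numpy as np
from scipy.optimize import curve_fit

def michaelis_menten(x, a, b):
    return a * x / (b + x)

iron = np.array([0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2])          # hypothetical doses
growth = np.array([0.30, 0.50, 0.80, 1.00, 1.20, 1.30, 1.35])  # hypothetical rates

(a_hat, b_hat), cov = curve_fit(michaelis_menten, iron, growth, p0=[1.4, 0.3])
print("asymptote a:", a_hat, "  rate parameter b:", b_hat)
```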
Formula-free curve fitting

With the aid of a computer, it is possible to fit curves to data without specifying a formula. Often called smoothing, this approach gathers information from groups of nearby observations to estimate how the mean of Y changes with increasing values of X. There are several methods, including “kernel,” “spline,” and “loess” smoothing. We bypass the technical details here, but Example 17.8 illustrates the utility of the approach.

EXAMPLE 17.8 The incredible shrinking seal

Trites (1996) amassed a large set of measurements of the ages and sizes of northern fur seals (Callorhinus ursinus) in the Pacific Ocean, gathered over decades by many researchers. Most measurements were taken between spring and fall. In summer, females spend a lot of time on land giving birth and nursing young. The graph in Figure 17.8-3 shows the measurements for nearly 10,000 adult females. In a preliminary analysis, a line was fitted to the data, but the relationship appeared nonlinear. Average length increased with age but appeared to taper off by about 4500 days (i.e., about 12 years old).

FIGURE 17.8-3 Measurements of body length as a function of age for female fur seals. The spline fit is in black.

To fit the data, we used a spline technique to calculate a smoothed fit of body length on female age. The result is plotted in Figure 17.8-3. Astonishingly, the curve indicates that average female body length does not rise steadily with age, but oscillates each year. Female fur seals become longer each summer and then shorter again by winter (keep in mind that these changes are in length, not weight). Elongation results in part because the seals are heavier on land than in water, and the added weight stretches the skeleton during summer breeding. The skeleton shrinks back after the seals return to the water in the winter. It would have been difficult to come up with a formula that captured this complex relationship.

The seal example illustrates that it is not always easy to anticipate or see the type of relationship present in the data by visual inspection alone. Smoothing techniques can help. Example 17.8 also demonstrates that we don’t always need a mathematical formula if all we want to do is fit the data and use it to improve our understanding of a biological system. The fit obtained by smoothing is controlled by a smoothing coefficient that determines how bumpy the curve is. You can adjust this coefficient in statistical computer programs so that you can explore its effects. A low value for the coefficient results in a bumpy curve that, in the extreme, would pass through all the data points. A larger value of the coefficient gives a smoother fit. Computer programs usually use rules of thumb to choose the best value for the smoothing parameter, but it is wise to try alternatives to see what effect varying the smoothing coefficient has on the curve.
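One way to experiment with the smoothing coefficient is scipy's smoothing spline, whose parameter s plays exactly this role (this is one implementation among several; the seal analysis itself is not reproduced here). The data below are simulated:

```python
# Sketch: a smoothing spline whose coefficient s controls how bumpy the fit is.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(4)
age = np.sort(rng.uniform(0, 5000, 300))                 # hypothetical ages (days)
length = 100 + 10 * np.log1p(age / 365) + rng.normal(0, 2, 300)

bumpy = UnivariateSpline(age, length, s=100)     # small s: a bumpier curve
smooth = UnivariateSpline(age, length, s=5000)   # large s: a smoother curve

grid = np.linspace(0, 5000, 200)
print(bumpy(grid)[:3], smooth(grid)[:3])         # evaluate each fit on a grid
```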
17.9 Logistic regression: fitting a binary response variable

Logistic regression is a special type of nonlinear regression developed for a binary response variable—that is, when the Y-variable measured on independent units is either zero or one. A common use for logistic regression is to fit a dose-response curve, where mortality (or survival) of individuals (the “response”) is plotted against the concentration of a drug, toxin, or other chemical (the “dose”). Here we provide a quick description with a minimum of calculations.

Our example data set is shown in Figure 17.9-1, from an experiment in which 160 guppies were exposed to cold temperatures (5°C) for different durations (3, 8, 12, or 18 minutes). Each treatment, or dose, was assigned to 40 fish. The study was by Pitkow (1960), who carried out the experiment to identify the physiological mechanism causing fish death at cold temperatures. Mortality is the binary response variable (Y) measured on each independent fish (1 = died, 0 = survived), and duration is the explanatory variable. We used logistic regression to fit the curve predicting the probability of fish mortality from duration of exposure (Figure 17.9-1). Because the data are binary, the curve describes the proportion of fish that die (the mean of Y) at levels of exposure X. Table 17.9-1 gives a frequency table of the data.

FIGURE 17.9-1 Mortality of guppies in relation to duration of exposure to a temperature of 5°C (data from Pitkow 1960). Treatments were 3, 8, 12, or 18 minutes of exposure, with 40 fish in each of the four treatments. Each point (red circle) indicates a different individual (points were offset using a random perturbation to reduce overlap). Y = 1 if the individual died, whereas Y = 0 if the individual survived. Black dots indicate the proportion of deaths (±1 SE) in each treatment. The curve is the logistic regression predicting the probability of death.

TABLE 17.9-1 Number of fish (out of 40) in two mortality groups at each of four cold-temperature treatments.

Duration of exposure (min)    3    8   12   18
Died (Y = 1)                 11   24   29   38
Survived (Y = 0)             29   16   11    2

Linear regression of Y on X is unsuitable for these data because the binary response variable violates three of its assumptions. The relationship between Y and X is not linear, because the predicted values $\hat{Y}$ cannot fall outside the interval between 0 and 1. The residuals $Y - \hat{Y}$ are not normally distributed; they are binary—each point is either $0 - \hat{Y}$ or $1 - \hat{Y}$. The variance of the residuals is not constant: variance is expected to be highest when the predicted Y is near 0.5, and lowest when the prediction of Y is close to zero or one. (This is because when the prediction is zero or one, most of the data are zeros or ones, respectively, and the variance among individuals is therefore small.) These problems are not fixed with a simple transformation of the data.

All three problems are solved by logistic regression. The method fits a curve constrained to lie between 0 and 1. Instead of the normal distribution, it assumes that the outcomes at every X, each being either one or zero, have a binomial distribution (Section 7.1). The probability of an event (in this case, dying) is given by the corresponding predicted value on the regression curve. Finally, to correct for differences in the variance of residuals at different values of X, logistic regression weights each residual by its estimated variance obtained from the binomial distribution.

Logistic regression fits the following equation to binary data:

log-odds(Y) = α + βX.

The log-odds refers to the natural log of the odds of Y (Section 9.2). The right side of the equation (α + βX) is the formula for a straight line, with α the intercept and β the slope. In other words, an ordinary line is used to fit the log-odds of the proportion of individuals dying. The curve shown in Figure 17.9-1 is based on estimates of these two parameters, a for intercept and b for slope:

log-odds(Y) = a + bX.

Methods to calculate this regression curve from data are available in most computer statistical packages. When we analyzed the guppy data with such a statistics package, we obtained the output in Table 17.9-2, showing estimates of regression coefficients, and Table 17.9-3, showing the results of a test of the null hypothesis of zero slope (H0: β = 0). We will explain the contents of these two tables in turn.

TABLE 17.9-2 Logistic regression output. The values shown are the estimates a and b of the intercept (α) and the slope (β) of the logistic regression curve for the cold-fish data. SE refers to the standard error of the estimates.

            Estimate    SE
Intercept     −1.66    0.41
Slope          0.24    0.04
Number of iterations: 4
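As a sketch of how such output might be produced, here is one way to fit the model in Python with the statsmodels package (other packages would work equally well), expanding Table 17.9-1 into one row per fish:

```python
# Sketch: logistic regression of guppy mortality on duration of cold exposure.
import numpy as np
import statsmodels.api as sm

durations = [3, 8, 12, 18]
died = [11, 24, 29, 38]                  # deaths out of 40 fish per treatment

x, y = [], []
for d, k in zip(durations, died):
    x += [d] * 40                        # 40 fish at each duration
    y += [1] * k + [0] * (40 - k)        # 1 = died, 0 = survived

model = sm.Logit(np.array(y), sm.add_constant(np.array(x, dtype=float)))
result = model.fit(disp=False)           # maximum likelihood, iterative search
print(result.params)                     # approximately: intercept -1.66, slope 0.24
```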
TABLE 17.9-3 Analysis of deviance table, containing results of the log-likelihood ratio test for the cold-fish data.

Model       df   Deviance   Residual df   Residual deviance   P
Null                            159            209.55
Duration     1     44.86        158            164.69          2.12 × 10⁻¹¹

From Table 17.9-2 we see that the best estimate for the intercept is a = −1.66 ± 0.41 SE, and the best estimate for the slope is b = 0.24 ± 0.04 SE. Computer programs might additionally provide Wald statistics (symbolized by z) and corresponding P-values for approximate tests of the null hypotheses H0: α = 0 and H0: β = 0. However, the Wald method is inaccurate, and we do not present these results. The log-likelihood ratio test (Table 17.9-3) should be used instead.

Remember: the estimates a and b are not the intercept and slope of a linear regression of Y on X. Instead, they describe the linear relationship between X and the predicted log-odds of Y [which we’ll call log-odds(Ŷ)]. To obtain the predicted values (Ŷ), we need to convert the log-odds to ordinary proportions:

$\hat{Y} = \frac{e^{\text{log-odds}(\hat{Y})}}{1 + e^{\text{log-odds}(\hat{Y})}}.$

For example, to predict the proportion of fish dying for a cold-temperature duration of 10 minutes, we calculate

log-odds(Ŷ) = a + bX = −1.66 + 0.24(10) = 0.74,
$\hat{Y} = \frac{e^{0.74}}{1 + e^{0.74}} = 0.68.$

In other words, a duration of 10 minutes at 5°C is predicted to cause 68% mortality. Another useful quantity for the regression curve is the LD50 (lethal dose 50), the estimated dose predicting 50% mortality:

LD50 = −a/b.

For the fish data, LD50 = −(−1.66)/0.24 = 6.92 minutes.

Computer output for logistic regression will typically also include the number of iterations used in the calculations (Table 17.9-2). Logistic regression uses maximum likelihood (Chapter 20) to fit the curve to the data, and no formula exists to calculate the estimates. Instead, the computer uses a series of iterations to search for the best-fit curve. The search ceases when there are no further improvements in the fit from successive iterations.

Table 17.9-3 is an analysis of deviance table, with the results of a log-likelihood ratio test (see Section 20.5) of the null hypothesis that there is no relationship between mortality and duration (H0: β = 0 vs. HA: β ≠ 0). The analysis of deviance table is analogous to the ANOVA table in ordinary linear regression. The method fits two models to the data, one in which the variable duration is absent, and one in which it is present:

Null model: log-odds(Y) = α
Regression model: log-odds(Y) = α + βX.

The null model is a restatement of the null hypothesis that β = 0, whereas the regression model is a restatement of the alternative hypothesis that β ≠ 0. Analysis of deviance compares the fit of the two models to the data. If the improvement in fit of the regression model over the null model is too large to be explained by chance, the null hypothesis is rejected. The key quantity in the table is the improvement in fit when β ≠ 0, which is here labeled simply “Deviance.” This quantity is the difference in fit between the two models, where “fit” is the discrepancy between the observed values of Y and the values predicted, Ŷ, from each model. For the fish data, the improvement in fit is 209.55 − 164.69 = 44.86. Residual deviance is analogous to the residual mean square in ordinary linear regression. Under the null hypothesis that β = 0, the deviance has a χ² distribution with 1 df. The critical value is $\chi^2_{0.05,1} = 3.84$. Since 44.86 > 3.84, P < 0.05, and we reject the null hypothesis. An approximation to the exact P-value is given in the table.
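The back-conversion from log-odds to a proportion, and the LD50, take just a few lines; this sketch repeats the worked calculations above:

```python
# Sketch: predicted mortality at 10 minutes, and the LD50, from the estimates
# in Table 17.9-2.
import numpy as np

a, b = -1.66, 0.24
log_odds = a + b * 10                          # 0.74
y_hat = np.exp(log_odds) / (1 + np.exp(log_odds))
print(y_hat)                                   # about 0.68, i.e., 68% mortality
print(-a / b)                                  # LD50: about 6.92 minutes
```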
17.10 Summary

■ Regression is a method used to predict the value of a numerical response variable Y from the value of a numerical explanatory variable X.
■ Linear regression fits a straight line through a scatter of points. The equation for the regression line is Y = a + bX, where b is the slope of the line and a is the intercept.
■ The least-squares regression line is found by minimizing the sum of the squared differences between the observed Y-values and the values predicted by the line.
■ The residuals are the differences between the observed values of Y and the values predicted by the least-squares regression line, Ŷ. The variance of the residuals, MSresidual, measures the spread of points above and below the regression line.
■ Linear regression calculated on a sample of points estimates the straight-line relationship between the two variables in the population. The formula for the population regression line is Y = α + βX, where β is the slope of the line and α is the intercept.
■ If the assumptions of regression are met, then the sampling distribution of b is normal with mean β and standard deviation estimated by the standard error of the estimate of the slope, SEb.
■ The confidence interval for the slope β is based on the t-distribution.
■ There are two types of prediction in regression: the mean Y at a given X, and a single Y observation at X. Both predictions generate the same value Ŷ, but they have very different precision. Precision is lower when predicting an individual Y because it includes the variability between individuals having the same X-value.
■ Confidence intervals for predicted mean Y-values at each X are represented by confidence bands. Analogous intervals for the predicted Y-values of a single individual are called prediction intervals.
■ Extrapolation is the prediction of Y at values of X beyond the range of X-values in the sample. Extrapolation is problematic, though, because there is no way to ensure that the relationship between X and Y continues to be linear beyond the data.
■ If the null hypothesis is correct that the slope β of a population regression line is zero, then the test statistic t = b/SEb has a t-distribution with n − 2 degrees of freedom.
■ An ANOVA table and F-test can also be used to test the null hypothesis that the population slope β = 0.
■ R² is a measure of the fit of a regression line to the data. It measures the “fraction of the variation in Y that is explained by X.”
■ Regression toward the mean results when two imperfectly correlated variables are compared by regression. Individuals that are far from the mean for one of the measurements will on average lie closer to the mean for the other measurement.
■ Methods for regression assume that the relationship between X and Y falls along a straight line, that the Y-measurements at each value of X are a random sample from a population of Y-values, that the distribution of Y-values at each value of X is normal, and that the variance of Y-values is the same at all values of X.
■ The scatter plot and the residual plot are graphical devices for detecting departures from the assumptions of linear regression.
■ Transformations of X and/or Y can be used to render a nonlinear relationship linear and to correct violations of the assumption of equal variance of residuals at every X.
■ The log transformation is the most versatile transformation. It is useful when Y is related to X by a power function or by an exponential function.
■ If transformations do not work, nonlinear regression is an option.
■ Nonlinear regression curves should be kept as simple as the data warrant. Overly complex curves may be biologically unjustified and have low predictive power.
■ Smoothing methods make it possible to fit nonlinear curves to data without specifying a formula.
■ Logistic regression allows us to use a numerical variable to predict the probability that an individual has a particular value of a binary response variable.
17.11 Quick Formula Summary

Shortcuts

Sum of products: $\sum_i (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_i (X_i Y_i) - \frac{(\sum_i X_i)(\sum_i Y_i)}{n}.$

Sum of squares for X: $\sum_i (X_i - \bar{X})^2 = \sum_i (X_i^2) - \frac{(\sum_i X_i)^2}{n}.$

Regression slope

What is it for? Estimating the slope of the linear equation Y = α + βX between an explanatory variable X and a response variable Y.
What does it assume? The relationship between X and Y is linear; each Y-measurement at a given X is a random sample from a population of Y-measurements; the distribution of Y-values at each value of X is normal; and the variance of Y-values is the same at all values of X.
Parameter: β
Estimate: b
Formula: $b = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}.$
Standard error: $\mathrm{SE}_b = \sqrt{\frac{\mathrm{MS}_{\text{residual}}}{\sum_i (X_i - \bar{X})^2}}$, where MSresidual is the mean squared residual (the estimated variance of the residuals): $\mathrm{MS}_{\text{residual}} = \frac{\sum_i (Y_i - \hat{Y}_i)^2}{n - 2}.$ A quicker formula for MSresidual, not requiring you to calculate the Ŷ first, is $\mathrm{MS}_{\text{residual}} = \frac{\sum_i (Y_i - \bar{Y})^2 - b \sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{n - 2}.$

Regression intercept

What is it for? Estimating the intercept of the linear equation Y = α + βX.
What does it assume? Same as the assumptions for the regression slope.
Parameter: α
Estimate: a
Formula: $a = \bar{Y} - b\bar{X}.$
Standard error: Set X = 0 in the formula for the standard error of the predicted mean Y at a given X [see “Confidence interval for the predicted mean Y at a given X (confidence bands),” below]. This is valid only if the value X = 0 falls within the range of X-values in the data.

Confidence interval for the regression slope

What is it for? An interval estimate of the population slope.
What does it assume? Same as the assumptions for the regression slope.
Statistic: b
Parameter: β
Formula: $b - t_{\alpha(2),\,df}\,\mathrm{SE}_b < \beta < b + t_{\alpha(2),\,df}\,\mathrm{SE}_b$, where SEb is the standard error of the slope (see the formula given previously with the regression slope), and $t_{\alpha(2),\,n-2}$ is the two-tailed critical value of the t-distribution having df = n − 2.

Confidence interval for the predicted mean Y at a given X (confidence bands)

What is it for? An interval estimate of the predicted mean Y of all individuals in the population having the given value of X.
What does it assume? Same as the assumptions for the regression slope.
Statistic: Ŷ
Formula: $\hat{Y} - t_{\alpha(2),\,n-2}\,\mathrm{SE}[\hat{Y}] < \text{predicted } Y < \hat{Y} + t_{\alpha(2),\,n-2}\,\mathrm{SE}[\hat{Y}]$, where $\mathrm{SE}[\hat{Y}] = \sqrt{\mathrm{MS}_{\text{residual}} \left( \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right)}$ is the standard error of the prediction, MSresidual is the mean square residual (see the formula given previously under “Regression slope”), and $t_{\alpha(2),\,n-2}$ is the two-tailed critical value of the t-distribution having df = n − 2.

Confidence interval for the predicted individual Y at a given X (prediction intervals)

What is it for? An interval estimate of the predicted Y for a single individual having the given value of X.
What does it assume? Same as the assumptions for the regression slope.
Statistic: Ŷ
Formula: $\hat{Y} - t_{\alpha(2),\,n-2}\,\mathrm{SE}_1[\hat{Y}] < \text{predicted } Y < \hat{Y} + t_{\alpha(2),\,n-2}\,\mathrm{SE}_1[\hat{Y}]$, where $\mathrm{SE}_1[\hat{Y}] = \sqrt{\mathrm{MS}_{\text{residual}} \left( 1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right)}$ is the standard error of the prediction, MSresidual is the mean square residual (see the formula given previously under “Regression slope”), and $t_{\alpha(2),\,n-2}$ is the two-tailed critical value of the t-distribution having df = n − 2.
The t-test of a regression slope

What is it for? To test the null hypothesis that the population parameter β equals a null hypothesized value, β0.
What does it assume? Same as the assumptions for the regression slope.
Test statistic: t
Distribution under H0: t-distributed with n − 2 degrees of freedom.
Formula: $t = \frac{b - \beta_0}{\mathrm{SE}_b}$, where SEb is the standard error of b (see the formula given previously under “Regression slope”).

The ANOVA method for testing zero slope

What is it for? To test the null hypothesis that the slope β equals zero, and to partition sources of variation.
What does it assume? Same as the assumptions for the regression slope.
Test statistic: F
Distribution under H0: The F distribution. F is compared with $F_{\alpha(1),1,n-2}$.

Source of variation   Sum of squares                                          df      Mean squares
Regression            $SS_{\text{regression}} = \sum_i (\hat{Y}_i - \bar{Y})^2$   1       $SS_{\text{regression}}/df_{\text{regression}}$
Residual              $SS_{\text{residual}} = \sum_i (Y_i - \hat{Y}_i)^2$         n − 2   $SS_{\text{residual}}/df_{\text{residual}}$
Total                 $SS_{\text{total}} = \sum_i (Y_i - \bar{Y})^2$              n − 1

R squared (R²)

What is it for? Measuring the fraction of the variation in Y that is “explained” by X.
Formula: $R^2 = \frac{SS_{\text{regression}}}{SS_{\text{total}}}$, where SSregression is the sum of squares for regression and SStotal is the total sum of squares.

PRACTICE PROBLEMS

1. Calculation problem: Regression lines.
Men’s faces have higher width-to-height ratios than women’s, on average. This turns out to reflect a difference in testosterone expression during puberty. Testosterone is also known to predict aggressive behavior. Does face shape predict aggression? To test this, Carré and McCormick (2008) compared the face width-to-height ratio of 21 university hockey players with the average number of penalty minutes awarded per game for aggressive infractions like fighting or cross-checking. Their data are below along with some partial calculations. We will calculate the equation for the line that best predicts penalty minutes from face width-to-height ratio.

Face width-to-height ratio (X): 1.59, 1.67, 1.65, 1.72, 1.79, 1.77, 1.74, 1.74, 1.77, 1.78, 1.76, 1.81, 1.83, 1.83, 1.84, 1.87, 1.92, 1.95, 1.98, 1.99, 2.07
Penalty minutes per game (Y): 0.44, 1.43, 1.57, 0.14, 0.27, 0.35, 0.85, 1.13, 1.47, 1.51, 1.99, 1.06, 1.20, 1.23, 0.80, 2.53, 1.23, 1.10, 1.61, 1.95, 2.95

a. Plot the data in a scatter plot.
b. Examine the graph. Based on this graph, do the assumptions of linear regression appear to be met?
c. Calculate the means of the two variables. (While you’re doing so, record the sum of all X-values and the sum of all Y-values.)
d. Calculate the sum of X², the sum of Y², and the sum of the product XY.
e. Calculate the sum of products ($\sum_i (X_i - \bar{X})(Y_i - \bar{Y})$) and the sum of squares ($\sum_i (X_i - \bar{X})^2$) of the explanatory variable, face ratio.
f. How steeply does the number of penalty minutes increase per unit increase in face ratio? From the sum of products and sum of squares for face ratio, calculate the estimate b of the slope. Double-check that the sign of the slope matches your impression from the scatter plot.
g. Calculate the estimate of the intercept, a, from the variable means and b.
h. Write the result in the form of an equation for the line. Add your line to the graph in (a).

2. Calculation problem: Standard error and confidence intervals of the slope.
How uncertain is our estimate of slope? Using the face ratio and hockey aggressive penalty data from Practice Problem 1, calculate the standard error and confidence interval of the slope of the linear regression.
a. Calculate the total sum of squares for the response variable, penalty minutes.
b. Calculate the residual mean square MSresidual, using the total sum of squares for Y, the sum of products, and the slope b.
c. With the sum of squares for X and MSresidual, calculate the standard error of b.
d. How many degrees of freedom does this analysis of the slope have?
e. Find the two-tailed critical t-value for a 95% confidence interval (α = 0.05) for the appropriate df.
f. Calculate the confidence interval of the population slope, β.

3. Calculation problem: Testing the null hypothesis that the slope equals zero.
Can the relationship be explained by chance? Using the face ratio and hockey penalty data from Practice Problem 1, test the null hypothesis that the slope of the regression line is zero.
a. State the null and alternative hypotheses.
b. What is β0 for this null hypothesis?
c. Calculate the test statistic t from b, β0, and the standard error of b.
d. Find the critical value of t appropriate for the degrees of freedom, at α = 0.05.
e. Is the absolute value of the t for this test greater than the critical value?
f. Using a computer or the statistical tables, be as precise as possible about the P-value for this test. Draw conclusions from the test.
g. What fraction of the variation in average penalty minutes per game is accounted for by face ratio? Calculate the value of R².

4. Golomb et al. (2012) looked at whether higher chocolate consumption predicts higher body mass in humans. They fitted the data using a linear regression having chocolate consumption (number of times consumed per week) as the explanatory variable and the body mass index (BMI) as the response variable. BMI measures body mass relative to height, with high BMI typically meaning an overweight person. The slope of the regression was −0.142, with a standard error of 0.053 and a P-value of 0.008. Is this evidence that people who eat more chocolate have higher BMI? Why or why not?

5. What is the formula for each of the following four regression lines?

6. Some species seem to thrive in captivity, whereas others are prone to health and behavior difficulties when caged. Maternal care problems in some captive species, for example, lead to high infant mortality. Can these differences be predicted? The following data are measurements of the infant mortality (percentage of births) of 20 carnivore species in captivity along with the log (base-10) of the minimal home-range sizes (in km²) of the same species in the wild (Clubb and Mason 2003).

Log₁₀ home-range size:        −1.3, −0.5, −0.3, 0.2, 0.1, 0.5, 1.0, 0.3, 0.4, 0.5, 0.1, 0.2, 0.4, 1.3, 1.2, 1.4, 1.6, 1.6, 1.8, 3.1
Captive infant mortality (%):   4,   22,    0,   0,  11,  13,  17,  25,  24,  27,  29,  33,  33,  42,  33,  20,  19,  25,  25,  65

a. Draw a scatter plot of these data, with the log of home-range size as the explanatory variable. Describe the relationship between the two variables in words.
b. Estimate the slope and intercept of the least-squares regression line, with the log of home-range size as the explanatory variable. Add this line to your plot.
c. Does home-range size in the wild predict the mortality of captive carnivores? Carry out a formal test. Assume that the species data are independent.
d. Outliers should be investigated because they might have a substantial effect on the estimates of the slope and intercept. Recalculate the slope and intercept of the regression line from part (c) after excluding the outlier at large home-range size (which corresponds to the polar bear). Add the new line to your plot. By how much did it change the slope?

7. The following two graphs display data gathered to test whether the exercise performance of women at high elevations depends on the stage of their menstrual cycle (Brutsaert et al. 2002).
In the upper panel, the explanatory variable is the progesterone level and the response variable is the ventilation rate at submaximal exercise levels. The line is the least-squares regression. The lower panel is the corresponding residual plot.
a. What is a “least-squares” regression line?
b. What are residuals?
c. Assume that the random sampling assumption is met. By viewing these plots, assess whether each of the three other main assumptions of linear regression is likely to be met in this study.

8. The slopes of the regression lines on the following graph show that the winning Olympic 100-m sprint times for men and women have been getting shorter and shorter over the years, with a steeper trend in women than in men (the graph is modified from Tatem et al. 2004). If trends continue, women are predicted to have a shorter winning time than men by the year 2156. What cautions should be applied to this conclusion? Explain.

9. In an analysis of the performance of Major League Baseball players, Schaal and Smith (2000) found that the batting scores of the top 10 players in the 1997 baseball season dropped on average in 1998. What is the best interpretation of this finding?
a. Players who did well in 1997 reduced their effort the following year, realizing that they didn’t need to work as hard to get an above-average result.
b. Players performing above average in 1997 were older and more worn out by 1998.
c. Regression toward the mean.
d. Possibly (a) and (b), but (c) is likely and cannot be ruled out.

10. Hybrid offspring of parents of different species are often sterile. How different must the parent species be, genetically, to produce this effect? The accompanying table (Moyle et al. 2004) lists the proportion of pollen grains that are sterile in hybrid offspring of crosses between pairs of species of Silene (bladder campions—see the photo on the first page of this chapter). Also listed is the genetic difference between the pair of species, as measured by DNA sequence divergence. Assume that different species pairs are independent.

Silene species pair   Genetic distance   Proportion of pollen that is sterile
 1                         0.00                0.02
 2                         0.00                0.06
 3                         0.00                0.14
 4                         0.00                0.24
 5                         0.00                0.30
 6                         0.03                0.62
 7                         0.02                0.28
 8                         0.03                0.23
 9                         0.04                0.15
10                         0.04                0.45
11                         0.05                0.84
12                         0.11                0.65
13                         0.12                0.77
14                         0.12                1.00
15                         0.13                1.00
16                         0.13                0.93
17                         0.13                0.91
18                         0.14                0.93
19                         0.13                0.96
20                         0.13                1.00
21                         0.15                1.00
22                         0.16                0.97
23                         0.18                1.00

a. We would like to predict the proportion of hybrid pollen that is sterile (Y) from the genetic distance between the species (X). Since the response variable is a proportion, what transformation would be your first choice to help meet the assumptions of linear regression?
b. Transform the proportions and then produce a scatter plot of the data. Estimate and draw the regression line.
c. Calculate the 95% confidence interval for the slope of the line.

11. Rattlesnakes often eat large meals that require significant increases in metabolism for efficient digestion. Snakes are known to adjust their thermoregulatory behavior after feeding, seeking out warmer spots to increase their metabolic rates. Can snakes increase body temperature, though, even without this behavior? Tattersall et al. (2004) measured the change in body temperature of snakes after meals of various sizes, and we have used their data in an inappropriate way in the following graph, fitting a nonlinear mathematical function indicated by the curve.
a. Why is the nonlinear fit shown inappropriate?
b. What alternative procedure would you recommend to achieve the goal of predicting snake body-temperature change from meal size?

12. Male lizards in the species Crotaphytus collaris use their jaws as weapons during territorial interactions. Lappin and Husak (2005) tested whether weapon performance (bite force) predicted territory size in this species. Their measurements for both variables are listed in the following table for 11 males.

Bite force (N):      28.2, 33.9, 29.5, 39.8, 41.7, 44.7, 46.8, 47.9, 36.3, 35.5, 33.9
Territory area (m²):  437,  589,  871,  977, 1288, 2138, 2455, 3548, 2692, 2042, 3020

a. How rapidly does territory size increase with bite force? Estimate the slope of the regression line. Provide a standard error for your estimate.
b. How uncertain is our estimate of slope? Provide a 99% confidence interval for β.
c. Provide an interpretation for the 99% confidence interval in part (b). What does it measure?
d. Bite force is difficult to measure accurately, and so the values shown probably include some measurement error. Is the slope of the true regression line most likely to be underestimated, overestimated, or unaffected as a result?
e. Territory area is difficult to measure accurately, so the values shown probably include some measurement error. Is the slope of the true regression line most likely to be underestimated, overestimated, or unaffected as a result?

13. An ANOVA carried out to test the null hypothesis of zero slope for the regression of lizard territory area on bite force (see Practice Problem 12) yielded the following results.

Source of variation   Sum of squares   df   Mean squares   F-ratio
Regression               3758539        1
Residual                 7303662        9
Total

a. Complete the ANOVA table.
b. Using the F-statistic, test the null hypothesis of zero slope at the significance level α = 0.05.
c. What are your assumptions in part (b)?
d. What does the MSresidual measure?
e. Calculate the R² statistic. What does it measure?

14. James et al. (1997) demonstrated that the chemical hypoxanthine in the vitreous humour (the colorless jelly filling the eye) shows a postmortem linear increase in concentration with time since death. This suggests that hypoxanthine concentration might be useful in predicting time of death when it is unknown. The following graph shows measurements collected by the researchers on 48 subjects whose time of death was known. The regression line, the 95% confidence bands, and the 95% prediction interval are included on the graph.
a. The data set depicted in the graph includes one conspicuous outlier on the far right. If you were advising the forensic scientists who gathered these data, how would you suggest they handle the outlier?
b. What do the confidence bands measure?
c. Are the inner dashed lines the confidence bands or the prediction interval?
d. If the regression depicted in the graph was to be used to predict the time of death in a murder case, which bands would provide the most relevant measure of uncertainty, the confidence bands or the prediction interval? Why?

15. Social spiders live together in kin groups that build communal webs and cooperate in gathering prey. The following web measurements were gathered on 17 colonies of the social spider Cyrtophora citricola in Gabon (Rypstra 1979).

Colony:                           1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
Height of web aboveground (cm):  90 150 270 320 180 380 200 120 240 120 210 250 140 300 290 180 280
Number of spiders:               17  32  96 195 372 135  83  36  85  20  82  95  59  89 152  62  64
a. Use these data to draw a scatter plot of the relationship between the colony height aboveground (explanatory variable) and the number of spiders in the colony (response variable).
b. Examine the scatter plot and determine any impediments that might make it difficult to use linear regression to predict number of spiders in a colony from colony height.
c. In view of what you discerned in part (b), what method would you recommend to test whether colony height predicts the number of spiders?

16. Identify the assumption(s) of linear regression that is (are) violated in each of the following residual plots.

17. The forests of the northern United States and Canada have no native terrestrial earthworms, but many exotic species (including those used as bait when fishing) have been introduced. These immigrant species are dramatically changing the soil. The following data were gathered to predict the nitrogen content of mineral soils of 39 hardwood forest plots in Michigan’s Upper Peninsula from the number of earthworm species found in those plots (Gundale et al. 2005).

Earthworm species   Nitrogen content (%)
0   0.22, 0.19, 0.16, 0.08, 0.05
1   0.33, 0.30, 0.26, 0.24, 0.20, 0.18, 0.14, 0.13, 0.11, 0.09, 0.08
2   0.27, 0.24, 0.23, 0.18, 0.16, 0.13
3   0.32, 0.32, 0.29, 0.24, 0.22, 0.12, 0.40
4   0.34, 0.33, 0.23, 0.21, 0.18, 0.17, 0.15, 0.14
5   0.20, 0.54

a. Draw a scatter plot of these data, using the number of earthworm species as the explanatory variable.
b. Using the following intermediate calculations, calculate the regression line to predict the total nitrogen content of the soil from the number of earthworm species present. Add the line to your plot.
X̄ = 2.205, Ȳ = 0.215, Σi(Xi − X̄)² = 86.359, Σi(Yi − Ȳ)² = 0.366, Σi(Xi − X̄)(Yi − Ȳ) = 2.453.
c. What are the units of your estimate of slope, b?
d. What is the predicted nitrogen content of soil having five earthworm species?
e. Calculate a standard error of the slope.
f. Produce a 95% confidence interval for the slope.

18. Is the scaling of respiratory metabolism to body size in plants similar to that found in animals, where an approximate 3/4-power relation seems to hold? The data below are measurements of aboveground mass (in g) and respiration rate (in nmol/s) in 10 individuals of Japanese cypress trees (Chamaecyparis obtusa). They were obtained from a larger data set amassed by Reich et al. (2006). Respiratory metabolism (Y) is expected to depend on body mass (X) by the power law, Y = αX^β, where β is the scaling exponent.

Aboveground mass (g)   Respiration rate (nmol/s)
453      666
1283     643
695      1512
1640     2198
1207     2535
2096     4176
2804     3196
3528     3494
5940     7386
10,000   10,363

a. Use linear regression to estimate β for Japanese cypress. Include a standard error of your estimate.
b. Plot your line and the data in a scatter plot.
c. Use the 95% confidence interval to determine the range of most-plausible values for β based on these data. Does this range include the value 3/4?
d. Carry out a formal test of the null hypothesis that β = 3/4.
e. It is a challenge to estimate mass and respiration rate of a living tree in the field, and both measurements are likely subject to measurement error. How is measurement error in each of these two traits likely to affect the estimate of the exponent?
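For readers working through Problem 18 by computer rather than by hand, the following minimal R sketch (ours, not the book’s) shows the mechanics of estimating a scaling exponent by log-log regression. The vectors pair the ten mass and respiration values in the order they appear in the table above; with your own data, substitute your own measurements.

    # Estimating a scaling exponent beta in the power law Y = alpha * X^beta.
    # On the log scale, log(Y) = log(alpha) + beta * log(X), so beta is the
    # slope of the least-squares line through the log-transformed data.
    mass <- c(453, 1283, 695, 1640, 1207, 2096, 2804, 3528, 5940, 10000)
    rate <- c(666, 643, 1512, 2198, 2535, 4176, 3196, 3494, 7386, 10363)
    fit <- lm(log(rate) ~ log(mass))
    summary(fit)$coefficients   # the slope row gives b (estimate of beta) and its SE
    confint(fit)                # 95% confidence interval for beta

Because the slope on the log-log scale estimates β directly, its standard error and confidence interval come straight from the ordinary regression output.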
ASSIGNMENT PROBLEMS

19. You might think that increasing the resources available would elevate the number of plant species that an area could support, but the evidence suggests otherwise. The data in the accompanying table are from the Park Grass Experiment at Rothamsted Experimental Station in the U.K., where grassland field plots have been fertilized annually for the past 150 years (collated by Harpole and Tilman 2007). The number of plant species recorded in 10 plots is given in response to the number of different nutrient types added in the fertilizer treatment (nutrient types include nitrogen, phosphorus, potassium, and so on).

Plot   Number of nutrients added   Number of plant species
1    0   36
2    0   36
3    0   32
4    1   34
5    2   33
6    3   30
7    1   20
8    3   23
9    4   21
10   4   16

a. Draw a scatter plot of these data. Which variable should be the explanatory variable (X), and which should be the response variable (Y)?
b. What is the rate of change in the number of plant species supported per nutrient type added? Provide a standard error for your estimate.
c. Add the least-squares regression line to your scatter plot. What fraction of the variation in the number of plant species is “explained” by the number of nutrients added?
d. Test the null hypothesis of no treatment effect on the number of plant species.

20. Heusner (1991) assembled the following data on the mass and basal metabolic rate of 17 species of primates, including the potto shown in the accompanying photo.

Species                    Mass (g)    Basal metabolic rate (watts)
Alouatta palliata          4670.0      11.6
Aotus trivirgatus          1020.0      2.6
Arctocebus calabarensis    206.0       0.7
Callithrix jacchus         190.0       0.9
Cebuella pygmaea           105.0       0.6
Cheirogaleus medius        300.0       1.1
Euoticus elegantulus       261.5       1.2
Galago crassicaudatus      1039.0      2.9
Galago demidovii           61.0        0.4
Galago elegantulus         261.5       1.2
Homo sapiens               70,000.0    82.8
Lemur fulvus               2330.0      4.2
Nycticebus coucang         1300.0      1.7
Papio anubis               9500.0      16.0
Perodicticus potto         1011.0      2.1
Saguinus geoffroyi         225.0       1.3
Saimiri sciureus           800.0       4.4

Previous research has indicated that basal metabolic rate (R) of mammal species depends on body mass (M) in the following way: R = αM^β, where α and β are constants.
a. Use linear regression to estimate β for primates. Call your estimate b.
b. Plot your line and the data in a scatter plot.
c. How precise is the estimate of β? Provide a standard error for b and a 95% confidence interval for β. Assume that the species data are independent.

21. Previous evidence and some theory predict that the exponent β describing the relationship between metabolic rate and mass should equal 3/4. Using the data from Assignment Problem 20, test whether the exponent differs from the expected value of 3/4.

22. The white forehead patch of the male collared flycatcher is important in mate attraction. Griffith and Sheldon (2001) found that the length of the patch varied from year to year. They measured the forehead patch on a sample of 30 males in two consecutive years, 1998 and 1999, on the Swedish island of Gotland. The scatter plot provided gives the pair of measurements for each male. The solid regression line predicts the 1999 measurement from the 1998 measurement. The dashed line is drawn through the means for 1998 and 1999, but it has a slope of one. The difference between the two lines indicates that males with the longest patches in 1998 had smaller patches in 1999, relative to the other birds. Similarly, the males with the smallest patches in 1998 had larger patches, on average, in 1999, relative to other birds.
a. The following table summarizes the data. Use these numbers to calculate the regression slope.

                  Patch length 1998   Patch length 1999
Mean              7.62                7.40
Sum of squares    45.43               47.03
Sum of products   36.26
b. Now let the patch length in 1998 be the response variable (Y). Use the patch length in 1999 to predict patch length in 1998. What is the slope of this new regression?
c. What is the most likely reason that the slope is less than one in both regressions?

23. Seedlings of understory trees in mature tropical rainforests must survive and grow using intermittent flecks of sunlight. How does the length of exposure to these flecks of sunlight (fleck duration) affect growth? Leakey et al. (2005) experimentally irradiated seedlings of the Southeast Asian rainforest tree Shorea leprosula with flecks of light of varying duration while maintaining the same total irradiance to all the seedlings. Their data for 21 seedlings are listed in the following table.

Tree   Mean fleck duration (min)   Relative growth rate (mm/mm/week)
1    3.4   0.013
2    3.2   0.008
3    3.0   0.007
4    2.7   0.005
5    2.8   0.003
6    3.2   0.003
7    2.2   0.005
8    2.2   0.003
9    2.4   0.000
10   4.4   0.009
11   5.1   0.010
12   6.3   0.009
13   7.3   0.009
14   6.0   0.016
15   5.9   0.025
16   7.1   0.021
17   8.8   0.024
18   7.4   0.019
19   7.5   0.016
20   7.5   0.014
21   7.9   0.014

X̄ = 5.062, Ȳ = 0.0111, Σi(Xi − X̄)² = 100.210, Σi(Yi − Ȳ)² = 0.001024, Σi(Xi − X̄)(Yi − Ȳ) = 0.2535.

a. What is the rate of change in relative growth rate per minute of fleck duration? Provide a standard error for your estimate.
b. Using these data, test the hypothesis that fleck duration affects seedling growth rate.
c. Calculate a 99% confidence interval for the slope of the population regression.
d. What are your assumptions in parts (a)−(c)?
e. What is the main procedure you would employ to evaluate those assumptions?

24. How do we estimate a regression relationship when each subject is measured multiple times over a series of X-values? The easiest approach is to use a summary slope for each individual and then calculate the average slope. Green et al. (2001) dealt with exactly this type of problem in their study of macaroni penguins exercised on treadmills. Each penguin was exercised at a range of speeds, and its oxygen consumption was measured in relation to its heart rate (a proxy for metabolic rate). The graph provided shows the relationship for just two individual penguins. The following table lists the estimated regression slopes for each of 24 penguins in three categories.

Group              Regression slope
Breeding males     0.31, 0.34, 0.30, 0.38, 0.35, 0.33, 0.32, 0.32, 0.37
Breeding females   0.30, 0.32, 0.23, 0.38, 0.31, 0.26, 0.42, 0.28, 0.35
Molting females    0.25, 0.41, 0.32, 0.34, 0.27, 0.23

a. Calculate the mean, standard deviation, and sample size of the slope for penguins in each of the three groups. Display your results in a table.
b. Test whether the means of the slopes are equal between the three groups.

25. Many species of beetle produce large horns that are used as weapons or shields. The resources required to build these horns, though, might be diverted from other useful structures. To test this, Emlen (2001) measured the sizes of wings and horns in 19 females of the beetle species Onthophagus sagittarius. Both traits were scaled for body-size differences and hence are referred to as relative horn and wing sizes. Emlen’s data are shown in the following scatter plot along with the least-squares regression line (Y = −0.13 − 132.6X). We used this regression line to predict the relative wing mass at each of the 19 observed horn sizes. These are given in the following table along with the raw data.
Relative horn size (mm²)   Relative wing mass (mg)   Predicted relative wing mass (mg)
0.074    −42.8   −9.9
0.079    −21.7   −10.6
0.019    −18.8   −2.6
0.017    −16.0   −2.4
0.085    −12.8   −11.4
0.081    11.6    −10.9
0.011    7.6     −1.6
0.023    1.6     −3.2
0.005    3.7     −0.8
0.007    1.1     −1.1
0.004    −0.8    −0.7
−0.002   −2.9    0.1
−0.065   12.1    8.5
−0.065   20.1    8.5
−0.014   21.2    1.7
−0.014   22.2    1.7
−0.132   20.1    17.4
−0.143   12.5    18.8
−0.177   7.0     23.3

a. Use these results to calculate the residuals.
b. Use your results from part (a) to produce a residual plot.
c. Use the graph provided and your residual plot to evaluate the main assumptions of linear regression.
d. In light of your conclusions in part (c), what steps should be taken?

26. Can the songs of extinct species be predicted? Gu et al. (2012) used measurements of living species of katydid to predict the call frequency, or “pitch,” of the extinct Archaboilus musicus based on a 165-million-year-old fossil. Male katydids call by stridulating—rubbing forewings together so that a scraper on one wing rubs against a “file” on the other. Call frequency is predicted by file length (see accompanying graph; the data are available at whitlockschluter.zoology.ubc.ca). File length of a single well-preserved fossil of the extinct Archaboilus musicus was 9.34 mm. What was its call frequency? Summary for log-transformed data is as follows:
n = 58, ΣiXi = 33.241, ΣiYi = 183.936, ΣiXi² = 42.615, ΣiYi² = 609.994, ΣiXiYi = 86.720.
a. Calculate the regression line from the summary numbers provided. Assume for the purpose of this exercise that the data points are independent.¹¹
b. On the basis of this regression, what is the predicted log-transformed call frequency of Archaboilus musicus? The log file length for this species is 2.23.
c. What is the most-plausible range of values for the stridulation frequency of the 165-million-year-old katydid? Give the appropriate 95% confidence interval or prediction interval to determine this.
d. Calls with a frequency above about 20 kHz [or ln(frequency) of about 3.0] are ultrasonic and inaudible to most humans. How confident can we be that the calls of Archaboilus musicus were audible to humans? Answer based on your confidence or prediction interval in part (c).

27. The parasitic bacterium Pasteuria ramosa castrates and later kills its host, the crustacean Daphnia magna. The length of time between infection and host death affects the number of spores eventually produced and released by the parasite, as the following scatter plot reveals. The x-axis measures age at death for 32 infected host individuals, and the response variable is the square-root-transformed number of spores produced by the infecting parasite (Jensen et al. 2006).
a. Describe the shape of the relationship between the number of spores and host longevity.
b. What equation would be best to try first if you wanted to carry out a nonlinear regression of Y on X?

28. Human brains have a large frontal cortex with excessive metabolic demands compared with the brains of other primates. However, the human brain is also three or more times the size of the brains of other primates. Is it possible that the metabolic demands of the human frontal cortex are just an expected consequence of greater brain size? Sherwood et al. (2006) investigated this question in a number of ways. Their data in the accompanying table and scatter plot show the relationship between the glia-neuron ratio (an indirect measure of the metabolic requirements of brain neurons) and the log-transformed brain mass in nonhuman primates.
A linear regression is drawn through these data.

Species                      Brain mass (g)   ln(brain mass)   Glia-neuron ratio
Homo sapiens                 1373.3           7.22             1.65
Pan troglodytes              336.2            5.82             1.20
Gorilla gorilla              509.2            6.23             1.21
Pongo pygmaeus               342.7            5.84             0.98
Hylobates muelleri           101.8            4.62             1.22
Papio anubis                 155.8            5.05             0.97
Mandrillus sphinx            159.2            5.07             1.02
Macaca maura                 92.6             4.63             1.09
Erythrocebus patas           102.3            4.53             0.84
Cercopithecus kandti         71.6             4.27             1.15
Colobus angolensis           74.4             4.31             1.20
Trachypithecus francoisi     91.2             4.51             1.14
Alouatta caraya              55.8             4.02             1.12
Saimiri boliviensis          24.1             3.18             0.51
Aotus trivirgatus            13.2             2.58             0.63
Saguinus oedipus             10.0             2.30             0.46
Leontopithecus rosalia       12.2             2.50             0.60
Pithecia pithecia            30.0             3.40             0.64

a. Determine the equation of the regression line for nonhuman primates.
b. Using the nonhuman primate relationship, what is the predicted glia-neuron ratio for humans, given their brain mass?
c. Determine the most-plausible range of values for the prediction. Which confidence interval is relevant for your prediction of human glia-neuron ratio in (b): the confidence interval for the predicted mean glia-neuron ratio at the given brain mass, or the interval for the prediction of a single new observation?
d. Carry out the calculation of the 95% confidence interval chosen in part (c). (See the Quick Formula Summary for the method.) Assume for the purpose of this exercise that the species data are independent.
e. On the basis of your result in part (d), does the human brain have an excessive glia-neuron ratio for its mass compared with other primates? Explain.
f. Considering the position of the human data point relative to those data used to generate the regression line (see accompanying figure), what additional caution is warranted? Why?

29. Golenda et al. (1999) carried out a human clinical trial to investigate the effectiveness of a formulation of DEET (N,N-diethyl-m-toluamide) in preventing mosquito bites. The study applied DEET to the underside of the left forearm of volunteers. Cages containing 15 fresh mosquitoes were then placed over the skin for five minutes, and the number of bites was recorded. This was repeated four times at intervals of three hours. The scatter plot provided displays the total number of bites (square-root transformed) received by 52 women in the study in relation to the dose of DEET they received.
a. What are the uses of the square-root transformation in linear regression?
b. What feature of this study justifies our calling it an experimental study rather than just an observational study?
c. Complete the ANOVA table for these data.

Source of variation   Sum of squares   df   Mean squares   F-ratio
Regression
Residual              9.97315
Total                 32.0569

d. Use the F-statistic to test the null hypothesis of zero slope.
e. Calculate the R² statistic. What does it measure?

30. Calculating the year of birth of cadavers is a tricky enterprise. One method proposed is based on the radioactivity of the enamel of the body’s teeth. The proportion of the radioisotope 14C in the atmosphere increased dramatically during the era of aboveground nuclear bomb testing between 1955 and 1963. Given that the enamel of a tooth is non-regenerating, measuring the 14C content of a tooth tells when the tooth developed, and therefore the year of birth of its owner. Predictions based on this method seem quite accurate (Spalding et al. 2005), as shown in the accompanying graph. The x-axis is Δ14C, which measures the amount of 14C relative to a standard (as a percentage). There are three sets of lines on this graph.
The solid line represents the least-squares regression line, predicting the actual year of birth from the estimate based on amount of 14C. One pair of dashed lines shows the 95% confidence bands and the other shows the 95% prediction interval.
a. What is the approximate slope of the regression line?
b. Which pair of lines shows the confidence bands? What do these confidence bands tell us?
c. Which pair of lines shows the prediction interval? What does this prediction interval tell us?

31. A lot of attention has been paid recently to portion size in restaurants, and how it may affect obesity in North Americans. Portions have grown greatly over the last few decades. But is this phenomenon new? Wansink and Wansink (2010) looked at representations of the Last Supper in European paintings painted between about 1000 AD and 1700 AD. They scanned the images and measured the size of the food portions portrayed (relative to the sizes of heads in the painting). (For example, the painting reproduced here was painted by Ugolino di Nerio in 1234 AD.) They reported the year of the painting and the portion size as follows:

Portion size   Year
3.08   999
2.70   1004
2.14   1050
2.91   1098
3.69   1314
4.41   1314
3.51   1350
2.44   1309
3.21   1398
2.78   1400
3.39   1467
3.21   1458
3.17   1486
2.78   1494
2.57   1479
3.30   1527
3.81   1525
3.99   1520
4.24   1542
4.80   1515
5.40   1522
5.27   1568
5.44   1554
5.44   1544
5.70   1544
3.04   1561
3.30   1573
3.47   1618
3.56   1626
5.87   1707
1.93   1153
1.76   1434
1.84   1426

a. Calculate a regression line that best describes the relationship between year of painting and the portion size. What is the trend? How rapidly has portion size changed in paintings?
b. What is the most-plausible range of values for the slope of this relationship? Calculate a 95% confidence interval.
c. Test for a change in relative portion size painted in these works of art with the year in which they were painted.¹²
d. Draw a residual plot of these data and examine it carefully. Can you see any cause for concern about using a linear regression? Suggest an approach that could be tried to address the problem.

32. Scarlet king snakes (left photo) are relatively harmless snakes from the southeastern United States. Most individuals have a conspicuous color pattern very similar to the extremely venomous coral snake (right photo). The king snake mimics are thought to gain a survival advantage when coral snakes are present, because predators have learned to avoid coral snakes. However, king snakes also live well outside the range of coral snakes, where the conspicuous colors of these mimics should make them more vulnerable than non-mimic king snakes, because the predators have not learned to avoid coral snakes. To test this, Harper and Pfennig (2008) compared predation rates on mimic and non-mimic king snake color patterns at locations with varying distance from the boundary of the range of coral snakes. The results are given in the table. The first variable is the distance in km between each study location and the boundary of the area where coral snakes are present. Negative numbers mean locations inside the range of coral snakes, and positive numbers mean locations outside the range. At each location, plasticine dummies of king snakes were set out in the habitat, with half the dummies painted to look like mimics and the other half like less-conspicuous nonmimics. The second variable in the table is the proportion of attacks by predators on the mimics at each location.
Distance from boundary (km)   Proportion of attacks on mimics
−97    0
−47    0.01
−33    0
−23    0
−72    0.33
−23    0.5
152    0.4
−15    0.67
97     0.66
113    0.66
105    1
80     1
138    1
148    1
152    1
49     0.4
48     0

a. Give the equation for the line that best predicts proportion of attacks on mimics from the distance to the boundary. What is the trend? Assume that the relationship is linear over the range of the data (being a proportion, the true relationship cannot extend below zero or above one).
b. Test the hypothesis that distance to the boundary predicts the proportion of attacks on mimics.

33. The warm temperatures of spring and summer arrive earlier now at high latitudes than they did in the past, as a result of human-caused climate change. One consequence is that many organisms start breeding earlier in the year than in previous years, often at suboptimal times. For example, historically the great tit Parus major (a well-studied European bird; see Example 2.3A) laid its eggs on dates that resulted in the chicks hatching around the time that caterpillars, a major source of food, became abundant. Currently, a shift in breeding date has led to a mismatch between hatching date and the dates when the caterpillars appear. Does this mismatch affect the growth rate of the bird population? To test this, Reed et al. (2013) used multiple years of study to examine the average timing mismatch (X), in days, with the growth rate of the bird population (Y), expressed as log of the ratio of the number of birds in one year over the number of birds in the previous year. A growth rate greater than zero indicates that the population is increasing, whereas a negative value indicates that the population is declining. Their data are summarized below (available at whitlockschluter.zoology.ubc.ca).
n = 38, ΣiXi = 4.923885, ΣiYi = 41.12394, ΣiXi² = 2005.83430, ΣiYi² = 50.15243, ΣiXiYi = −3.97524.
a. Find the formula of the line that best predicts population growth rate from mismatch. What is the trend in growth rate with timing mismatch?
b. What is the confidence interval for the slope of this line?
c. Are the data consistent with a “substantial” effect on population growth rate (where “substantial” refers to a decline of 0.1 or more in growth per 10-day mismatch, which would be enough to cause extinction with expected climate change)?
d. Is there a significant relationship between mismatch and growth rate of the population? Carry out a formal test.

34. Dads transmit many more new mutations than do mothers to their babies at conception. These mutations occur from copying errors during sperm production. There is increasing interest in the effect of father age on this process. As part of a larger study into the genetics of mental illness, Kong et al. (2012) used complete genome sequencing of 21 father-child pairs to tally the total number of new mutations inherited from each father (in this particular sample, all the offspring were afflicted with schizophrenia). These counts are listed in the following table along with fathers’ ages at offspring conception.

Age of father (years)   Number of new mutations
16   39
18   41
20   39
19   49
22   50
24   54
24   55
24   61
25   57
28   52
29   54
30   57
32   61
37   67
36   70
34   77
30   83
29   67
33   68
26   54
33   65

a. Graph the relationship between number of new mutations (Y) and father’s age (X). Add the regression line to your plot.
b. Based on these data, how rapidly does the number of new mutations increase with father’s age? Provide a standard error for your estimate.
c. What is the predicted mean number of new mutations from fathers 36 years of age?
How does this compare with the predicted number for fathers only 18 years old?
d. Use the ANOVA approach to test the null hypothesis of no relationship between father’s age and number of new mutations. Include an ANOVA table with your results.
e. What fraction of the variation among fathers in the number of new mutations is explained by father’s age?

35. The threat of bioterrorism makes it necessary to quantify the risk of exposure to infectious agents such as anthrax (Bacillus anthracis). Hans (2002) measured the mortality of rhesus monkeys in an exposure chamber to aerosolized anthrax spores of varying concentration. The data are available at whitlockschluter.zoology.ubc.ca and are tabulated below.

Anthrax concentration (spores/l)   Survived   Died
29,300    7   1
32,100    4   4
45,300    3   5
57,300    2   6
64,800    3   5
67,000    5   3
100,000   0   8
125,000   1   7
166,000   0   8

a. Graph the relationship between mortality (Y) and anthrax concentration (X).
b. We would like to use these data to predict the probability of death based on anthrax concentration. Which assumptions of linear regression are violated by these data? Explain.
c. Which method could be used instead to predict mortality from anthrax concentration?
d. An analysis of these data using the method in part (c) yielded the following results. Using these results, what is the predicted mortality from a concentration of 100,000 spores/l?

            Estimate       SE
Intercept   −1.7445        0.6206
Slope       0.00003643     0.00001119

Model      df   Deviance   Residual df   Residual deviance
Null                       71            92.982
Duration   1    19.02      70            73.962

e. Based on these results, add the regression curve to your plot in (a).
f. Based on these data, what is the concentration predicting a 50% mortality (include units)?
g. Using the results in part (d), test the null hypothesis of zero slope.

36. Do individual differences in stress physiology influence survival or reproduction in natural populations? Blas et al. (2007) investigated this question in a Spanish population of European white stork (Ciconia ciconia). The accompanying data display stress-induced corticosterone levels circulating in the blood of 34 storks, measured once when they were nestlings, and their survival over the subsequent five years of study. “Stress” involved restraining each stork for 45 minutes and then taking a blood sample.

Corticosterone (ng/ml)   Survival
26     1
28.2   1
29.8   1
34.9   1
34.9   1
35.9   1
37.4   1
37.6   1
38.3   1
39.9   1
41.6   1
42.3   1
52     1
26.6   0
27     0
27.9   0
31.1   0
31.2   0
34.9   0
35.9   0
41.8   0
43     0
45.1   0
46.8   0
46.8   0
47.4   0
47.4   0
47.7   0
47.8   0
50.7   0
51.6   0
56.4   0
57.6   0
61.1   0

a. Graph the relationship between survival (Y) and stress-induced corticosterone levels (X).
b. Give three reasons that linear regression would not be suitable for these data.
c. What regression method could be used instead to predict stork survival from corticosterone levels? How does the method overcome the problems noted in part (b)?
d. An analysis of these data using the method in (c) yielded the following results. Based on these results, add the regression curve to your plot in part (a).

            Estimate    SE
Intercept   2.70304     1.74725
Slope       −0.07980    0.04368

Model      df   Deviance   Residual df   Residual deviance
Null                       33            45.234
Duration   1    3.84       32            41.396

e. Based on these data, what is the estimated concentration predicting a 50% mortality (include units)?
f. Using the results in part (d), test the null hypothesis of zero slope.
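Problems 35 and 36 refer to output from logistic regression. As a minimal sketch (ours, not the book’s) of how such output is produced, the following R code fits a logistic regression to the anthrax data of Problem 35 with glm(); the data frame name and the illustrative prediction at 50,000 spores/l are our choices, not part of the problem.

    # Logistic regression for the anthrax data of Problem 35. Each row gives
    # the numbers dying and surviving at one spore concentration.
    anthrax <- data.frame(
      conc     = c(29300, 32100, 45300, 57300, 64800, 67000, 100000, 125000, 166000),
      survived = c(7, 4, 3, 2, 3, 5, 0, 1, 0),
      died     = c(1, 4, 5, 6, 5, 3, 8, 7, 8)
    )
    fit <- glm(cbind(died, survived) ~ conc, family = binomial, data = anthrax)
    summary(fit)  # intercept and slope with SEs; should closely match the table above
    # Predicted probability of death at an example concentration of 50,000 spores/l:
    predict(fit, newdata = data.frame(conc = 50000), type = "response")

The same glm() call, with survival as the response and corticosterone as the predictor, would produce output of the kind shown in Problem 36.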
INTERLEAF 11  Using species as data points

Many types of studies in biology use species measurements as data points. We’ve encountered several cases in this book. In Chapter 2, for instance, we looked at the frequency distribution of bird abundances using the abundance measurements of different bird species as data points (Example 2.2B). In Chapter 16, we illustrated the association between brain and body mass in mammals by using measurements of different mammal species as data points. What we haven’t told you yet is how tricky it can be to analyze such data. The trouble is that species data are not usually independent. The reason is that species share an evolutionary history. Here we explain the situation and what can be done about it.

The following study of lilies illustrates the problem created by shared evolutionary history. Patterson and Givnish (2002) found that lily species flowering in the low-light environment of the forest understory, such as the bluebead lily (Clintonia borealis; below left), tend to have small and inconspicuous flowers that are whitish or greenish in color. Lilies that live in sunny, open habitats, or that live in deciduous woods but flower before the tree leaves come out, such as the Turk’s-cap lily (Lilium superbum; below right), tend to have large, showy flowers. Data from 17 lily species, shown in the branching figure on the next page, indicate an almost perfect association between habitat and flower type. All 10 species flowering in open habitats had large and showy flowers. Six of the seven species flowering in shaded habitats had relatively small and inconspicuous flowers. A χ² contingency test with these data soundly rejects the null hypothesis of no association (χ² = 13.24, df = 1, P = 0.0003).

However, this contingency test assumes that the data from the 17 species of lilies are independent. The figure indicates that this assumption is likely false. The branching tree in the figure is a phylogeny, indicating the ancestor-descendant relationships among the 17 lily species in the data set, which are at the tips of the tree. Branching points, or nodes, in the tree represent points in history when a single ancestor split into two descendant species. Two lily species at the tips of the tree are relatively closely related if they have a recent ancestor in common, such as Nomocharis pardanthina and Lilium superbum. Two species are more distantly related if their common ancestor is deeper in the tree, such as Lilium superbum and Prosartes maculata. The color of the branches in the figure indicates flower type, and the background shading indicates habitat type. (We can only guess about the habitat types and flower types of the ancestors, because they are not alive any more. The colors and shading used in the figure represent just one of the more likely possible scenarios for the transitions in habitat and flower type through history.)

The crucial insight from the tree is that closely related lily species tend to have the same flower type. They also tend to have the same habitat type. Both attributes were likely inherited from their common ancestor. In all, closely related species are more similar on average than species picked at random. This means that the species data points are not independent. The situation is like that confronted when pollsters interview more than one person from the same household, or when a bird researcher measures more than one chick from the same nest. However, in the present case, the non-independence is not the fault of the way the species were sampled by the researcher. It is generated by the process of evolution. This is a problem unique to biology.
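The contingency test reported above is easy to reproduce. The following minimal R sketch (ours, not the book’s) assembles the 2 × 2 table of counts from the text and runs the conventional χ² test, which treats the 17 species as independent, the very assumption the phylogeny calls into question.

    # Conventional chi-squared contingency test for the lily data, ignoring
    # the phylogeny. Counts are from the text: 10 of 10 open-habitat species
    # had showy flowers; 6 of 7 shade-flowering species had inconspicuous ones.
    lily <- matrix(c(10, 1, 0, 6), nrow = 2,
                   dimnames = list(habitat = c("open", "shade"),
                                   flower  = c("showy", "inconspicuous")))
    chisq.test(lily, correct = FALSE)  # X-squared = 13.24, df = 1, P = 0.0003

(R will also warn that some expected counts are small; the point here is only that nothing in this calculation knows about the tree.)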
The preceding example is about discrete variables, but the same issue arises when species data are numerical. The following extreme case makes the point. The branching tree in the figure above shows a phylogeny for 10 hypothetical species. In this example, the species fall into two lineages that split from a common ancestor a long time ago (at the first branching point at the far left of the figure). The lineages have been evolving separately ever since. More recently, each lineage split into five new species more or less simultaneously. We’ve used distinct colors to represent species in the two groups.

Now consider two numerical variables, X and Y, measured on all 10 hypothetical species. It is often the case that species in the same group (indicated by red or black) will be more similar to each other in these traits than to species in the alternative group, just because they share a more recent common ancestor. The scatter plot at bottom left shows example measurements for the 10 species. The symbols in the scatter plot indicate the main lineage to which the species belong. If we paid no attention to the evolutionary relationships among the 10 species represented by the different symbols, and if we assumed that the species data were independent, we would conclude that X and Y were positively correlated (r = 0.69, df = 8, P = 0.016). However, it is clear from the scatter plot that closely related species have similar values of X and of Y, a likely outcome of the common history they shared until very recently. Within the two groups, there appears to be no correlation between X and Y.

What can be done about the non-independence of species data resulting from shared evolutionary history? Happily, methods have been developed that correct for the problem. They make it possible to test whether an association is present and to put confidence limits on the strength of the association. The most widely used method for analyzing associations between continuously varying species traits is known as phylogenetically independent contrasts, invented by Felsenstein (1985). His explanation of how and why it works is very clear, and we refer you to his original paper for details. Analogous methods have been developed for categorical species traits.¹

None of the methods that correct for the problem of non-independence of species traits are foolproof, because all make assumptions that can be difficult to verify. For example, they all assume that the process of trait evolution through time can be adequately mimicked by a simple mathematical model of a “random walk.” If the mathematical model badly describes the process of evolution in a specific instance, then using the method can be even worse than ignoring the problem of shared history altogether and just using conventional statistical methods. Biologists nowadays tend to cover all bases and analyze their data both ways. Another strategy is to begin an analysis of species data by examining whether closely related species really are more similar on average than species picked at random. If not, then conventional statistical methods are adequate. If so, then the more specialized methods are used, often along with the results from conventional statistical methods, so that the outcomes can be compared.

1. Specialized computer programs are available to carry out phylogenetic comparative methods for continuously varying traits (such as phylogenetically independent contrasts) and discrete traits, such as MESQUITE (Maddison and Maddison 2011). Several contributed packages are available for the R statistical computing language (see the topic “Trait Evolution” at http://cran.r-project.org/web/views/Phylogenetics.html).
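To give a feel for how independent contrasts are computed in practice, here is a minimal R sketch using the contributed package ape, one of the packages referred to in the footnote. The tree and trait values are simulated purely for illustration; with real data you would supply your own phylogeny and species measurements.

    library(ape)
    set.seed(4)
    tree <- rcoal(10)       # a random 10-species phylogeny, for illustration only
    x <- rTraitCont(tree)   # trait X simulated by Brownian motion (a "random walk")
    y <- rTraitCont(tree)   # trait Y simulated independently of X
    cx <- pic(x, tree)      # phylogenetically independent contrasts for X
    cy <- pic(y, tree)      # phylogenetically independent contrasts for Y
    summary(lm(cy ~ cx - 1))  # test the association: regression through the origin

Each of the n − 1 contrasts compares the trait values on the two branches descending from one node of the tree; under the Brownian-motion assumption the contrasts, unlike the raw species values, can be treated as independent.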
Review Problems 3

1. The early movies by Eadweard Muybridge in the late 19th century showed for the first time the exact positions and movements of the legs during walking by horses and other large mammals. How much has this scientific analysis affected the representation of such animals in art? And how well do modern images of quadrupeds depict walking compared to images made by prehistoric humans? Horvath et al. (2012) examined a large number of images of horses and other animals in art created after Muybridge, in art made by modern humans before Muybridge, and in art from prehistoric humans depicted in cave paintings. For each image, they assessed whether the animal was presented in a biologically realistic posture. The data are at the bottom of the page.
a. Draw a graph of these data. What is the pattern?
b. Is there a statistically significant difference in modern images before and after Muybridge in the probability of getting the posture correct?
c. What assumptions are you making in (b)?
d. Calculate a confidence interval for the proportion of prehistoric paintings that depicted walking posture correctly.

TABLE FOR PROBLEM 1
Period                     Correct walking posture   Incorrect walking posture
Prehistoric                21                         18
Modern (pre-Muybridge)     45                         227
Modern (post-Muybridge)    289                        397

2. Assume that the few remaining hairs on a balding man’s head occur independently of each other and with equal probability for each square centimeter (cm²) of scalp. Imagine that this man has 2.3 hairs per square cm on average. What is the probability that a randomly chosen square cm of his scalp has exactly four hairs?

3. Many species have “assortative mating,” meaning that a female is more likely to mate with a male that is similar to her in some particular feature, such as body size, than with a dissimilar male. Imagine a female butterfly weighing 0.4 g in a population where the weight of males is normally distributed with mean 0.3 g and standard deviation 0.08 g. Assume that the female encounters males independently of their body weight.
a. What is the probability that the first male she encounters is within 0.1 g of her own weight?
b. What is the probability that the first five males she encounters are all more than 0.1 g different from her in body weight?

4. Spot the flaw. In their more recent study of “high-rise syndrome” (see Chapter 1), Vnuk et al. (2004) reported injury scores (0–4) of 119 fallen cats brought to a veterinary clinic in Zagreb, Croatia. The graph illustrates the average injury scores by the number of stories fallen. Identify the two principles of good graph design that are violated in this figure.

5. Most of us would like science to find ways to extend our lives. Genetic research has found some promising variants of genes in other organisms that greatly increase life span. Some mutations in the gene daf-2 cause the worm Caenorhabditis elegans¹ to live almost three times as long as normal worms. But does this greater life span come at a cost? Below are data for the average life span (in days) of worms having one of 14 different mutations at the daf-2 gene, along with data on the number of offspring produced during their lives (expressed as a percentage of number of offspring of normal unmutated worms) (Gems et al. 1998). Here we wish to calculate the correlation between these variables.
daf-2 mutation   Life span (days)   Relative number of offspring
e1365   28.2   101
m577    25.8   95
sa193   33.5   96
e1371   31.8   99
e1368   33.2   85
m41     27.0   87
m212    48.4   95
e1369   52.8   88
m120    32.3   93
e1370   33.8   70
m596    36.8   75
m579    44.3   73
e1391   63.4   61
e979    50.3   70

a. Both variables have positive skew. Find an appropriate transformation to reduce the skew (you will find that skew can be reduced but not altogether eliminated).
b. Include a plot of the transformed data. What trend is suggested?
c. Is life span significantly correlated with number of offspring? Answer using the transformed data.

6. Mendel famously discovered the basics of genetics with garden peas. He proposed a law of independent assortment, that the inheritance of different genes should be independent. We now know that this “law” is erroneous because genes that are linked on the same chromosome tend to be inherited together. Mendel (1866) used the following data from a cross of peas to test the predictions made by independent assortment.

Yellow smooth: 315
Yellow wrinkled: 101
Green smooth: 108
Green wrinkled: 32

The traits yellow/green and smooth/wrinkled are determined by different genes. If independent assortment were true, then the traits of the offspring of the cross should have the following proportions: 9/16 yellow smooth peas, 3/16 yellow wrinkled peas, 3/16 green smooth peas, and 1/16 green wrinkled peas. Test whether Mendel’s prediction about the proportions is consistent with these data.

7. Van Hylckama Vlieg et al. (2009) investigated the relationship between oral contraceptive use and thrombosis in women. In a sample of 1524 adult female patients who had thrombosis, 103 had taken oral contraceptives regularly. In a second sample of 1760 women from the same population who did not have thrombosis, 658 had taken oral contraceptives regularly.
a. What type of study design was used?
b. Graph the data. Which treatment condition (oral contraceptive use) had the higher proportion of women with thrombosis?
c. Test for an association between oral contraceptive use and thrombosis.
d. What is the odds ratio of thrombosis in women taking oral contraceptives compared to women not taking oral contraceptives? Include a confidence interval for the population odds ratio.
e. Under what circumstances can we say that the odds ratio estimated in part (d) is a reasonable estimate of the relative risk of thrombosis?

8. Studying the influence of metabolic differences among individuals on survival or reproductive success in nature requires that an individual’s metabolism doesn’t vary too much from time point to time point. To investigate, Hayes and O’Connor (1999) measured repeatability of thermogenic capacity (ability to generate heat) by recording maximal rate of oxygen consumption (VO2max) in high-altitude deer mice exposed to cold temperatures in a wind tunnel. A sample of 34 mice were measured twice about 68 days apart. The two measurements on each mouse, in ml/min, are given below.

Mouse   VO2max (ml/min)
1    5.62, 5.84
2    5.13, 4.75
3    5.00, 5.65
4    5.76, 6.07
5    6.10, 5.22
6    5.11, 5.68
7    5.63, 6.10
8    5.40, 6.20
9    4.62, 5.54
10   5.06, 5.56
11   4.29, 5.41
12   5.24, 5.58
13   4.81, 5.04
14   5.84, 5.69
15   5.34, 5.66
16   5.53, 5.89
17   5.59, 4.80
18   5.75, 5.92
19   4.41, 5.15
20   4.63, 4.82
21   5.59, 6.42
22   5.14, 5.31
23   5.18, 5.30
24   5.13, 5.21
25   4.80, 4.64
26   5.69, 6.17
27   5.00, 3.70
28   4.98, 5.32
29   5.33, 5.86
30   5.15, 5.50
31   4.79, 5.46
32   5.95, 5.85
33   5.91, 6.04
34   4.78, 4.83

a. Calculate the variance components of VO2max within and among deer mice.
b. What is the repeatability of thermogenic capacity, as measured using VO2max under cold exposure?
c. What are your assumptions in parts (a) and (b)?

9. Collins and Bell (2004) investigated the impacts of elevated carbon dioxide (CO2) concentrations on plant evolution. They raised separate lines of the unicellular algae Chlamydomonas under normal and high CO2 levels. After 1000 generations, they measured the growth rate of all of the experimental lines in a high CO2 environment. The results for 14 experimental lines are presented in the following table. Growth rate is measured relative to the starting strain and has no units. Use these data to test whether the mean growth rate is associated with the CO2 treatment.

CO2 treatment   Growth rate
Normal   2.31
Normal   1.95
Normal   1.86
Normal   1.59
Normal   1.55
Normal   1.30
Normal   1.07
High     2.37
High     1.89
High     1.55
High     1.49
High     1.26
High     1.20
High     0.98

10. To investigate whether subcutaneous fat provides insulation in humans, Sloan and Keatinge (1973) measured the rate of heat loss by boys swimming for up to 40 min in water at 20.3°C and expending energy at about 4.8 kcal/min. Heat loss was measured by the change in body temperature, recorded using a thermometer under the tongue, divided by time spent swimming, in minutes. The authors measured an index of body “leanness” on each boy as the reciprocal of the skin-fold thickness adjusted for total skin surface area (in meters squared) and body mass (in kg). Their data are listed in the following table.

Body leanness (m/kg)   Heat-loss rate (°C/min)
7.0   0.103
7.0   0.097
6.2   0.090
5.0   0.091
4.4   0.071
3.3   0.024
3.6   0.014
2.8   0.041
2.4   0.031
2.1   0.010
2.1   0.006
1.7   0.002

a. Draw a scatter plot of these data, showing the relationship.
b. Does body leanness predict heat-loss rate? Using the following intermediate calculations, calculate the regression line and add it to your plot in part (a). Carry out a formal test.
X̄ = 3.96667, Ȳ = 0.04833, Σi(Xi − X̄)² = 41.14667, Σi(Yi − Ȳ)² = 0.01696, Σi(Xi − X̄)(Yi − Ȳ) = 0.78053.
c. How uncertain is the estimate of slope? Calculate a 95% confidence interval.
d. What are your assumptions in parts (b) and (c)?
e. What fraction of the variation in heat-loss rate is predictable from body leanness?

11. For each of the following scenarios, state what statistic would be used to estimate the effect of interest.
a. How different are the two hospitals X and Y in the frequency of doctors who wash and do not wash their hands before medical procedures?
b. How different is the number of bacteria on hands between people who wash for one minute and people who do not wash their hands?
c. How different are athletes and nonathletes in the mean number of mitochondria per muscle cell?
d. How different is the number of mitochondria per cell between the muscles of people’s dominant arm and the muscles in their other arm?
e. What fraction of individuals in an elephant population are male?
f. How different are the frequencies of males in two populations of elephants?
g. How much variation in weight is there among individuals in an elephant population?
h. How strong is the association between the number of mitochondria per cell in arm muscles and leg muscles?

12. For each of the following scenarios or questions, say which method for hypothesis testing would be most appropriate to best answer the scientific question. Unless otherwise stated, make any necessary assumptions. Be as specific as possible. (Do not try to answer the biological question posed; just say what statistical technique would be best.)
a. Do Hospital A and Hospital B differ in the frequency of doctors who wash their hands before medical procedures?
b. Does the mean rate at which doctors wash their hands before medical procedures vary among the three hospitals X, Y, and Z?
c. Does the rate of hand washing at hospitals predict the proportion of patients catching infections?
d. Does washing hands for five minutes leave a different number of bacteria on people’s hands than washing for one minute, on average?
e. Which group washes hands for the greatest mean number of minutes each time, doctors or nurses?
f. Does whether or not doctors wash their hands before the first examination of patients have an effect on the lengths of patient stays in the hospital?
g. Do athletes have more mitochondria per muscle cell on average than nonathletes? (Assume that mitochondria per cell is normally distributed among individuals.)
h. Do athletes have a greater mean number of mitochondria per cell than nonathletes? (Assume that the number of mitochondria is not normally distributed but they have the same shape of distribution in the two groups.)
i. Is the mean number of mitochondria per cell in the muscles of people’s dominant arm different from the number in their other arm?
j. Is the proportion of males in an elephant population equal to 0.50?
k. Do two populations of elephants have the same proportion of males?
l. Are the left tusks of elephants on average longer than their right tusks?
m. Are elephants spread out over the savanna independently and with equal probability everywhere?
n. Is the length of elephants’ trunks normally distributed?
o. Are male elephants more variable in weight than are females?
p. Do male and female elephants differ in their mean growth rates? (Assume that elephant growth rate is not normally distributed but males and females have the same shape of distribution.)
q. Does the thickness of an elephant’s first left molar predict its age?
r. Does the thickness of an elephant’s first left molar predict whether it lives at least to 5 years of age?

13. The naked mole rat is a very unusual creature. For one thing, it is the only known mammal that is eusocial, with most individuals forgoing reproduction and instead helping to raise the offspring of the “king” and “queen” of their colony. They also live many times longer than other animals their size, and even up to twice as long as their two closely related species, the blind mole rat and the Damaraland mole rat (Edrey et al. 2012). It is possible that this difference is in part caused by differential expression of a transcription factor called HIF1-α, which regulates proteins called neuregulins (neural growth factors thought to be involved in maintaining nerve function). The data below give measurements of HIF1-α expression in several individuals from each of the three species (expressed as a percentage of the expression of actin, a common protein used as a reference). Does the expression of HIF1-α differ among these three species?

Naked mole rat: 3.5, 3.8, 5.6, 12.9, 13.9, 28.2
Blind mole rat: 5.2, 8.7, 8.9, 11.4, 12.6
Damaraland mole rat: 4.3, 5.2, 8.4, 10.2, 10.2, 20.6

a. Show the data in a graph. What trend is suggested?
b. Do the species differ significantly in their mean amount of HIF1-α? (Use a log transformation to improve the fit to assumptions.)

14. Previous studies have shown that the antibody titers in obese people are lower after vaccination than in people of normal weight.
One suggested reason is that the vaccines may not effectively penetrate the layer of subcutaneous fat in obese individuals. To test this, Middleman et al. (2010) compared the response to hepatitis B virus vaccine in obese participants in two different groups. The researchers vaccinated one group of 10 individuals with standard 1-inch (2.5 cm) needles. They used 1.5-inch (3.8 cm) needles instead for a second group of 14 individuals. They later measured the antibody titers (in units of mIU/ml) of each participant. Greater numbers indicate a more successful response to the vaccine. These results are as follows.

Short-needle group: 51.6, 87.4, 143.6, 144.6, 189.7, 189.8, 208.9, 324.7, 368, 383.9
Long-needle group: 28.0, 181.6, 203.9, 243, 249.6, 274.3, 341.2, 349.6, 393.0, 429.2, 464.2, 473.1, 492.9, 647.0

a. What is the most-plausible range of values for the difference in mean antibody titers between the long- and short-needle groups? Use the 95% confidence interval to answer this question.
b. Use an appropriate hypothesis test to compare the means of the two groups. What can you conclude about the effectiveness of the vaccine as a function of the length of the needle?
c. What is the 95% confidence interval for antibody titer in the long-needle group?

15. Spot the flaw. Refer to Problem 14. Grover, a student who skipped reading Chapter 2 of this book, made the following graph when presenting the results of the needle-length study to his epidemiology class. The points are means, and the error bars are 95% confidence intervals. What is the biggest weakness of the graph?

16. We are accustomed to thinking that the proportion of males at birth is fixed by genetics in birds and mammals to be close to 50%. Some have suggested, however, that the sex of offspring can be adjusted by females, such as in response to the quality of her mate or the number of helpers she has. West and Sheldon (2002) found a total of 15 studies, all done on birds, that have measured changes in the sex ratio in response to such social factors, and each has expressed its results in terms of a coefficient ranging from −1 to 1. The coefficient is positive if the change in the sex ratio of offspring is in agreement with evolutionary theory, and negative if the data disagree with the theory. A coefficient of zero indicates no association with social factors in the data. The measures are as follows: −0.160, −0.037, 0.034, 0.144, 0.137, 0.118, 0.395, 0.363, 0.350, 0.376, 0.253, 0.440, 0.453, 0.460, 0.563. The frequency distribution of the coefficients is shown in the following histogram.
a. What method could be used to test whether these data are consistent with a mean or median coefficient of zero? Discuss why you would use this method in contrast to specific other methods.
b. Apply the best method to these data to test whether the mean coefficient differs from zero.

17. Cellulose from the butts of smoked cigarettes is commonly used by urban birds in nest construction. In an observational study of the house finches in Mexico City, Suárez-Rodríguez et al. (2013) discovered that nests with more cellulose from smoked cigarette butts contained fewer nest-dwelling ectoparasites of birds (such as mites) than nests with less cellulose from smoked cigarettes. In a separate experimental study, the researchers placed thermal traps in 28 active house finch nests (the parasites hiding in the nest are drawn to the warmth and become trapped). Smoked Marlboro cigarette butts² were placed in the trap in about half the nests, randomly chosen.
In the other nests, filters from unsmoked cigarettes (lacking tobacco residues) were used as control. At the end of the experiment, the researchers counted the number of ectoparasites caught in each trap. The traps containing smoked butts had fewer ectoparasites than traps with unsmoked filters.
a. Which study provided the stronger evidence that the chemical contents of smoked cigarette butts deters ectoparasites: the observational study or the experimental study? Explain your reasoning.
b. Which of the six commonly used components of experimental design were not incorporated in the experimental study described above? What benefit might result from including them?

18. We often assume that the mapping between words and their meanings is completely arbitrary. Maurer et al. (2006) tested whether this was completely true. College students were shown the following two shapes, and asked to say which was “bouba” and which was “kiki.” Eighteen of 20 students called the angular shape on the left kiki, while the other two called that shape bouba.
a. Calculate a confidence interval for the proportion of adults who would call the left shape kiki and the other bouba.
b. Test whether kiki and bouba are used with equal probability in the student population.

19. Sex, with its many benefits, also brings risk. For example, individuals that are more promiscuous are exposed to more sexually transmitted diseases. This is true for other primates as well as for our own species. Different species of primates vary widely in the mean number of sexual partners per individual, and this raises the question, are the immune systems of more promiscuous species different from those of less promiscuous species? Researchers approached this question by comparing pairs of closely related primate species, in which one species of the pair was more promiscuous and the other less promiscuous (Nunn et al. 2000). They measured the mean white blood cell (WBC) count in cells per nanoliter for each species. The results are listed in the following table.

WBC count: Less promiscuous species   WBC count: More promiscuous species
5.7    9.1
7.2    9.1
7.4    10.6
10.4   9.1
10.4   9.2
9.9    11.9
8.1    9.3
8.4    8.9
9.2    12.5

a. What is the mean difference in WBC count between less and more promiscuous species? Which type of species (more promiscuous or less promiscuous) has the higher WBC count on average?
b. What is the 99% confidence interval for this difference?
c. Test the null hypothesis that there is no mean difference in WBC count between more and less promiscuous species.
d. What are your assumptions in (b) and (c)?

1. The lowly worm C. elegans is a popular study organism in aging research. See also Chapter 15, Assignment Problem 20. The daf-2 protein is an insulin receptor.
2. Cigarettes were consumed using an artificial smoking machine and contained chemical residues from the smoked tobacco.

18  Multiple explanatory variables

Up to this point in the book, we have discussed methods to predict a response variable from at most one explanatory variable. Many biological studies, however, investigate more than one explanatory variable at the same time. One reason for this is efficiency: for the same amount of work, or just a little more, we can obtain answers to more than one question if we include more than one explanatory variable in a study.
For example, a study of the causes of nearsightedness in children might examine the effects of both genetic factors and the amount of time subjects spend reading, rather than just one of these variables alone. The same sample of subjects can be used to address both questions.

The analysis of data from a study having more than one explanatory variable follows the purpose and design of the study. We considered several study designs in Chapter 14, and three of these will serve as the basis of the present chapter. The first design is an experiment that includes blocking (Section 14.4) to improve the detection of treatment effects. The second design is the factorial experiment (Section 14.5), which is carried out to investigate the effects of two or more treatment variables (often called factors) and their interaction. The third design adjusts for the effects of known confounding variables (also called covariates; see Interleaf 4 and Section 14.1) when comparing two or more groups. In this chapter, we introduce an all-purpose method that can be used to analyze data from designs fulfilling all of these objectives and more. The method is called general linear models.

General linear models is a large topic. It includes two-factor ANOVA, multiple regression, analysis of covariance, and other methods you might have encountered already in the scientific literature. Our purpose in this chapter is to introduce the basic elements by example, using visual displays of data and models. Each analysis begins with a model statement. We will show you how model statements are constructed, and how the results are interpreted, when analyzing block and factorial designs and when adjusting for a covariate. Our goal is to give you an overview of what is possible when there are multiple explanatory variables, but we don’t have space here to describe all of the many important complications that arise when analyzing these kinds of study designs. To go further or to apply this approach to your own data, consult more advanced textbooks, such as Grafen and Hails (2002) or Quinn and Keough (2002).

ANOVA and linear regression are linear models

ANOVA, linear regression, and more complicated analyses having multiple explanatory variables all involve a response variable Y that can be represented by a linear model plus random error. By model, we mean a mathematical representation of the relationship between a response variable Y and one or more explanatory variables. The scatter of Y measurements around the model is random error, which results from chance and various effects not included in the model. Linear regression is an obvious example of a linear model (Chapter 17). Let’s quickly review key elements of regression and then see how they can be generalized.

A model is a mathematical representation of the relationship between a response variable Y and one or more explanatory variables.

Modeling with linear regression

The model of linear regression is a straight line, Y = α + βX, where α and β are the intercept and slope, respectively. Individual values of the response variable are scattered above and below the regression line, representing random error. In Example 17.3, for example, we fitted a linear regression to the relationship between the natural logarithm of stability of plant biomass production in prairie plots (Y) and treatments of either 1, 2, 4, 8, or 16 plant species (X).
If we call the response variable LOGSTABILITY and the explanatory variable SPECIESNUMBER, then the equation with the variable names is LOGSTABILITY=α+β (SPECIESNUMBER). Least squares yielded the best fit of this model to the data, which is drawn in the right panel of Figure 18.1-1. The data points are scattered above and below the best-fit line, representing random error. FIGURE 18.1-1 Comparison of the fits of the null model (left panel) and the linear regression model (right panel) to the plant biomass stability data of Example 17.3. When we tested the null hypothesis that β = 0, we were really comparing the fit of two linear models to the data: the linear regression model and the null model, in which the slope β is set to zero. The null model is a simplified linear model because setting β equal to zero removes the variable species number (SPECIESNUMBER) from the equation, yielding the model statement LOGSTABILITY=α′, where α’ is just a constant. Fitting this null model to the data using least squares yields the flat line depicted in the left panel of Figure 18.1-1; the best estimate of the constant is the sample mean log stability of plant biomass production. By visually comparing the two panels of Figure 18.1-1, you can see that the full regression model—the one that includes the explanatory variable SPECIESNUMBER—is the superior fit. The residuals are smaller than for the null model: data points on average lie closer to the line fitted through the data (right panel) than to the flat line having zero slope (left panel). Even if the null hypothesis were true, though, we expect the data to have a slope that departs from zero just by chance. An important question is whether the regression model fits the data significantly better than the null model. The F-ratio (Section 17.3) is used to evaluate this: if F is sufficiently large, then the reduction in the magnitudes of the residuals represents a significant improvement in fit, and the null model is rejected. Another important question is whether the coefficients of the linear model, such as the slope in this example, are large enough to be interesting biologically. Graphical displays of model fits are a valuable tool for evaluating this question, and we emphasize this approach in the current chapter. Generalizing linear regression The method of general linear models takes this regression approach and extends it in two key ways. First, it allows more explanatory variables to be included in the model. Second, the method incorporates categorical explanatory variables in the same framework, not just numerical explanatory variables as in regression. For example, the linear model for single-factor ANOVA (one categorical variable) is Y=μ+A. The constant μ is the grand mean, the average of all the observations combined, and A stands for the group or treatment effect. For each observation, A is the difference between the mean of its group and the grand mean. The data themselves are scattered above and below the group means, representing random error. The resemblance between the model for a single categorical variable and the model for linear regression highlights their fundamental similarity. Both models include a response variable and an explanatory variable. Both also have a constant term—namely, the intercept in linear regression, and the grand mean in the case of a categorical variable. 
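To make the parallel concrete, here is a minimal sketch in Python using the statsmodels package (the data frame below is invented purely for illustration; the variable names echo Example 17.3). The same formula interface fits both kinds of linear model, and an F-test compares each full model with its null model:

```python
# A minimal sketch, using invented data, of fitting a linear regression as a
# linear model and comparing it with the null (constant-only) model.
# Requires: pandas and statsmodels.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical stand-in for the plant biomass data of Example 17.3.
plants = pd.DataFrame({
    "SPECIESNUMBER": [1, 1, 2, 2, 4, 4, 8, 8, 16, 16],
    "LOGSTABILITY":  [1.1, 0.9, 1.3, 1.2, 1.5, 1.4, 1.7, 1.6, 1.9, 2.0],
})

# Null model: LOGSTABILITY = constant (the flat line of Figure 18.1-1, left).
null_fit = smf.ols("LOGSTABILITY ~ 1", data=plants).fit()

# Full model: LOGSTABILITY = constant + slope * SPECIESNUMBER.
full_fit = smf.ols("LOGSTABILITY ~ SPECIESNUMBER", data=plants).fit()

# F-test: does adding SPECIESNUMBER significantly improve the fit?
print(sm.stats.anova_lm(null_fit, full_fit))

# A categorical explanatory variable uses the same machinery; for example,
# a single-factor ANOVA model would be written smf.ols("SHIFT ~ TREATMENT").
```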
The only real difference is that the explanatory variable is categorical for one model and numerical for the other.1

In Example 15.1, we compared phase shift of the circadian rhythms of subjects assigned one of three different light treatments. We can analyze these data with a general linear model as follows. Phase shift (call it SHIFT) is our response variable, and the light treatment (TREATMENT) is the explanatory variable of interest. The model statement can be written as

SHIFT = CONSTANT + TREATMENT.

This word statement is similar to the way most statistical packages on the computer require you to enter your model statement if you want to analyze your data with a general linear model.2 The CONSTANT term stands for the grand mean. TREATMENT represents the light treatment variable, indicating the group to which individuals belong. If there is a treatment effect, the TREATMENT group means will differ. The hypotheses are as follows.

H0: Treatment means are all the same.
HA: Treatment means are not all the same.

As we showed previously for linear regression, the significance of a treatment variable in a general linear model is tested by comparing the fit of two models to the data using an F-test. The two models correspond to the null and alternative hypotheses. The model that includes the TREATMENT term represents the alternative hypothesis—that there is an effect of the treatment on circadian rhythm. The other model is the null hypothesis. Under the null hypothesis, all the group means are the same (i.e., there is no treatment effect). Under the null hypothesis, the TREATMENT variable contributes nothing and is removed from the model, yielding

SHIFT = CONSTANT.

Figure 18.1-2 shows the fits of the null (left panel) and alternative (right panel) models to the circadian rhythm data. Horizontal lines give the predicted values (analogous to Ŷ in linear regression) under the two models.3 The plot on the right suggests that the effect of the "eyes" treatment is to shift circadian rhythm by an hour or more.

FIGURE 18.1-2 The fit of a general linear model to the circadian rhythm data (right panel) compared with the fit of the null model (left panel). Horizontal lines indicate the values predicted by the model.

Even if the null hypothesis were true, we expect the full model—the one including the treatment variable TREATMENT—to fit the data best, because the sample means of the different groups will not be identical simply by chance. An important question is whether the model that includes the TREATMENT variable fits the data significantly better than the null model. The F-ratio is used to test whether including the treatment variable in the model results in a significant improvement in the fit of the model to the data, compared with the fit of the null model lacking the treatment variable. The test was summarized in Chapter 15 (Table 15.1-2). The P-value (0.004) indicated that the improvement in fit was sufficiently large to warrant rejection of H0 in favor of the alternative model.

F measures the improvement in the fit of a general linear model to the data when the term describing a given explanatory variable is included in the model, compared with the null model in which the term is absent.

General linear models

A unified model encompassing both linear regression and single-factor ANOVA can therefore be stated using the following generic format:

RESPONSE = CONSTANT + VARIABLE.

RESPONSE is the response variable, which is numerical.
CONSTANT refers to a constant and can represent the mean or intercept, depending on whether the explanatory variable (VARIABLE) is categorical or numerical. To keep the format as generic as possible, the model statement leaves out the coefficients for the explanatory variable—namely, the slope in the case of a numerical explanatory variable, and group effects in the case of a categorical explanatory variable. Henceforth, we will use simple word statements like this to describe all types of general linear models. Model statements will be recognizably different mainly by the explanatory variables that are, or are not, included.

Linear models having more than one explanatory variable differ from single-variable models in two respects. They have one more variable, but more importantly they might include an interaction between variables, as indicated by the last term in the following word statement:

RESPONSE = CONSTANT + VARIABLE1 + VARIABLE2 + VARIABLE1*VARIABLE2.

An interaction between two explanatory variables means that the effect of one variable on the response depends on the value of the second variable (Chapter 14). With an interaction, the average response for a particular combination of the two variables differs from that expected simply from adding the average effect of each variable separately.

A few example models for two explanatory variables are listed in Table 18.1-1. As we show, the different models are best suited to analyzing data from particular experimental designs or observational studies. In the following sections we introduce several of these models for two explanatory variables, emphasizing key concepts rather than computational details.

TABLE 18.1-1 Linear models having one or two explanatory variables, with examples of study designs analyzed. Symbols refer to "words" in the model statement: μ is a constant (mean or intercept); Y is the numerical response variable; X is a numerical explanatory variable; A and B are fixed, categorical variables, whereas b indicates a blocking or other random-effect categorical variable. Models covered in this chapter are indicated in bold.

Linear model                 Other name                        Example study design
Y = μ + X                    Linear regression                 Dose-response
Y = μ + A                    One-way (single factor) ANOVA     Completely randomized
Y = μ + A + b                Two-way ANOVA, no replication     Randomized block
Y = μ + A + B + A*B          Two-way, fixed-effects ANOVA      Factorial experiment
Y = μ + A + b + A*b          Two-way, mixed-effects ANOVA      Factorial experiment
Y = μ + X + A                Analysis of covariance (ANCOVA)   Observational study
Y = μ + X1 + X2 + X1*X2      Multiple regression               Dose-response

Analyzing experiments with blocking

Here we show how to analyze an experiment designed with one treatment variable plus a blocking variable. Blocking (Section 14.4) results in an additional variable (block) that must be included in the analysis. We begin with the randomized block design because it is the simplest experiment incorporating blocking.

Analyzing data from a randomized block design

A randomized block design is like a paired design, but for more than two treatments. Data from such a design are analyzed with the following linear model:

RESPONSE = CONSTANT + BLOCK + TREATMENT.

In the typical randomized block design, such as Example 18.2, every treatment is replicated exactly once within each block.
It yields exactly one data point from each combination of treatment and block, and so no interaction term is included in the linear model.4

EXAMPLE 18.2 Zooplankton depredation

Svanbäck and Bolnick (2007) set up a field experiment in the shallows of a small lake on Vancouver Island (British Columbia) that allowed them to measure how fish abundance affects the abundance and diversity of prey species. They used a randomized block design to minimize noise caused by background variation in prey availability between locations in the lake. Three fish abundance treatments — Low, High, and Control — were set up at each of five locations in the lake. In the Low treatment, 30 small fish were added to a mesh cage having a surface area of 3 m × 3 m. In the High treatment, 90 fish were added to a second cage nearby. An unenclosed space of equal area adjacent to each pair of enclosures served as the Control. Table 18.2-1 shows the diversity of zooplankton prey in each treatment at the five locations after 13 days. Diversity is measured using an index called Levins' D, which takes both the number of species and their rarity into account (e.g., common species count more than rare species). Does treatment affect zooplankton diversity?

TABLE 18.2-1 Zooplankton diversity D in three fish abundance treatments.

                    Abundance treatment
Location (block)    Control    Low    High
1                   4.1        2.2    1.3
2                   3.2        2.4    2.0
3                   3.0        1.5    1.0
4                   2.3        1.3    1.0
5                   2.5        2.6    1.6

To analyze these data, we can't simply combine all 15 measurements into three treatment groups, because the three measurements made at each location in the lake are not independent. We ran into the same issue in Section 12.2 when analyzing paired measurements. Instead, we need to designate each location as a distinct category of a blocking variable and include it in the analysis. If the blocking variable accounts for some of the variation in the data, then it can improve our ability to detect an effect of the treatment of interest. As in a paired t-test, treatment effects are assessed by the differences in response to different treatments within each block.

Model formula

Let's call the abundance treatment ABUNDANCE in our statement of the general linear model. BLOCK is the individual location. Finally, let's call the response variable DIVERSITY. The full linear model, including all terms, is

DIVERSITY = CONSTANT + BLOCK + ABUNDANCE.

This linear model resembles one for single-factor ANOVA, except that we've added a blocking variable. Fish abundance treatment is the factor we are interested in, so the null and alternative hypotheses are as follows.

H0: Mean zooplankton diversity is the same in every abundance treatment.
HA: Mean zooplankton diversity is not the same in every abundance treatment.

The linear model, including all of the terms, represents the alternative hypothesis, whereas the model representing the null hypothesis (the null model) leaves out ABUNDANCE:

DIVERSITY = CONSTANT + BLOCK.

Fitting the model to data

The graphs in Figure 18.2-1 visually compare the fits of the null (left panel) and alternative (right panel) models. The horizontal lines indicate predicted values for each model. The residual is the difference between each data point and the corresponding predicted value. The predicted values under the full model (right panel) suggest that the diversity of zooplankton was lowest in the High treatment and highest in the Control.
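Before looking at the fitted values in more detail, here is a minimal sketch in Python, using the statsmodels package, of how this comparison of null and full models can be run on the data of Table 18.2-1 (any software with a model-formula interface behaves similarly):

```python
# A sketch of the randomized-block analysis of Example 18.2, using the
# zooplankton diversity measurements from Table 18.2-1.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

zoo = pd.DataFrame({
    "BLOCK":     [1, 2, 3, 4, 5] * 3,
    "ABUNDANCE": ["Control"] * 5 + ["Low"] * 5 + ["High"] * 5,
    "DIVERSITY": [4.1, 3.2, 3.0, 2.3, 2.5,    # Control, locations 1-5
                  2.2, 2.4, 1.5, 1.3, 2.6,    # Low
                  1.3, 2.0, 1.0, 1.0, 1.6],   # High
})

# Full model: DIVERSITY = CONSTANT + BLOCK + ABUNDANCE.
# C() marks BLOCK as categorical, since locations are labels, not amounts.
full = smf.ols("DIVERSITY ~ C(BLOCK) + ABUNDANCE", data=zoo).fit()

# Null model for the treatment test: the blocking variable is retained.
null = smf.ols("DIVERSITY ~ C(BLOCK)", data=zoo).fit()

# The F-test for ABUNDANCE compares the two fits; it reproduces the
# F-ratio reported in Table 18.2-2 (F = 16.37 with 2 and 8 df).
print(sm.stats.anova_lm(null, full))
```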
The predicted values under the null model (left panel) lie on five horizontal lines rather than one line, as in Figure 18.1-2, because the null model for a test of the effect of ABUNDANCE still includes the blocking variable, and there are five blocks, each with its own average value for the response, D.

FIGURE 18.2-1 Comparison of the fits of two linear models to the zooplankton diversity data. Symbols indicate blocks (locations). Horizontal lines are predicted values, with numbers indicating blocks. The left panel shows the fit of the null model, including the CONSTANT and BLOCK terms. The right panel shows the fit of the "full" model, which includes the ABUNDANCE treatment variable.

The ANOVA table provides the results of the test (Table 18.2-2). Adding ABUNDANCE significantly improves the fit of the linear model (F = 16.37; df = 2,8; P = 0.001).

TABLE 18.2-2 ANOVA results of fitting the linear model to the zooplankton diversity data.

Source of variation    Sum of squares    df    Mean square    F        P
BLOCK                   2.3400            4    0.5850
ABUNDANCE               6.8573            2    3.4287         16.37    0.001
Residual                1.6760            8    0.2095
Total                  10.8733           14

Computer output from programs fitting linear models to data typically also includes coefficients estimating magnitudes of effects with standard errors. We do not provide interpretations of these coefficients here. Instead, we emphasize graphical displays of model fits to evaluate effects. The model predicted values shown in the right panel of Figure 18.2-1 suggest that fish abundance had relatively large effects on zooplankton diversity in this experiment.

The effect of BLOCK is not tested here, because it is of less interest than the effect of ABUNDANCE, the treatment variable. Blocking is included in the analysis only to reduce the effect of variation between locations in zooplankton diversity when testing the effect of the fish abundance treatment. BLOCK should still be retained in the analysis whether or not it is statistically significant, because it is there by design and because either way it can still improve the ability to detect a treatment effect.

Analyzing factorial designs

Here we illustrate the analysis of data from a factorial experiment, an experiment in which all combinations of the values of two (or more) explanatory variables are investigated (Section 14.5). The explanatory variables are called factors; they represent treatments of direct interest. (In contrast, a blocking variable is not considered a factor because it is not of direct interest—a block is included only to improve the detection of treatment effects.) The linear model for an experiment having two treatment variables (call them A and B, for short) is

RESPONSE = CONSTANT + A + B + A*B.

The A and B terms in the model represent the main effects, whereas A*B is the interaction term. Figure 18.3-1 illustrates the interpretation of these model terms. Main effects are so named because they represent the effects of each factor alone, when averaged over the categories of the other factor.

FIGURE 18.3-1 Interaction plots of effects in a hypothetical experiment with two factors (variables) A and B, each having two treatment categories. The title of each panel indicates which effects are present. Dots represent means. Lines connect means of each B group between different A groups. An interaction between A and B is present in the data if the lines are not parallel.

Remember that all subjects in the experiment will have a value for both the A and B variables.
A main effect of A is present in the data when the mean value of the response variable differs among subjects belonging to different A treatment groups, when averaged over subjects' values for the B variable (top left panel of Figure 18.3-1). Similarly, a main effect of B is present in the data when the mean response differs among subjects in different B treatment groups, averaging over their values for the A variable (top right and bottom left panels of Figure 18.3-1). An interaction is present in the data if the magnitude of the difference between A groups differs according to which B group subjects belong, as indicated by nonparallel lines in the interaction plot (bottom left and right panels of Figure 18.3-1).

An example will make these ideas more concrete. We restrict ourselves to an example involving fixed factors. As discussed in Section 15.5, the different categories of a fixed factor are predetermined, of direct interest, and repeatable. The analysis is different when one or more of the factors is random, instead. A random factor is a variable whose groups are not predetermined, but instead are randomly sampled from a "population" of groups (see Section 15.5).

EXAMPLE 18.3 Interaction zone

Harley (2003) investigated how herbivores affect the abundance of plants living in the intertidal habitat of coastal Washington using field transplants of a red alga, Mazzaella parksii. The experiment also examined whether the effect of herbivores on the algae depended on where in the intertidal zone the plants were growing. Thirty-two study plots were established just above the low-tide mark, and another 32 plots were set at mid-height between the low- and high-tide marks. The plots were cleared of all organisms, and then a constant amount of new algae was glued onto the rock surface in the center of each plot. Using copper fencing, herbivores (mainly limpets and snails) were excluded from a randomly chosen half of the plots at each height. The remaining plots were left accessible to herbivores. The design was balanced and included every combination of the height and herbivory treatments (the resulting ANOVA is summarized in Table 18.3-1). At the end of the experiment, the surface area covered by the algae (in cm²) was measured in each plot. The data were square-root transformed to improve the fit to the assumptions of normal distributions with equal variance. Means and standard errors are shown in Figure 18.3-2. Figure 18.3-3 shows the data.

TABLE 18.3-1 ANOVA results of fitting the two-factor model to the herbivory data.

Source of variation    Sum of squares    df    Mean square    F        P
HERBIVORY               1,512.18          1    1,512.18        6.36    0.014
HEIGHT                     88.97          1       88.97        0.37    0.543
HERBIVORY*HEIGHT        2,616.96          1    2,616.96       11.00    0.002
Residual               14,270.52         60      237.84
Total                  18,488.63         63

FIGURE 18.3-2 Mean surface area of algae at every combination of treatments. Original units are cm², and n = 16 in each treatment combination. The data are shown in Figure 18.3-3.

FIGURE 18.3-3 Visual depiction of the fit of the full model (right panel) compared with the fit of the null model (left panel) lacking HERBIVORY*HEIGHT, the interaction term. Horizontal lines are predicted values under each model. Red circles indicate mid-height above low tide, whereas black squares indicate low height. Points in each combination of herbivory and height treatments are spread out to reduce overlap.

The experiment had two factors, herbivory treatment and height.
Figure 18.3-2 suggests that an interaction between these factors might be present, because herbivory treatment had a strong effect on algal cover at low height, but little or even an opposite effect at mid-height. We can fit a linear model to the data to decide which effects are statistically significant.

Model formula

We'll use ALGAE to indicate the response variable, the square-root-transformed algal cover. HERBIVORY and HEIGHT refer to the two factors of interest. Finally, HERBIVORY*HEIGHT will indicate the interaction between HERBIVORY and HEIGHT. The full general linear model is then written as

ALGAE = CONSTANT + HERBIVORY + HEIGHT + HERBIVORY*HEIGHT.

This formula captures every effect that can be examined from the data. A significant HERBIVORY term would indicate that the growth rate of algae differs between HERBIVORY treatments, when averaged over HEIGHT categories. A significant HEIGHT term would indicate that algal growth differs between HEIGHT levels, when averaged over the HERBIVORY treatments. HEIGHT and HERBIVORY are known as the main effects because each represents the effects of that factor alone, when averaged over the categories of the other factor. However, the overall effect of herbivory and of height treatments also includes their contribution to the interaction. The HERBIVORY*HEIGHT term in the model represents the differences in slope between line segments in the interaction plot (Figure 18.3-2). The interaction term will be different from zero if the effect of HERBIVORY on algal growth is different for different values of HEIGHT.

Testing the factors

An F-test from an analysis of variance is used to examine the improvement in fit of the model to the data when each main effect or interaction is present in the model, compared to when it is absent. There are three sets of null and alternative hypotheses to be tested.

HERBIVORY (main effect):
H0: There is no difference between herbivory treatments in mean algal cover.
HA: There is a difference between herbivory treatments in mean algal cover.

HEIGHT (main effect):
H0: There is no difference between height treatments in mean algal cover.
HA: There is a difference between height treatments in mean algal cover.

HERBIVORY*HEIGHT (interaction effect):
H0: The effect of herbivory on algal cover does not depend on height in the intertidal zone.
HA: The effect of herbivory on algal cover depends on height in the intertidal zone.

Each of these sets of hypotheses is tested by comparing the fit of the full model to the data with the fit when the term of interest is deleted from the model. To test the interaction term, for example, the fit of the full model is compared with a null model in which the main effects are present but the interaction is absent:

ALGAE = CONSTANT + HERBIVORY + HEIGHT.

The fits of these two models are depicted in Figure 18.3-3, and a code sketch of the comparison follows below. The model including the interaction term (right panel of Figure 18.3-3) suggests that herbivory treatment has little effect at mid-height, but a substantial effect at low height. Each combination of herbivory and height treatment represents a separate group in Figure 18.3-3. Under the full model (right panel), which includes the HERBIVORY*HEIGHT interaction term, the predicted values are the group sample means. The null model (Figure 18.3-3, left panel) lacks the interaction term, which constrains the difference between predicted values at low and mid-heights to be the same in both herbivory treatments.
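Here is the promised sketch of the interaction test, again in Python with statsmodels. Because the plot-level measurements are not reproduced in this example, the data frame below is filled with random placeholder values that merely copy the structure of the design (2 × 2 treatments, n = 16 plots per combination); the real measurements would be needed to reproduce Table 18.3-1:

```python
# A sketch of the factorial interaction test of Example 18.3, on random
# placeholder data shaped like the real design.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(183)
algae = pd.DataFrame({
    "HERBIVORY": np.repeat(["excluded", "present"], 32),
    "HEIGHT":    np.tile(np.repeat(["low", "mid"], 16), 2),
})
algae["ALGAE"] = rng.normal(25, 15, size=64)  # placeholder sqrt-cover values

# Full model: main effects plus interaction. In formula notation,
# A * B expands to A + B + A:B, where A:B is the interaction term.
full = smf.ols("ALGAE ~ HERBIVORY * HEIGHT", data=algae).fit()

# Null model for the interaction test: main effects only.
null = smf.ols("ALGAE ~ HERBIVORY + HEIGHT", data=algae).fit()

# F-test of the HERBIVORY*HEIGHT interaction term.
print(sm.stats.anova_lm(null, full))
```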
This constrained fit is the best possible one in which the lines connecting means from different height treatments are parallel. It leads to greater residuals (i.e., greater vertical distances between points and corresponding predicted values) under the null model than under the full model. According to an F-test (Table 18.3-1), the interaction term is indeed significant, so the null hypothesis of no interaction is rejected (F = 11.00; df = 1,60; P = 0.002).

The other two F-ratios in Table 18.3-1 test the significance of the main effects, HERBIVORY and HEIGHT. Each test compares the fit of the full model with that of a null model in which the term of interest is removed (but all other terms are still included).5 The results show that the null hypothesis of no HERBIVORY main effect is also rejected (F = 6.36; df = 1,60; P = 0.014). This effect is evident in the interaction plot (Figure 18.3-2) as a higher overall mean algal cover in the absence of herbivores than in their presence, averaged over height categories. No significant main effect of HEIGHT was detected (F = 0.37; df = 1,60; P = 0.543). This is just the main effect, however; HEIGHT still interacted with HERBIVORY to influence algal growth (Figure 18.3-2). Height in the intertidal zone has its effects by changing the way herbivores affect algal cover.

The importance of distinguishing fixed and random factors

F-tests to compare the fits of null and alternative models to the data are straightforward when all factors are fixed, as in Example 18.3. When one or more factors are random, however, a subtle change occurs that affects how F-ratios are calculated. The change is required because the groups are randomly sampled in the case of the random factor, which contributes extra sampling error to the design. This sampling error adds noise to the measurement of differences between group means for other factors that interact with the random factor. The F-ratios must be calculated differently to compensate. Most statistics packages assume that all factors are fixed until instructed otherwise. Designating factors as random takes some extra work on your part (you might need to consult the manual of your statistics package to figure out how to do this). If factors are not properly identified as fixed or random, the results given by the computer program will be wrong. Consult more advanced statistics references, such as Sokal and Rohlf (2012), for details on how random factors influence the expected mean squares for treatment effects.

Adjusting for the effects of a covariate

Our last application is a general linear model having one categorical explanatory variable and one numerical explanatory variable, with a numerical response variable. The method is also called analysis of covariance (ANCOVA). The method is often used to investigate whether linear regressions fitted to data sampled from two or more groups have the same slope. Here we use the method to adjust the effects of a categorical treatment variable to account for a known numerical confounding variable, often called the covariate. Confounding variables bias estimates of treatment effects, as we described in Interleaf 4 and Chapter 14. Experimental studies eliminate such biases by randomly assigning treatments to experimental units, but often experiments are not feasible. A frequent strategy in an observational study is to include known confounding variables in the analysis and to "correct" for their distorting influence on the estimation of the treatment effect.
The model that we would like to fit is

RESPONSE = CONSTANT + COVARIATE + TREATMENT.

This model contains no term for the interaction between the covariate and the treatment or factor. Adjusting for the effect of a covariate is simpler when no interaction is present. The effect of treatment changes with the value of the covariate when an interaction is present, complicating the effort to adjust for the covariate. The usual strategy is then to fit the data in a two-stage process. In the first round, the interaction between the treatment and the covariate is included in the linear model and tested. If no significant interaction is detected, the interaction term is dropped from the model in the second round, in which the treatment effect is estimated and tested. Failure to detect an interaction does not confirm that an interaction is truly absent, but the result is often used to justify using a linear model without an interaction term. As usual, graphing the data can help to decide whether this strategy is a sensible one in each particular case.

EXAMPLE 18.4 Mole-rat layabouts

Mole rats are the only known mammals with distinct social castes. A single queen and a small number of males are the only reproducing individuals in a colony. Remaining individuals, called workers, gather food, defend the colony, care for the young, and maintain the burrows. Recently, it was discovered that there might be two worker castes in the Damaraland mole rat (Cryptomys damarensis). "Frequent workers" do almost all of the work in the colony, whereas "infrequent workers" do little work except on rare occasions after rains, when they extend the burrow system. To assess the physiological differences between the two types of workers, Scantlebury et al. (2006) compared daily energy expenditures of wild mole rats during a dry season. Energy expenditure appears to vary with body mass in both groups (Figure 18.4-1), but infrequent workers are heavier than frequent workers. How different is mean daily energy expenditure between the two groups when adjusted for differences in body mass?

FIGURE 18.4-1 Log-transformed daily energy expenditure and body mass of "frequent workers" (open circles, n1 = 21) and "infrequent workers" (red-filled circles, n2 = 14) of Damaraland mole rats in a dry season. Original units for the two measurements are kJ/day and g. Predicted values in the right panel include the CASTE*MASS interaction term, whereas the null model (the left panel) lacks the interaction term.

Testing interaction

To analyze these data, we used the log transformation of daily energy expenditure and body mass to improve the fit to the assumptions of general linear models. We'll call the log-transformed response variable ENERGY and the log-transformed body mass MASS. The factor of interest is CASTE, whereas MASS is the covariate. All together, the full general linear model is

ENERGY = CONSTANT + CASTE + MASS + CASTE*MASS.

CASTE*MASS indicates the interaction. The MASS variable is numerical, whereas the CASTE variable is categorical. The model describes a linear regression of ENERGY on MASS, separately for each category of CASTE, as shown by the predicted values for each worker caste in Figure 18.4-1 (right panel). The question of interest is whether energy expenditure differs between worker castes after adjusting for their differences in body size. The answer is easiest to obtain if we can assume that the regression lines predicting energy expenditure from body mass have the same slope in the two castes (left panel of Figure 18.4-1).
Therefore, the usual first step when adjusting for a covariate is a test of equal slopes. This is a test of the interaction term in the general linear model.

A general linear model with one numerical and one categorical explanatory variable fits separate regression lines to each group of the categorical variable. A test of the interaction term between the two variables is a test of whether the slopes of the regression lines are the same for all groups of the categorical variable.

The hypotheses for the interaction term are as follows.

H0: There is no interaction between caste and body mass.
HA: There is an interaction between caste and body mass.

To test the hypotheses, we compared the fit of the full model, which contains the CASTE*MASS interaction term, with that of the null model (ENERGY = CONSTANT + CASTE + MASS), which lacks the interaction. The fit of the null model to the data is illustrated in the left panel of Figure 18.4-1. Without an interaction term, the regression lines for worker castes have the same slope. To compare the two models, we focus exclusively on the F-test of the interaction term (CASTE*MASS) in Table 18.4-1, the ANOVA table of results. The interaction term is not statistically significant (F = 1.02; df = 1,31; P = 0.321). The null hypothesis of equal slopes (no interaction) is therefore not rejected by the data.

TABLE 18.4-1 ANOVA table for the general linear model fitted to the mole-rat data. We test only the interaction term in this round.

Source of variation    Sum of squares    df    Mean square    F       P
CASTE                   0.0570            1    0.0570
MASS                    1.3618            1    1.3618
CASTE*MASS              0.0896            1    0.0896         1.02    0.321
Residual                2.7249           31    0.0879
Total                   4.2333           34

The test result does not mean that the slopes are truly equal—it is not wise to "accept" a null hypothesis just because you have failed to reject it. But the assumption that the slopes are equal seems reasonable to make at this point, and the data do not contradict it: Figure 18.4-1 suggests that the assumption is reasonable.

Fitting a model without an interaction term

If we can assume that there is no difference between the regression slopes, then we can fit a model without an interaction term:

ENERGY = CONSTANT + CASTE + MASS.

The fit of this model is illustrated in Figure 18.4-1 (left panel). Now, for the second round of the analysis, the hypotheses are as follows.

H0: Castes do not differ in energy expenditure.
HA: Castes differ in energy expenditure.

The test involves comparing the fit of the "full" model (ENERGY = CONSTANT + CASTE + MASS) with that of the null model lacking the CASTE term (ENERGY = CONSTANT + MASS). This null model is fitted by a single linear regression of energy on mass, calculated on all of the data combined. Both models include MASS, so the test of differences between castes is "adjusted" for mass differences. The ANOVA results are listed in Table 18.4-2.

TABLE 18.4-2 ANOVA table for the general linear model without an interaction term fitted to the mole-rat data.

Source of variation    Sum of squares    df    Mean square    F        P
MASS                    1.8815            1    1.8815         21.39    <0.001
CASTE                   0.6375            1    0.6375          7.25     0.011
Residual                2.8145           32    0.0880
Total                   5.3335           34

The F-ratio for CASTE is significant (F = 7.25; df = 1,32; P = 0.011), confirming that the two worker castes differ in their mean daily energy expenditure after adjusting for body mass. The magnitude of the difference is reflected by the vertical gap between the regression lines for the two castes (Figure 18.4-1, left panel). Infrequent workers expend less energy than frequent workers during the dry season.
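The two rounds of this analysis are easy to express in code. The sketch below, in Python with statsmodels, runs on simulated measurements that merely mimic the shape of the mole-rat data (the published values are not reproduced here), so its output will not match Tables 18.4-1 and 18.4-2; the sequence of model comparisons is the point:

```python
# A sketch of the two-stage covariate adjustment of Example 18.4,
# on simulated data shaped like the mole-rat study.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(18)
caste = np.array(["frequent"] * 21 + ["infrequent"] * 14)
mass = np.where(caste == "frequent",
                rng.normal(1.9, 0.1, 35),   # log body mass, lighter caste
                rng.normal(2.1, 0.1, 35))   # heavier caste
energy = (1.0 + 1.2 * mass                  # common slope on MASS
          - 0.3 * (caste == "infrequent")   # a caste effect, no interaction
          + rng.normal(0, 0.2, 35))         # random error
molerats = pd.DataFrame({"CASTE": caste, "MASS": mass, "ENERGY": energy})

# Round 1: test the interaction (equality of slopes).
with_int = smf.ols("ENERGY ~ CASTE * MASS", data=molerats).fit()
no_int   = smf.ols("ENERGY ~ CASTE + MASS", data=molerats).fit()
print(sm.stats.anova_lm(no_int, with_int))   # F-test of CASTE*MASS

# Round 2 (if the interaction is not significant): drop it and test CASTE,
# adjusted for MASS, against a null model containing MASS alone.
mass_only = smf.ols("ENERGY ~ MASS", data=molerats).fit()
print(sm.stats.anova_lm(mass_only, no_int))  # F-test of CASTE
```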
The results for MASS in Table 18.4-2 are also statistically significant (F = 21.39; df = 1,32; P < 0.001), indicating that energy expenditure changes with body mass—the regression slope in Figure 18.4-1 is significantly different from zero.

Keep in mind that this was an observational study. Energy expenditure was statistically "adjusted" using naturally occurring variation in body mass, rather than experimentally induced variation in mass, which might yield a different regression slope and hence a different value for the effect of the factor. The analysis of covariance is still prone to bias resulting from other confounding variables not included in the model. The justification for using the method in observational studies is that bias is reduced by including one (or more) important covariates in the model, but it is not necessarily eliminated.

How would we have proceeded had the data indicated that the regression lines did not have equal slopes and the interaction term should not be dropped from the model? In this event, the difference between castes is not a constant but changes with mass (as illustrated in the right panel of Figure 18.4-1). Adjusting for mass would then require that we specify a value of body mass at which to estimate caste differences.

Our use of a linear model without an interaction term to adjust for a confounding variable does not imply that nonsignificant terms should usually be dropped from general linear models fitted to data. Leaving model terms such as an interaction out of a linear model should be done with great caution. A sensible rule is that the analysis should follow the design and purpose of the study. In general, variables that are part of the study design should be retained in the general linear model fitted to the data.

Assumptions of general linear models

The assumptions of general linear models are the same as those for regression and ANOVA.

■ The measurements at every combination of values for the explanatory variables (e.g., every block and treatment combination) are a random sample from the population of possible measurements.
■ The measurements for every combination of values for the explanatory variables have a normal distribution in the corresponding population.
■ The variance of the response variable is the same for all combinations of the explanatory variables.

As in linear regression, the residual plot is a useful technique to evaluate these assumptions in general linear models. In Figure 18.5-1, we present the residual plot for the mole-rat data fitted with the general linear model after dropping the interaction term.

FIGURE 18.5-1 Residual plot for the general linear model without an interaction term fitted to the mole-rat data in Example 18.4.

Residuals in a general linear model have the same interpretation as in linear regression. Each residual is the difference between an observed Y-value and the value of Y predicted by the model. The residual plot has the predicted values along the horizontal axis and the residuals along the vertical axis. Statistical packages on the computer can compute predicted values and residuals for you.
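For instance, continuing the simulated mole-rat sketch above, the predicted values and residuals of a fitted statsmodels model are available directly, and a residual plot takes only a few lines (matplotlib assumed):

```python
# A minimal residual-plot sketch; `no_int` is the ENERGY ~ CASTE + MASS fit
# from the previous sketch, but any fitted OLS model works the same way.
import matplotlib.pyplot as plt

plt.scatter(no_int.fittedvalues, no_int.resid)
plt.axhline(0, linestyle="--")      # reference line at residual = 0
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()
```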
If the assumptions of general linear models are met, then the residual plot should have the following features:

■ A roughly symmetric cloud of points above and below the horizontal line at zero, with a higher density of points close to the line than away from the line
■ Little noticeable curvature as we move from left to right along the horizontal axis
■ Approximately equal variance of points above and below the horizontal line at all predicted values

The residual plot for the mole-rat example meets these criteria reasonably well (Figure 18.5-1), although one or two data points with predicted ENERGY of about 4.3 have very low residuals and might represent outliers. In general, if assumptions are violated, then a log or other transformation of the response variable can sometimes improve the situation, just as in single-factor analysis of variance. When an outlier is present, it is advisable to determine how its presence influences the results. In the present example, deleting the most extreme outlier with predicted ENERGY of 4.3 did not perceptibly change the results.

18.6 Summary

■ Some experiments and observational studies have more than one explanatory variable. These usually can be analyzed by using a general linear model approach, in which the response variable is represented by a linear model plus random error.
■ A model is a mathematical representation of the relationship between a response variable Y and one (or more) explanatory variables.
■ Linear regression is an example of a linear model. General linear models extend the regression approach to include multiple explanatory variables that may be numerical or categorical.
■ The general linear model approach begins with a statement of the model to be fitted to the data.
■ The F-ratio is used to test whether including a term of interest in the general linear model results in a significant improvement in the fit of the model to the data, compared with the fit of the null model lacking the term.
■ Graphical displays of model fits and the data are a valuable tool for evaluating the magnitude of effects.
■ General linear models assume that every combination of values of the explanatory variables has a random sample of Y-values from a population having a normal distribution with equal variance.
■ After fitting a model to the data, a plot of the residuals against the predicted values (i.e., a residual plot) is a useful method to evaluate whether the assumptions of general linear models are met.
■ Experiments with blocking include the block as an explanatory variable in the general linear model statement.
■ Model statements for factorial designs include the main effects of the factors and their interaction.
■ F-ratios are calculated differently when a factorial design includes one or more random factors, compared with when only fixed factors are present. It is important to make sure that, when using statistical programs on the computer, the correct designation of fixed or random is associated with each factor.
■ A general linear model with one numerical and one categorical explanatory variable (also called analysis of covariance, or ANCOVA) fits separate regression lines to each group of the categorical variable. An interaction between the variables means that the regression slopes differ among the groups.
■ Analysis of covariance is often used to adjust for known confounding variables, or covariates, when testing treatment effects. The procedure is simplest if we can assume that no interaction term is present.
Testing the interaction term is usually the first step in this analysis.
■ Leaving model terms such as an interaction out of a linear model should be done with great caution. Variables that are part of the study design should usually be retained in the general linear model fitted to the data.

PRACTICE PROBLEMS

1. Examine the accompanying interaction plots. Each is based on hypothetical data from a factorial experiment with two factors A and B. In each case, indicate which of the main effects and interaction are likely to be present, and which are likely to be weak or absent.

2. Rest in fruit flies, Drosophila melanogaster, has many features in common with mammalian sleep. Its study might lead to a better understanding of sleep in mammals, including humans. Hendricks et al. (2001) examined the role of the signaling molecule cyclic AMP (cAMP) by comparing the mean number of hours of resting in six different mutant or transgenic fly lines having different levels of cAMP expression. Measurements are hours per 24 hours divided by the mean of "wild type" flies. Means (±SE) are shown for the different fly lines in the accompanying graph.
a. Write the statement of the general linear model to be fit to these data to compare means between groups. Indicate what each term in the model represents.
b. Write the corresponding statement for the null hypothesis of no differences between mutant lines.
c. Using a ruler, add the predicted values for each model to the figure (approximate positions will suffice).
d. What test statistic should be used to test whether the null model should be rejected in favor of the alternative?

3. A study of the Magellanic penguin (Spheniscus magellanicus) measured stress-induced levels of the hormone corticosterone in chicks living in either tourist-visited areas or undisturbed areas of a breeding colony in Argentina (Walker et al. 2005). Chicks at three stages of development were included in the study—recently hatched, midway through growth, and close to fledging. Penguin chicks were stressed (captured) by the researchers and their corticosterone concentrations were measured 30 minutes later. The following graph diagrams the mean hormone concentrations for the three age groups of chicks from tourist-visited (filled circles) and undisturbed (open circles) areas of the colony.
a. What is the response variable?
b. What are the explanatory variables?
c. The line segments in this plot are not parallel. What does this suggest?
d. Is this an observational or experimental study? Explain.
e. Did the study use a factorial design? Explain.

4. Refer to Practice Problem 3.
a. Write a complete model statement for a general linear model to fit to the penguin data. Indicate what each term in the model represents.
b. What are the null hypotheses tested in the ANOVA table for the general linear model?
c. What are the assumptions of your analysis?

5. Give three reasons that studies in biology sometimes have more than one explanatory variable.

6. Evidence is mounting that a part of the brain known as the hippocampus is crucial for tasks that depend on relating information from multiple sources, such as tasks requiring spatial memory. Broadbent et al. (2004) tested this experimentally by surgically inducing lesions of different extent in the hippocampus of rats and measuring the subsequent memory performance of the rats in a maze. Their data are plotted in the accompanying graph.
a. Write a model statement for a general linear model to fit to these data.
Indicate what each term in the model represents.
b. Write the corresponding statement for the null model.
c. Using a ruler, add the predicted values for each model to the figure (approximate positions will suffice).

7. The ejaculate of male Drosophila contains a protein, called SP, that reduces the life span of mated females. How the protein does this remains mysterious. One possibility is that the protein manipulates females to produce more eggs, and the extra effort reduces their life span. To investigate, Barnes et al. (2008) housed young female flies with fertile young males either intermittently (low-cost treatment) or continuously (high-cost treatment) for the rest of their lives—up to 56 days. Males in the high-cost treatment were replaced every four days with fresh young males to ensure continued, frequent mating. The same happened in the low-cost treatment, except that the fertile males were present on only one day of each four-day cycle; during the other three days, females were given mutant non-mating males, so that male density was always the same in both treatments. Two strains of females were used: fertile and sterile. Sterile females do not produce eggs, and so do not bear a cost of extra egg production. Sample size was 210 females in each combination of treatment (low-cost vs. high-cost) and fertility (fertile vs. sterile). (Sample size was 212 in the low-cost, sterile combination.) The data are shown in the accompanying figure. Lines connect mean life span between the two treatments (fertile females indicated by solid line and filled circles, sterile females by dashed line and open circles).

TABLE FOR PROBLEM 7

Source                   Sum of squares    df    Mean square    F         P
Treatment                 10,244.6          1    10,244.6       150.54    <0.0001
Fertility                  1,001.3          1     1,001.3        14.71     0.0001
Treatment * fertility         24.8          1        24.8         0.36     0.547
Error                     57,027.8        838        68.1
Total                     68,298.5        841

a. What type of experimental design was used?
b. What was the response variable, and what were the explanatory variables?
c. Provide the word statement of the linear model employed.
d. Examine the graph, and state which effects are likely to be present. Does there appear to be a main effect of treatment and a main effect of female fertility? Which main effect is likely to be larger? Is there an interaction between the explanatory variables?
e. The ANOVA results of the general linear model analysis are shown in the accompanying table. What are the null hypotheses being tested?
f. Using this table, determine which main effects were found to be statistically significant. Do the results agree with your assessment in part (d) based on the graph?
g. From the ANOVA table alone, is it possible to determine which females live longer—fertile or sterile females?
h. Based on these results, does the cost of more frequent mating (more frequent in the high-cost treatment than in the low-cost treatment) affect sterile and fertile females similarly? Comment on whether these results support or fail to support the hypothesis that SP reduces female life span by causing females to produce more eggs.

8. The foraging gene (for) has been found to underlie variation in foraging behavior in several insect species. Ben-Shahar et al. (2002) examined whether the gene might influence behavioral differences in the honeybee (Apis mellifera). Worker bees perform tasks in the hive such as brood care ("nursing") when they are young, but switch to foraging for nectar and pollen outside the hive as they age.
The authors compared for gene expression in nurse and foraging worker bees in three bee colonies. The results are compiled in the accompanying table. Gene expression is measured in arbitrary units.

Worker type    Colony    for gene expression
Nurse          1         0.99
Forager        1         1.93
Nurse          2         1.00
Forager        2         2.36
Nurse          3         0.24
Forager        3         1.96

a. Draw an interaction plot for these data.
b. Treating COLONY as a blocking variable, write the statement of a general linear model to fit to these data. Indicate what each term in the model represents.
c. Write the corresponding null model for a test of whether worker types differ in their mean gene expression.
d. Is worker type a random effect or a fixed effect? Explain.
e. What is the purpose of a blocking variable in experimental design?

9. The following table lists the ANOVA results for the general linear model fit to the data in Practice Problem 8.

Source of variation    Sum of squares    df    Mean square    F        P
BLOCK                   0.342             2    0.171
WORKERTYPE              2.693             1    2.693          35.34    0.03
Residual                0.152             2    0.076
Total                   3.187             5

a. Explain in words what the F-ratio for WORKERTYPE measures.
b. Explain in words what the F-ratio for BLOCK measures.
c. The term for BLOCK is not statistically significant. Should it be dropped from the general linear model? Explain.
d. Explain in words what the residuals are.
e. Explain what is plotted along each axis in the following residual plot for the bee data.

ASSIGNMENT PROBLEMS

10. Examine the accompanying plots. Each is based on hypothetical data from an experiment with a categorical factor A and a continuous covariate X. In each case, indicate which of the main effects and interaction are likely to be present, and which are likely to be weak or absent.

11. Langford et al. (2006) investigated whether lab mice experiencing discomfort "empathize" with familiar mice also in discomfort. They conducted an experiment in which individual mice were given an injection of 0.9% acetic acid into the abdomen, causing mild discomfort. These mice were placed in one of three different treatments: (1) isolation, (2) with a familiar companion mouse (a cage mate) that was not injected, or (3) with a familiar companion also injected and exhibiting behaviors associated with discomfort. The response variable was the percentage of time that each treated mouse exhibited a characteristic "stretching" behavior (measured by abdominal constriction) indicative of discomfort. The data on 42 male mice are below.

Isolated: 46.7, 38.9, 65.6, 35.6, 32.2, 30.0, 41.1, 63.3, 0.0, 53.3, 22.2, 48.9, 5.6, 14.4, 46.7, 45.6, 42.2
Companion not injected: 56.7, 51.1, 50.0, 51.1, 44.4, 2.2, 41.1, 33.3, 25.6, 22.2, 14.4, 3.3, 64.4
Companion injected: 36.7, 81.1, 66.7, 66.7, 44.4, 54.4, 63.3, 62.2, 58.9, 50.0, 54.4, 57.8

a. Write a model statement for a general linear model fit to these data. Indicate what each term in the model represents.
b. Write the corresponding statement for the null model.
c. Plot the data and add the predicted values for the null model and the "full" model. If mice empathize, focal mice should stretch most often, on average, when the companion mouse is injected. Is this the pattern in the data?
d. Is the fit of the full model significantly better than that of the null model? Carry out the appropriate hypothesis test.

12. Were Neanderthals smaller-brained than modern humans? Estimates of cranial capacity from fossils indicate that Neanderthals had large brains, but also that they had a large body size. The accompanying graph shows the data from Ruff et al.
(1997) on estimated log-transformed brain and body sizes of Neanderthal specimens (filled circles) and early modern humans (open circles). The goal of the analysis was to determine whether humans and Neanderthals have different brain sizes once their differences in body size are taken into account. The ANOVA results of the model are listed in the following table.

Source of variation    Sum of squares    df    Mean square    F        P
SPECIES                 0.00547           1    0.00547         1.24    0.274
MASS                    0.09810           1    0.09810        22.15    <0.001
SPECIES*MASS            0.00485           1    0.00485         1.09    0.303
Residual                0.15503          35    0.00443
Total                   0.26345          38

a. Examine the ANOVA table and the graph and write the statement of the general linear model that was fit to these data.
b. What does the SPECIES*MASS term represent?
c. What does the F-ratio corresponding to the SPECIES*MASS term represent?
d. What null and alternative hypotheses are tested with the SPECIES*MASS term?
e. What can you conclude from the F-ratio and P-value listed in the table for the SPECIES*MASS term?
f. In view of the goals of the study, what steps would you recommend next to test whether the brain sizes of Neanderthal and early modern humans differ after adjusting for differences in body size?

13. Using a ruler, add the predicted values from the analysis recommended in part (f) of Assignment Problem 12 to the scatter plot (approximate positions will suffice).

14. For each of the following scenarios, draw an interaction plot (like that in Figure 18.3-2) showing the results of a hypothetical experiment having two factors, A and B, each having two groups, in which there is
a. a main effect of A, no main effect of B, and no interaction between A and B;
b. a main effect of A, a main effect of B, and an interaction between A and B;
c. no main effect of A or B, and an interaction between A and B.

15. In a study of the effects of commercial fishing on fish populations, Hsieh et al. (2006) measured the year-to-year coefficient of variation (CV) of larval population sizes of exploited and unexploited fish species in the California current system. They compared the two groups of fish species by using a general linear model that adjusted for differences between exploited and unexploited species in the age of maturation (MATURATION), which also seemed to influence the coefficient of variation. Data for 13 exploited species and 15 unexploited species are shown in the following graph, along with the predicted values of the model for the two groups of fish. The model was

CV = CONSTANT + MATURATION + EXPLOITATION,

where EXPLOITATION is the explanatory variable representing the two groups of fish species (exploited and unexploited). The ANOVA results are listed in the following table.

Source of variation    Sum of squares    df    Mean square    F        P
MATURATION              1.7313            1    1.7313         17.65    0.0003
EXPLOITATION            1.5924            1    1.5924         16.23    0.0005
Residual                2.4518           25    0.0981
Total                   5.7755           27

a. Explain the steps that likely led to the authors using the model analyzed above.
b. What are the assumptions of this analysis?
c. Is there a significant difference between exploited and unexploited fish in their year-to-year coefficients of variation? Explain the basis for your conclusion.

16. The tortoise beetle Deloyala guttata feeds and lays eggs on leaves of the two morning glory species Ipomea pandurata and I. purpurea.

TABLE FOR PROBLEM 16

Family    Mean development time on    Mean development time on
number    I. pandurata (days)         I. purpurea (days)
18        15.1                        14.1
19        14.8                        14.5
25        15.9                        14.0
50        16.9                        17.1
65        14.7                        14.7
66        15.6                        14.4
Rausher (1984) investigated whether there was genetic variation in the population in the relative abilities of beetles to exploit the two plant species. To test this, he randomly sampled six beetle families from a local population by crossing randomly sampled males and females. He then raised half the offspring from each family on leaves of I. pandurata and the other half on I. purpurea. Here we analyze development time (the days from hatching to formation of the pupa) of female offspring from each family. If genetic variation is present in the relative abilities to exploit the two plant species, then there should be an interaction between family and plant species. Means are given in the accompanying table; standard errors were about 0.7.
a. Draw an interaction plot for these results. Briefly describe (in words) the patterns revealed. Is an interaction present? How can you tell?
b. Write a model statement for a "full" general linear model to fit to these data. Explain what each term in the model represents.
c. Which factors in the general linear model are random and which are fixed? Explain.
d. What are the null hypotheses to test in the corresponding ANOVA table?
e. What assumptions are required to test these hypotheses?
f. What does the F-statistic for any given term in the model signify?

17. Females of the yellow dung fly, Scathophaga stercoraria, mate with multiple males, and the sperm of different males "compete" to fertilize her eggs. The last male to mate usually gains a disproportionate number of fertilizations. In a laboratory experiment on male and female dung flies from two populations, one in Switzerland and the other in the United Kingdom (U.K.), Hosken et al. (2002) found that fertilizations by the last male (assessed by DNA fingerprinting) depended on the population of origin of both the male and female. Females were mated to two males in turn, one from each population, and the father of each egg laid was then determined. On average, males from the same population as the female fared worse than foreign males. The following graph, redrawn from Hosken et al. (2002), shows the mean percentage of offspring sired by the second male ±SE.
a. What do we call the type of experimental design that was carried out?
b. Write the statement of a general linear model to fit to these data.
c. Based on the graph, which F-ratios are likely to be greater than one?

18. Does light environment have an influence on the development of color vision? The accompanying data, from Fuller et al. (2010), are measurements of the relative abilities of bluefin killifish from two wild populations to detect short wavelengths of light (blue light in our own visible color spectrum). One population was from a swamp, whose tea-stained water filters out blue wavelengths, whereas the other population was from a clear-water spring. Fish were crossed and raised in the lab under two light conditions simulating those in the wild: clear and tea-stained. Sensitivity to blue light was measured as the relative expression of the SWS1 opsin gene in the eyes of the fish (as a proportion of total expression of all opsins). Opsin proteins in eyes detect light of specific wavelengths; SWS1 is so named because it is shortwave sensitive. The data are from a single individual from each of 33 families.
Because the fish were raised in a common lab environment, population differences are likely to be genetically based, whereas differences between fish under different water clarity conditions are environmentally induced.

Population   Water clarity   Relative expression of SWS1
Spring       Clear           0.16, 0.11, 0.12, 0.11, 0.08, 0.09, 0.14, 0.16
Swamp        Clear           0.08, 0.13, 0.07, 0.12, 0.12, 0.05, 0.06, 0.11
Spring       Tea             0.09, 0.09, 0.08, 0.10, 0.12, 0.06, 0.13, 0.08, 0.11, 0.07
Swamp        Tea             0.06, 0.07, 0.08, 0.08, 0.10, 0.03, 0.03

a. How many factors are included in this experiment? Identify them.
b. What type of experimental design was used?
c. Draw an interaction plot of the data (remember to show the data also).
d. Provide a word statement of a full linear model to fit to the data.
e. Examine your graph from part (c), and state which effects are likely to be present in the results. Say how you reached your conclusions. Explain whether the genetic and environmentally induced effects on SWS1 opsin expression appear to be in the same direction.
f. The ANOVA results of the general linear model analysis are shown in the table below. What null hypotheses are tested?
g. Using this table, indicate which main effects were found to be statistically significant. Do the results agree with your assessment in part (e) based on your graph?

TABLE FOR PROBLEM 18
Source                         Sum of squares   df   Mean square      F       P
Population                        0.00670        1     0.00670      8.98   0.006
Water treatment                   0.00647        1     0.00647      8.68   0.006
Population * water treatment      0.000000       1     0.000000     0.00   0.999
Error                             0.021619      29     0.000745
Total                             0.03479       32

19 Computer-intensive methods

The advent of fast and cheap computers has changed the way statistics is done. Most obviously, graphics and the tedious calculations required for classical statistical techniques can now be done at the touch of a button, so the large amount of time once spent on such tasks can be used to collect more data or drink more coffee. But the computer has allowed more than just speedier calculations of what could already be done. Computers have also made possible new approaches for analyzing data that were not feasible before. This chapter describes two of these methods: simulation and bootstrapping. Simulation is primarily a method for hypothesis testing, whereas bootstrapping is a method for calculating the precision of estimates. These methods are particularly useful when the assumptions of standard statistical methods cannot be met or when no standard method exists. Simulation and bootstrapping both require a large number of calculations, and they are impractical except with the aid of a computer. The value of these techniques is that they can be applied to almost any type of statistical problem. In this chapter, we describe simulation and bootstrapping using relatively simple examples. We emphasize the conceptual basis of each method without assuming that the reader is adept at computer programming.

Hypothesis testing using simulation

Simulation is a computer-intensive method used in hypothesis testing, where the major challenge is to determine the null distribution—that is, the sampling distribution of the test statistic when the null hypothesis is true. In many situations, this distribution can be calculated from probability theory, and simulation is not needed. In some situations, however, the null distribution is too difficult to calculate from theory. Computer simulation of the sampling process can be an excellent way of getting an approximation of the null distribution in these cases.
Simulation uses a computer to mimic—or "simulate"—sampling from a population under the null hypothesis.

Simulation uses a computer to imitate the process of repeated sampling from a population to approximate the null distribution of a test statistic.

With simulation, we use a computer to create an imaginary population whose parameter values are those specified by the null hypothesis. We then use the computer to simulate sampling from this imaginary population, using the same protocol as was used to collect the real data. Each time we take a simulated sample, we use it to calculate the test statistic. We repeat this simulation process a large number of times. The frequency distribution of values obtained for the test statistic from all of the simulations is an approximate null distribution for our hypothesis test. We have used simulation already without saying so. In Example 6.2, we used computer simulation to generate the null distribution for the number of right-handed toads in a random sample of 18 toads under the null hypothesis that the proportion of right-handed toads in the population was 0.5. Simulation was unnecessary in that case because, as we now know, the binomial test is faster, easier, and gives an exact P-value. For Example 19.1, however, simulation is the best or only option.

EXAMPLE 19.1 How did he know? The non-randomness of haphazard choice

Some stage performers claim to have real telepathic powers—the ability to read minds. However, these powers can be convincingly faked. In one example, the performer asks every member of the audience to think of a two-digit number. After a show of pretending to read their minds, the performer states a number that a surprisingly large fraction of the audience was thinking of. This feat would be surprising if the people thought of all two-digit numbers with equal probability, but not if people everywhere tend to pick the same few numbers. Figure 19.1-1 shows the numbers chosen independently by 315 volunteers (Marks 2000). Are all two-digit numbers selected with equal probability?

FIGURE 19.1-1 The distribution of two-digit numbers chosen by volunteers.

According to the histogram in Figure 19.1-1, the two-digit numbers chosen don't seem to occur with equal probability at all. A large number of people chose numbers in the teens and twenties, with few choosing larger numbers. Almost nobody chose multiples of ten. Let's use these data to test the null hypothesis that every two-digit number occurs with equal probability. The hypotheses are as follows.

H0: Two-digit numbers are chosen with equal probability.
HA: Two-digit numbers are not chosen with equal probability.

To analyze this problem, our first thought might be to use a χ2 goodness-of-fit test (Chapter 8). There are 90 categories of outcome (each of the 90 integers between 10 and 99). The expected frequency of occurrence of each category is 315/90 = 3.5, and the observed frequencies are those shown in Figure 19.1-1. The resulting value of the test statistic is χ2 = 1231.0. If this number exceeds the critical value for the null distribution at α = 0.05, then we can reject H0. But here we run into a problem. The expected frequency of 3.5 for each category violates the requirements of the χ2 test that no more than 20% of the categories should have expected values less than five. As a result, the null distribution of the test statistic χ2 is not a χ2 distribution, so we cannot use that distribution to calculate a P-value.
How do we determine the null distribution of our test statistic so that we can get a P-value? One possible solution is to use computer simulation to generate the null distribution for the test statistic. Here's how it's done, in five steps:

1. Use a computer to create and sample an imaginary population whose parameter values are those specified by the null hypothesis. In the case of the mentalist's numbers, simulating a single sample involves drawing 315 two-digit numbers between 10 and 99 at random and with equal probability. Each simulated sample must have 315 numbers, because 315 is the sample size of the real data. The first row of Table 19.1-1 lists the first 12 numbers of our first simulated sample of 315 numbers.

TABLE 19.1-1 A subset of the results of a simulation. Each row has the first 12 of 315 numbers randomly sampled from an imaginary population in which all numbers between 10 and 99 occur with equal probability. The last column has the χ2 statistic calculated on each simulated sample of 315 numbers.

Simulation number   Simulated sample of 315 numbers (first 12 of each sample shown)   Test statistic χ2
 1    32 42 27 74 78 86 98 71 28 50 41 54 ...     86.1
 2    38 63 98 36 88 10 74 35 62 52 90 48 ...     80.5
 3    52 45 59 44 44 25 94 29 27 64 24 47 ...     95.4
 4    69 55 31 42 35 48 52 37 91 40 67 67 ...     78.4
 5    11 66 32 11 76 96 73 20 64 40 37 49 ...     87.7
 6    46 15 87 23 92 18 91 26 31 23 40 51 ...     96.9
 7    61 35 58 33 58 82 67 95 16 64 59 64 ...     95.4
 8    47 54 16 39 91 68 49 57 10 21 79 51 ...    106.7
 9    21 84 30 66 81 13 16 18 81 91 52 95 ...     92.3
10    88 26 48 44 34 72 89 14 98 35 99 54 ...    100.5
...

2. Calculate the test statistic on the simulated sample. For the number-choosing example, we have decided to use χ2 as the test statistic. This statistic does not necessarily have a χ2 distribution under H0, because of the small expected frequencies, but χ2 is still a suitable measure of the fit between the data and the null hypothesis. In our first simulated sample, the χ2 value turned out to be 86.1 (Table 19.1-1).

3. Repeat steps 1 and 2 a large number of times. We repeated the simulated sampling process 10,000 times, calculating χ2 each time. (Typically a simulation should involve at least 1000 replicate samples.) Table 19.1-1 shows a subset of outcomes for the first 10 of our simulated samples.

4. Gather all of the simulated values for the test statistic to form the null distribution. The distribution of simulated test statistics can be used as the null distribution of the test statistic. The frequency distribution of all 10,000 values for χ2 that we obtained from our example simulations is plotted in Figure 19.1-2. This is our approximate null distribution for the χ2 statistic.

FIGURE 19.1-2 The null distribution for the χ2 statistic based on 10,000 simulated random samples from an imaginary population conforming to the null hypothesis (Example 19.1). The test statistic calculated from the data, χ2 = 1231.0, is far greater than for any of the simulated samples.

5. Compare the test statistic from the data to the null distribution. We use the simulated null distribution to get an approximate P-value. In an ordinary χ2 goodness-of-fit test, the P-value for the χ2 statistic is the probability under the null distribution of obtaining a χ2 statistic as large or larger than the observed value of the test statistic.
The same is true with a simulated null distribution for χ2: the P-value is approximated by the fraction of simulated values for χ2 that equal or exceed the observed value of χ2 (in our case, χ2 = 1231.0). According to Figure 19.1-2, none of the 10,000 simulated χ2 values exceeded the observed χ2 statistic. This means that the approximate P-value is less than 1 in 10,000 (i.e., P < 0.0001).1 To be more precise, we would have to run more simulations. These results show that when people choose numbers haphazardly, the outcome is highly non-random. This can make mentalists appear to have telepathic powers when the only power they possess is that of statistics.
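The five steps translate almost line for line into a short program. The sketch below is ours, not part of the original analysis; it assumes the observed test statistic χ2 = 1231.0 reported above and uses 10,000 simulated samples under the equal-probability null hypothesis.

```python
import numpy as np

rng = np.random.default_rng()
n_choices, n_categories = 315, 90      # 315 volunteers; two-digit numbers 10-99
expected = n_choices / n_categories    # 3.5 expected choices per number

def chi2_stat(sample):
    # Observed frequency of each number from 10 to 99
    observed = np.bincount(sample, minlength=100)[10:100]
    return np.sum((observed - expected) ** 2 / expected)

# Steps 1-4: simulate 10,000 samples under H0 and collect the test statistics
null_chi2 = np.array([chi2_stat(rng.integers(10, 100, size=n_choices))
                      for _ in range(10_000)])

# Step 5: the P-value is the fraction of simulated values >= the observed one
observed_chi2 = 1231.0
print(np.mean(null_chi2 >= observed_chi2))   # ~0.0, i.e., P < 0.0001
```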
Bootstrap standard errors and confidence intervals

The bootstrap is a computer-intensive procedure used to approximate the sampling distribution of an estimate. Bootstrapping creates this sampling distribution by taking new samples randomly and repeatedly from the data themselves. Unlike simulation, the bootstrap is not directly intended for testing hypotheses. Instead, the bootstrap is used to find a standard error or confidence interval for a parameter estimate. The bootstrap is especially useful when no formula is available for the standard error or when the sampling distribution of the estimate of interest is unknown.

Bootstrapping uses resampling from the data to approximate the sampling distribution of an estimate.

Recall from Section 4.1 that the sampling distribution is the probability distribution of sample estimates when a population is sampled repeatedly in the same way. The standard error is the standard deviation of this sampling distribution. In principle, therefore, we might obtain a standard error of an estimate by taking repeated samples from the population, calculating the sample estimate each time, and then taking the standard deviation of the many sample estimates. In reality, however, we can't sample the population repeatedly: collecting data is expensive, and if we had more data we would do better to combine them all into one larger sample. However, if our sample from the population is large, then we do have easy access to a part of the population—namely, the part that was already sampled. Bootstrapping is a kind of repeated sampling, but instead of taking individuals from the population directly, we use a computer to draw the samples from the data, a procedure called "resampling." If the data set is large enough, then bootstrap samples drawn in this way will have statistical properties very similar to the distribution of possible sample estimates obtained from the population itself. The bootstrap is therefore a bit strange: we resample from the data itself to generate many new data sets, and from these we infer the sampling distribution of the estimate. If you think about it, this is almost cheating, because we use the one and only data set to infer the distribution of estimates from all possible data sets. Hence the name "bootstrap," coming from the idea of picking yourself up by your own bootstraps.2 The method was proposed by Bradley Efron in 1979, when desktop computers started to become available. The bootstrap is now commonly used in biology and other sciences. Example 19.2 shows how to calculate a standard error and a confidence interval using the bootstrap. This particular example estimates a median, a simple quantity for which it is otherwise difficult to calculate a sampling distribution. Bootstrapping, however, can be applied to essentially any type of estimate.3

EXAMPLE 19.2 The language center in chimps' brains

One of the things that makes humans different from other organisms is our well-developed capacity for complex speech. Chimps and gorillas can learn some rudimentary language, but with a capacity far below that of humans. Speech production in humans is associated with a part of the brain called "Brodmann's area 44," which is part of Broca's area. In humans, this area is larger in the left hemisphere of the brain than in the right, and this asymmetry has been shown to be important for language development. With the advent of magnetic resonance imaging (MRI), it is possible to ask whether this area is asymmetric in other apes' brains as well. A sample of 20 chimpanzees was scanned with MRI, and the asymmetry of their Brodmann's area 44 was recorded (Cantalupo and Hopkins 2001). The asymmetry score is the left-side measurement minus the right, divided by the average of the two sides. The raw data are listed in Table 19.2-1. The sample median asymmetry score was 0.14. We want to quantify the uncertainty of this estimate of the population median by calculating its standard error.

TABLE 19.2-1 Asymmetry scores for Brodmann's area 44 in 20 chimpanzees.

Name of chimp    Asymmetry score
Austin                0.30
Carmichael            0.16
Chuck                −0.24
Dobbs                −0.25
Donald                0.36
Hoboh                 0.17
Jimmy Carter          0.11
Lazarus               0.12
Merv                  0.34
Storer                0.32
Ada                   0.71
Anna                  0.09
Atlanta               1.12
Cheri                −0.22
Jeannie               1.19
Kengee                0.01
Lana                 −0.24
Lulu                  0.24
Mary                 −0.30
Panzee               −0.16

The frequency distribution of asymmetry scores shown in Figure 19.2-1 is skewed to the right and might even be bimodal. A transformation of these data would be difficult to find because the range of values includes negative numbers. What to do? Bootstrapping provides a suitable approach.

FIGURE 19.2-1 The frequency distribution of asymmetry scores for Brodmann's area 44 in 20 chimpanzees. A negative score indicates that the area is larger on the right side of the chimp's brain, while chimps with positive scores show a larger area in the left hemisphere.

Bootstrap standard error

To generate a bootstrap standard error, there are four steps to follow. First, we list the steps all at once here, and then we go through them again with the data.

1. Use the computer to take a random sample of individuals from the original data. Each individual in the data has an equal chance of being sampled. The bootstrap sample should contain the same number of individuals as the original data. Each time an observation is chosen, it is left available in the data set to be sampled again, so the probability of it being sampled remains unchanged.4
2. Calculate the estimate using the measurements in the bootstrap sample from step 1. This is the first bootstrap replicate estimate.
3. Repeat steps 1 and 2 a large number of times (10,000 times is reasonable). The frequency distribution of all bootstrap replicate estimates approximates the sampling distribution of the estimate.
4. Calculate the sample standard deviation of all the bootstrap replicate estimates obtained in steps 1–3. The resulting quantity is called the bootstrap standard error.

The bootstrap standard error is the standard deviation of the bootstrap replicate estimates obtained from resampling the data.

The last point is worth repeating: the standard error is the standard deviation of the sampling distribution of estimates.5 We can now apply these four steps to the chimp data.
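In code, the four steps are compact. The sketch below is our own illustration in Python, using the 20 asymmetry scores from Table 19.2-1; it also computes the percentile confidence interval described below.

```python
import numpy as np

rng = np.random.default_rng()
scores = np.array([0.30, 0.16, -0.24, -0.25, 0.36, 0.17, 0.11, 0.12,
                   0.34, 0.32, 0.71, 0.09, 1.12, -0.22, 1.19, 0.01,
                   -0.24, 0.24, -0.30, -0.16])     # Table 19.2-1

# Steps 1-3: resample with replacement, same size as the data, 10,000 times,
# calculating the sample median of each bootstrap sample
boot_medians = np.array([
    np.median(rng.choice(scores, size=scores.size, replace=True))
    for _ in range(10_000)
])

# Step 4: the bootstrap SE is the standard deviation of the replicate
# estimates (about 0.085 for these data)
print(boot_medians.std(ddof=1))

# Percentile 95% confidence interval: the 0.025 and 0.975 quantiles
# (about -0.075 to 0.31 for these data)
print(np.quantile(boot_medians, [0.025, 0.975]))
```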
There are 20 data points in the sample, so each bootstrap sample must also have 20 measurements. Each of the 20 measurements in the bootstrap sample is chosen with equal probability from the values in the original data. Applying step 1, the following is the first bootstrap replicate that we obtained:

0.24 0.36 0.30 0.16 0.34 −0.24 0.30 1.19 0.32 0.32 0.36 0.01 0.01 0.11 0.11 −0.25 0.12 0.32 −0.24 0.17

Each of the measurements in this first bootstrap sample is present in the original data set. By chance, some of the original data points are present more than once in the bootstrap sample. For example, the score 0.32 (from the chimp named Storer) is present three times. Also by chance, some of the original data points are absent from this first bootstrap sample. For example, the score 0.71 (from the chimp named Ada) was not sampled. The sample median of this bootstrap sample is 0.205, so this is our first bootstrap replicate estimate of the median asymmetry score (step 2). We repeated this process 10,000 times, calculating the sample median of the measurements each time (step 3). Figure 19.2-2 shows the frequency distribution of the bootstrap replicate estimates of the sample median.

FIGURE 19.2-2 The distribution of 10,000 bootstrap replicate estimates for the median asymmetry of Brodmann's area 44 in chimpanzees.

The mean of the bootstrap replicate estimates is 0.142, which is very close to the estimated median from the original data (0.14). Remember that the bootstrap procedure is calculating a sampling distribution for an estimate, not a null distribution for a hypothesis test. As such, the overall mean of the bootstrap replicate estimates should be close to the estimate first calculated on the original data.6 The standard deviation of these bootstrap replicate estimates is 0.085 (step 4). This is the bootstrap standard error of our sample median: SE = 0.085. Because the bootstrap samples come from the data, which generally do not represent the full population, the bootstrap standard error tends to be slightly smaller than the true standard error. This effect is negligible when the sample size is large.

Confidence intervals by bootstrapping

The approximate sampling distribution generated by the bootstrap can also be used to calculate an approximate confidence interval for the population parameter. We present the most commonly used method here.7 The bootstrap 1 − α confidence interval can be obtained from the bootstrap sampling distribution by finding the points that separate α/2 of the distribution into each of the left and right tails. In other words, an approximate 95% confidence interval ranges from the 0.025 quantile to the 0.975 quantile of the bootstrap sampling distribution. For example, let's compute a 95% confidence interval for the population median asymmetry using the bootstrap sampling distribution displayed in Figure 19.2-2. To determine the lower bound of the 95% confidence interval, we must find the 0.025 quantile—the 250th (i.e., 0.025 × 10,000) value in the sorted list of bootstrap replicate estimates. In our example, after sorting the 10,000 bootstrap replicate estimates from smallest to largest, the 250th value was −0.075. To determine the upper bound of the confidence interval, we must find the 0.975 quantile—the value that is equaled or exceeded by only 2.5% of the sorted bootstrap estimates; in other words, the 9751st value in the sorted list (note that 250 of the 10,000 values are equal to or greater than the 9751st).
For the chimp brain data, this number is 0.31. As a result, the 95% bootstrap confidence interval for the population median asymmetry of Brodmann's area 44 in chimps is

−0.075 < median < 0.31.

This is a relatively wide confidence interval, indicating that the data are consistent with a broad range of possible values for the population median asymmetry of Brodmann's area 44 in chimps. This asymmetry of brain structure, thought to be so important for language development in humans, might have a median as low as zero (or even very slightly larger on the right side than on the left side of the brain) or as large as 0.31 in our closest relative.8

Bootstrapping with multiple groups

Bootstrapping can be used for just about any statistic that can be estimated and for which conventional methods are unavailable. For example, bootstrapping may be used to obtain standard errors and confidence intervals for measures of the difference between groups. Here we show a bootstrap method to compare two groups whose frequency distributions are not normally distributed and might not have the same shape. The data from each group are resampled separately, and the bootstrap sampling uses the original sample size of each group. This bootstrap procedure generates "new" data sets that mimic sampling repeatedly from the original populations.

Let's revisit the data from Example 13.5, which were problematic because the normality assumption was not met and transforming the data didn't fix the problem. In this example, we examined the relationship between female starvation treatment (female crickets were either starved or fed before the experiment) and the number of hours females waited before mating with a male ("time to mate"). (Remember that in this species the females sometimes munch on the males' wings during mating, so hunger may play a role in willingness to mate.) The frequency distribution of mating times was skewed in both treatments (see Figure 13.5-1), and it is useful to ask how experimental treatment affected the median time to mate. We can use the bootstrap to determine a confidence interval for the difference between the medians of the two populations (starved and fed females). Using the data in Table 13.5-1, we first calculate an estimate of the difference between the median mating times of starved and fed crickets. This estimate is

median_starved − median_fed = 13.0 − 22.8 = −9.8 hours.

To find a bootstrap confidence interval for the difference between the population medians, we resample the measurements of individual females separately for each group for each bootstrap replicate. There were n1 = 11 females in the starved treatment and n2 = 13 females in the fed group, so for each bootstrap replicate we resample with replacement 11 females from the starved group and 13 from the fed group. These sample sizes are probably too small to yield an accurate confidence limit for the difference in medians between the two populations. However, the small sample sizes make it easier to show how resampling works. For example, here is one bootstrap sample from these data:

Starved: 17.9, 9.6, 9.0, 9.0, 14.7, 21.7, 72.3, 2.1, 13.0, 9.0, 1.9
Fed: 39.0, 3.6, 2.4, 22.6, 1.7, 79.5, 1.5, 3.6, 54.4, 72.1, 22.8, 3.6, 22.6

Compare these values to the original data in Table 13.5-1. For each case, the values for hunger status and the time to mating in the bootstrap sample come from the same individual in the original data set. In each treatment, some of the original females were, by chance, sampled more than once in this bootstrap replicate (for example, the individual from the starved group with a time to mating of 9.0). Some of the original individuals were not sampled at all in this bootstrap sample (for example, the individual from the starved group with a time to mating of 3.8). The median time to mate in the starved treatment in this bootstrap sample was 9.6; the median in the fed treatment was 22.6. Using this bootstrap sample, we get a bootstrap replicate estimate for the difference between treatment medians equal to 9.6 − 22.6 = −13 hours. We repeated the steps of this bootstrap procedure a total of 1000 times. The frequency distribution of bootstrap replicate estimates is shown in Figure 19.2-3. The 2.5% quantile of this distribution is −62.5, and the 97.5% quantile is 12.2. The 95% confidence interval for the difference between population medians is therefore

−62.5 < difference between medians < 12.2.

FIGURE 19.2-3 The distribution of bootstrap replicate estimates of the difference in median time to mating, in hours, between starved and fed female crickets.

These data yield a fairly wide confidence interval for the difference in median time to mating between starved and fed female crickets.
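A sketch of this two-group resampling in Python follows. It is our illustration, not the authors' code, and the arrays are stand-ins: they contain the values identifiable from the text plus a few invented fillers (Table 13.5-1 is not reproduced in this chapter), chosen so that the group medians match the reported 13.0 and 22.8 hours. The resampling logic itself is exactly as described above: each group is resampled separately, keeping its original sample size.

```python
import numpy as np

rng = np.random.default_rng()
# Stand-in data consistent with the text (medians 13.0 and 22.8 hours);
# the actual measurements are in Table 13.5-1
starved = np.array([1.9, 2.1, 3.8, 9.0, 9.6, 13.0,
                    14.7, 17.9, 21.7, 24.0, 72.3])          # n1 = 11
fed = np.array([1.5, 1.7, 2.4, 3.6, 5.1, 22.6, 22.8,
                39.0, 45.2, 54.4, 72.1, 79.5, 88.2])        # n2 = 13

diffs = np.empty(1000)
for i in range(1000):
    # Resample each group separately, keeping each group's original size
    b_starved = rng.choice(starved, size=starved.size, replace=True)
    b_fed = rng.choice(fed, size=fed.size, replace=True)
    diffs[i] = np.median(b_starved) - np.median(b_fed)

# Percentile 95% confidence interval for the difference between medians
print(np.quantile(diffs, [0.025, 0.975]))
```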
Assumptions and limitations of the bootstrap

The main assumption of the bootstrap is that each sample is a random sample from its corresponding population. Additionally, each sample must be large enough that the frequency distribution of the measurements in the sample is a good approximation of the frequency distribution in the population. The larger the sample, the greater the resemblance between the frequency distribution of measurements in the sample and that in the population. Bootstrap analyses based on small samples will, on average, produce standard errors that are too small and confidence intervals that are too narrow, overestimating the precision of the estimate.

19.3 Summary

■ Simulation is a method for hypothesis testing in which a computer is used to mimic repeated sampling from an imaginary population whose properties conform to those stated in the null hypothesis. The frequency distribution of test statistics calculated on the simulated samples gives a null distribution of the test statistic.
■ The simulated null distribution of a test statistic is used to calculate P-values for hypothesis testing.
■ Bootstrapping is a method for calculating standard errors of estimates and confidence intervals for parameters. It uses resampling from the data to approximate the sampling distribution for an estimate.
■ The standard deviation of the bootstrap sampling distribution for an estimate is the bootstrap standard error of the estimate.
■ Bootstrap confidence intervals can be calculated from quantiles of the distribution of bootstrap replicate estimates.
■ Bootstrapping requires large sample sizes to generate reliable estimates of the sampling distribution.

PRACTICE PROBLEMS

1. The following are very small data sets of birth weights (in kg) of either singleton births or individuals born with a twin.
Singleton: 3.5, 2.7, 2.6, 4.4
Twin: 3.4, 4.2, 1.7
We are interested in the difference in mean weight between singleton babies and twin babies.
a. Construct a valid bootstrap sample from these data for this difference.
b. Assume that we wanted to estimate and test the difference in medians between these two groups. Would that change the way in which the bootstrap sample should be created?
2. Using the data from Practice Problem 1, state whether each of the following data sets (a through f) is a possible bootstrap replicate sample for use in determining a bootstrap confidence interval for the difference in mean birth weight. If not, explain why.

     Singletons               Twins
a.   3.5, 2.7, 2.6, 4.4      3.4, 4.2, 1.7
b.   3.4, 4.2, 1.7, 3.5      2.7, 2.6, 4.4
c.   2.7, 2.6, 4.4           3.4, 4.2, 1.7, 3.4
d.   3.5, 3.5, 3.5, 3.5      3.4, 3.4, 3.4
e.   3.8, 3.8, 3.8, 3.8      3.4, 4.2, 1.7
f.   3.5, 3.5, 4.4, 2.7      4.2, 1.7, 4.2

3. The following table lists 100 bootstrap replicate estimates of the mean length (in cm) of timber wolf jaws. (The values have been sorted into ascending order for your convenience.) Use the numbers to approximate a 90% confidence interval for the mean length.

10.15 10.17 10.17 10.18 10.20 10.20 10.20 10.20 10.21 10.21
10.21 10.21 10.22 10.22 10.23 10.23 10.23 10.23 10.23 10.24
10.24 10.24 10.24 10.24 10.25 10.25 10.25 10.25 10.25 10.26
10.26 10.26 10.26 10.26 10.26 10.27 10.27 10.27 10.27 10.27
10.27 10.27 10.28 10.28 10.28 10.28 10.28 10.29 10.29 10.29
10.29 10.29 10.29 10.30 10.30 10.30 10.30 10.31 10.31 10.31
10.31 10.32 10.32 10.32 10.32 10.33 10.33 10.33 10.34 10.34
10.34 10.35 10.35 10.35 10.35 10.35 10.35 10.35 10.36 10.36
10.36 10.37 10.37 10.37 10.37 10.37 10.38 10.38 10.38 10.38
10.38 10.39 10.40 10.40 10.40 10.40 10.41 10.41 10.44 10.48

4. Prairie voles are monogamous and social, whereas their close relatives, meadow voles, are polygamous and solitary. The species differ in their expression of a vasopressin receptor gene (V1aR) in their forebrains (expression measures the rate of protein production by a gene). The scientific hypothesis is that this receptor of vasopressin, an important neurotransmitter, might influence the voles' social behavior. Geneticists were able to experimentally increase the expression of the V1aR gene in a sample of meadow voles, and they compared the behavior of the resulting individuals with that of control individuals without excess V1aR (Lim et al. 2004). They measured the time each vole spent huddling with a partner, under the assumption that greater huddling time is indicative of a more social animal, as described in Assignment Problem 14 in Chapter 3. The following graph shows the bootstrap sampling distribution of the difference in median huddling time between the two groups, based on 1000 bootstrap replicates. Estimate by eye the bootstrap standard error of the estimated difference in median huddling time.

5. The "broken stick" model is often used in ecology to represent how species should divide up space or resources if these are divided randomly. For example, Waldron (2007) compared the geographic range sizes of closely related bird species in North America. For each of 65 pairs of bird species, he used maps to calculate the total range sizes of the two species (in km2), and then he calculated the ratio of the size of the smaller of the two ranges over the larger range size. This ratio can range from nearly zero, if one species of a pair has a much larger range than the other, to one, if both species have the same range sizes. He observed that the average ratio for all the pairs was 0.48. He then used simulation to test the null hypothesis that the mean of this ratio was no different from that expected if species divided the region "randomly" according to a randomly broken stick, with one species of every pair getting the short end and the other species the long end.
Using a computer, he took a stick of unit length, randomly broke it into two parts, and then calculated the ratio of the length of the smaller piece to the length of the larger piece. He did this 65 times, once for every pair of closely related species, and then took the average ratio. This process was repeated 10,000 times. The resulting 10,000 values for the ratio are shown in the accompanying graph.
a. What does this frequency distribution of simulated values for the ratio estimate?
b. Only 42 of the 10,000 simulated values for the ratio were greater than or equal to 0.48, the observed ratio in the bird data. Using this information, test the null hypothesis that the observed mean ratio of the bird range sizes is no different from that expected if the species divided the region randomly as a broken stick.

6. The most common use of the bootstrap in biology is to estimate the uncertainty in trees of evolutionary relationships (phylogenies) estimated from DNA sequence data. For example, the tree provided for this problem is a phylogeny of the carnivores, a group of mammals, based on their gene sequences (Flynn et al. 2000). The arrangement of branches on the tree indicates how species are related by descent. Pick any two named species on the right (representing the present time) and follow their branches backward in time (to the left) until the branches meet at a node representing their common ancestor. Species are closely related if they share a recent common ancestor. In this tree, for example, the cat is more closely related to the mongoose than either of these two species is to the wolf. The tree is not necessarily the true phylogeny, but an estimate based on the assumption that species sharing a recent common ancestor will have had less time to become different in their DNA sequences than more distantly related species. The bootstrap is used to calculate the uncertainty in the arrangement of the branches of the estimated evolutionary tree. To do this, the bootstrapping procedure resamples the DNA sequence data and recalculates a phylogenetic tree from each new bootstrap sample (we'll spare you the details). This is repeated a large number of times. The method then examines every bootstrap replicate tree and compares it to the tree estimated from the original data. Every node of the carnivore tree shown here gives a number between 0 and 100 that refers to the percentage of bootstrap replicate trees that have exactly the same set of named species descended from that node. For example, the number 100 on the branch leading to the two skunks means that 100% of the bootstrap replicate trees had a node just like it leading to the same two skunks. The number 54 leading to the trio of red panda (pictured on the first page of this chapter), striped skunk, and spotted skunk indicates that just 54% of the bootstrap replicate trees had a node containing only those same three species. The remaining 46% of the bootstrap replicate trees had different arrangements (e.g., the red panda might have been grouped with the weasel or raccoon).
a. What percentage of the bootstrap replicate trees showed the raccoon as the species most closely related to the kinkajou?
b. What percentage of the bootstrap replicate trees grouped the giant panda with the two bear species?
c. From the results of this bootstrap, which of the following statements can be made with greater confidence: that "red pandas are most closely related to skunks" or that "giant pandas are most closely related to bears"?
7. Use a six-sided die to generate 10 bootstrap replicates of the following data set, which has three numbers from each of two groups:9
Group A: 2.1, 4.5, 7.8
Group B: 8.9, 10.8, 12.4
a. Write out the bootstrap replicate data sets.
b. Use these 10 bootstrap replicates to get an approximate 80% confidence interval for the difference between the medians of Group A and Group B.

8. Suppose we want to calculate the bootstrap standard error for the difference between two groups in their medians. We would also like to do the same for the difference between the two groups in their interquartile ranges. Could we use the same procedure to generate the bootstrap replicate samples for both standard errors?

9. In a hypothetical study of a predatory fish, the bootstrap was used to help generate confidence intervals for the mean waiting time between meals. A waiting-time measurement was obtained for 79 fish in a random sample from the predatory fish population. The mean waiting time was 109 seconds. Ten thousand bootstrap replicates were used to calculate a sampling distribution for the estimate. The mean of the bootstrap replicates was 108.6 seconds, and the standard deviation of the bootstrap replicate estimates was 10.4 seconds. What is the bootstrap standard error of the estimated mean waiting time?

10. Assume that you have two moderately small samples, both drawn from separate populations that are normally distributed but with unequal variances. You want to compare the means of these two populations. What technique would you use to test the null hypothesis that the means are equal?

11. The snail Helix aspersa has an unusual mating system. Each individual is a hermaphrodite, producing both sperm and eggs, but the truly bizarre part is that these snails have evolved a "love dart" with which they try to stab their mate. (If you look closely at the photo of the snails, you can see the love dart protruding through the head of the lower snail.) This love dart is coated with a drug that enhances the amount of sperm that the stabbed snail receives and stores. This storage effect is expected to be greater when the recipient is smaller, because the relative dose would therefore be larger. Rogers and Chase (2001) measured the size of stabbed snails (i.e., the volume of their shells) and the number of stabber sperm that were stored. Is the size of the recipient snail correlated with the amount of sperm it stores? The data are listed in the table provided. The table also gives two computer-generated data sets. One of the computer-generated data sets was created by a bootstrap procedure that resampled the observed data, and the other was made by a permutation procedure (see Chapter 13).
a. Could "computer-generated data A" have been created for a bootstrap estimate? Describe how you determined this.
b. Could "computer-generated data B" have been created for a bootstrap estimate? Describe how you determined this.
c. What statistic should be used on these data to measure whether snail size is correlated with the number of sperm stored?
Original data
Shell volume (cm3)   Number of sperm stored
2.2                  2474
2.5                  2897
2.6                  2658
2.6                  2471
2.7                  2250
2.9                  2606
3.2                  2978
3.7                  2516
3.3                  2332
3.3                  2009
2.9                  1807
3.0                  1552
2.5                  1843
3.5                  1158
4.2                   440
6.0                   260
5.9                   788
4.0                  1138

Computer-generated data A
Shell volume (cm3)   Number of sperm stored
2.2                  1807
2.5                  2897
2.6                  1843
2.6                  2474
2.7                   260
2.9                   440
3.2                  2332
3.7                  2009
3.3                  2250
3.3                  1158
2.9                  2516
3.0                  2606
2.5                  1138
3.5                  2978
4.2                  2658
6.0                  1552
5.9                   788
4.0                  2471

Computer-generated data B
Shell volume (cm3)   Number of sperm stored
6.0                   260
3.7                  2516
2.5                  2897
3.3                  2009
3.3                  2009
2.5                  1843
3.0                  1552
2.5                  2897
4.0                  1138
2.5                  1843
2.7                  2250
4.2                   440
2.5                  2897
2.5                  2897
4.2                   440
3.7                  2516
2.7                  2250
3.3                  2009

ASSIGNMENT PROBLEMS

12. Using the data in Practice Problem 11 on shell volume and the number of sperm stored, we carried out a bootstrap analysis of the correlation coefficient r between the two variables. The cumulative frequency distribution provided here summarizes the 1000 bootstrap replicates of the estimate of the correlation coefficient. The correlation coefficient calculated from the original data was −0.78. From this graph, determine the approximate 95% bootstrap confidence interval for the correlation between shell volume and the number of sperm stored.

13. The surface of plant leaves is a thriving ecosystem, home to many microorganisms such as bacteria. These bacteria can affect the health of the plant, because some cause disease and some serve as ice-nucleation sites, thus causing frost damage. Hirano et al. (1982) were interested in the probability distribution of the number of bacteria on corn plant leaves. The frequency distribution of the number of bacteria on a random sample of single leaves is shown in the accompanying figure. The number of bacteria is calculated as the number per gram of leaf.
a. Imagine that we wished to estimate the mean number of bacteria per gram of leaf and the uncertainty of our estimate. Would a confidence interval for the mean based on the t-distribution be appropriate? Explain your answer.
b. Name two appropriate methods to calculate a confidence interval for the mean with these data.
c. The authors also measured the frequency distribution of the number of bacteria on soybean leaves. The distribution was shaped similarly to the one shown here for corn leaves. List two valid methods that could be used to test for a difference in the median numbers of bacteria on corn and soybean leaves.

14. In Example 13.1, we described the results of a study that measured the change in biomass in marine areas after an increase in protection. The distribution of biomass ratios (i.e., the biomass with protection divided by the biomass without protection) was highly skewed, so we performed a log transformation to overcome this problem when estimating the mean. Here we take a different approach, and we use the bootstrap to estimate the median biomass ratio. The distribution of bootstrap replicate estimates of the median is shown in the following histogram. The mean and standard deviation of this distribution are 1.484 and 0.134, respectively, and there are 1000 bootstrap replicates.
a. What does this frequency distribution estimate?
b. What is the bootstrap standard error of the median biomass ratio?
c. What is the 95% confidence interval for the median biomass ratio? Interpret the histogram to give an approximate answer.

15. Outliers can occur in a sample of data for several reasons, including instrument failure, measurement error, and real variation in the population.
The "trimmed mean" is a method developed for estimating a population mean when a measurement is prone to producing outliers. A trimmed mean is an ordinary sample mean calculated on data after the most extreme measurements have been dropped according to a percentile criterion. A 5% trimmed mean drops the measurements below the 5th percentile (0.05 quantile) and above the 95th percentile. In 1882, Simon Newcomb estimated the speed of light by measuring the time it took for light to return to his lab after being bounced off a mirror, a round trip of 7442 meters. The following is the frequency distribution of 66 measurements Newcomb made of this time, in microseconds (i.e., in millionths of a second) (Stigler 1977). The distribution includes outliers, and the trimmed mean is an objective way to increase the precision of the estimate. For the light data, the 5% trimmed mean drops the three smallest and three largest values, yielding the value 24.8274 for the mean of the remaining measurements. Bootstrapping is an excellent way to calculate the uncertainty of the trimmed mean. The results of 1000 bootstrap replicate estimates of the trimmed mean are shown in the following histogram. Of the bootstrap estimates, 2.5% were below 24.8254, 5% were below 24.8260, 95% were below 24.8285, and 97.5% were below 24.8288.
a. Calculate a 95% confidence interval for the trimmed mean.
b. The ordinary, "untrimmed" mean of the 66 data points, including the outliers, is 24.8262. Is the ordinary sample mean contained in the 95% confidence interval for the trimmed mean?

16. The variance-to-mean ratio is a useful measurement of how clumped or dispersed events are in space or time relative to the random expectation (see Section 8.6). A high ratio indicates that events are clumped, whereas a low ratio indicates "overdispersion" of events. Davis et al. (2009) used this approach to examine the dispersion of "compensatory mutations" affecting protein sequences. Most mutations to genes are harmful, but compensatory mutations occasionally occur that counteract some of the damage caused by other mutations. The authors gathered data on 77 harmful mutations of varying lengths and a total of 328 compensatory mutations that have been discovered in the same genes. These data come from organisms ranging from viruses to fruit flies. The authors recorded the number of compensatory mutations at each amino acid position in the genes. They calculated the mean and variance of the number of compensatory mutations per amino acid position, finding variance/mean = 2.64. They used simulation to test the null hypothesis that mutations were randomly and independently located in the genes. Compensatory mutations were placed at random and independently. After each simulation, the computer calculated the resulting variance-to-mean ratio in the number of compensatory mutations per amino acid position. The results of 10,000 such simulations are plotted in the accompanying histogram.
a. What does this frequency distribution estimate?
b. Using the results shown in the frequency distribution and the observed variance/mean ratio of 2.64, test the null hypothesis that the true variance/mean ratio is as expected from the random placement of compensatory mutations.

20 Likelihood

Biologists gather data in hopes of discovering the correct value of a population parameter among its many possible values. Consider, for example, the relationship between the metabolic rate of organisms and their body mass.
According to one theory, the rate should increase in proportion to body mass raised to the power of 3/4, whereas another theory predicts the power should be 2/3, instead (Savage et al. 2004). Which of these alternatives is correct? We must look to the data to find out. Likelihood measures how well alternative values of a parameter fit the data. The approach is based on the idea that the best choice of a parameter value, among all the possibilities, is the one with the highest likelihood—the one for which the data have the highest probability of occurring. If the probability of obtaining the data is much higher for one possible value of the parameter than for another, then the first is probably closer to the truth. Likelihood is a very general approach that can be applied to every type of problem encountered so far in this book, though we have not made much direct use of it until now. One advantage of likelihood techniques is that they can be applied to data that are not normally distributed. Here we introduce the concept of likelihood and identify some of its applications in biology. Our goals are to review the key features of the approach so that its uses in the scientific literature may be more readily understood.

What is likelihood?

Likelihood measures how well a set of data supports a particular value for a parameter. The likelihood of a specific value for a parameter is the probability of obtaining the observed data if the parameter were equal to that specific value. Using the likelihood method involves calculating the probability of obtaining the observed data for each possible value of the parameter and then comparing this probability between the different possible values.

Likelihood measures how well the data support a particular value for a parameter. It is the probability of obtaining the observed data if the parameter equaled that value.

The likelihood of a particular value tells us little by itself. Rather, the likelihood of a value gains meaning when compared with the likelihoods of other possible values. The likelihood should be relatively high for those values close to the true population parameter and relatively low for values that are far from it. The parameter value gaining greatest support among all possible values is called the maximum likelihood estimate. It is the value for which the probability of obtaining the observed data is highest. With the likelihood approach, the maximum likelihood estimate is our best estimate of the true value of the parameter.

The maximum likelihood estimate is the value of the parameter for which the likelihood is highest. It is our best estimate of the parameter.

Nearly all of the estimation techniques we have learned about in this book (for example, the mean and proportion) yield maximum likelihood estimates, although we haven't referred to them in that way.

Two uses of likelihood in biology

Here we showcase two of the most frequent uses of likelihood in biology—namely, phylogeny estimation and gene mapping. We don't present any of the computations or details. Our goal is to illustrate how likelihood is applied.

Phylogeny estimation

Thomas Henry Huxley1 was right when he said that either the chimpanzee or the gorilla was the closest living relative of the human species. But which ape is our closest cousin? At least three possible relationships between humans and other apes have been proposed, as shown by the tree diagrams in Figure 20.2-1.
According to the hypothesis represented by the leftmost tree, we share our most recent common ancestor with the chimpanzees (the bonobo is the pygmy chimp). According to the tree on the right, however, we are closest to the gorilla, instead. Finally, the tree in the middle represents the hypothesis that we stand apart from the rabble, equally related to chimp and gorilla, who are each other's closest relatives.

FIGURE 20.2-1 Three proposed trees of ancestor-descendant relationships between humans and the other great apes. The human branch and our shared ancestor with the other apes is highlighted. Numbers at the bottom are the likelihoods of each proposal based on gene sequence data (Rannala and Yang 1996). The likelihood of the leftmost tree is the highest at $e^{-10133}$, where e is the base of the natural logarithm.

Likelihood is frequently used to estimate trees of ancestor-descendant (phylogenetic) relationships using DNA sequence data. Rannala and Yang (1996) determined the likelihood of each of the three proposed trees in Figure 20.2-1 by using gene sequence data from five apes. The likelihood of a given tree refers to the probability of obtaining the observed gene sequences for the five species if that tree were the correct tree. This probability is based on a probability model of gene sequence evolution. In this model, the gene sequences for any pair of species were identical at the moment they split from their most recent common ancestor, after which each species accumulated differences gradually and randomly over time. Under this model, the most closely related species today should have the most similar gene sequences, and the most distantly related species should have the most different sequences. The tree on the left in Figure 20.2-1 had the highest likelihood, making it the maximum likelihood estimate. Humans are most closely related to the chimps, justifying our nickname, the "third chimpanzee" (Diamond 1992).

Gene mapping

Huge efforts are under way to find the genes that underlie inherited forms of human diseases. One approach tests whether genetic "markers" in the human genome differ between individuals having the disease of interest and those not having the disease. A marker is a unique, easily identifiable site in the genome whose gene sequence ("state") tends to vary among individuals in the population. Many markers throughout the human genome are available for gene-mapping studies. Hamshere et al. (2005) used this approach to find a gene responsible for human schizoaffective disorder. This debilitating mental illness is known to run in families, and it is therefore likely to have a genetic component. The data were the marker states of a sample of individuals afflicted with the disease and their healthy family members. If a gene involved with the disease were present, then the states of markers closest to this gene should show differences in frequency between diseased and healthy individuals. Their approach is illustrated in Figure 20.2-2 with results from chromosome 1. At each marker along the chromosome, the researchers calculated two likelihoods, one for each of two hypotheses. One hypothesis was that a gene increasing risk of schizoaffective disorder was really present at that site. The second hypothesis was the null hypothesis that no gene affecting the disorder was present at that site.
The likelihood in each case is the probability of obtaining the observed data (the differences between healthy and diseased individuals in the frequency of marker states) if the given hypothesis is correct. The log of the ratio of these two likelihoods (the first likelihood divided by the second likelihood) measures the strength of evidence that a gene is located at that site.2 This procedure was repeated at every marker along the chromosome, yielding the curve shown in Figure 20.2-2.

FIGURE 20.2-2 Evidence for a gene affecting schizoaffective disorder on human chromosome 1. Data from Hamshere et al. (2005), with permission.

The highest point on the curve is located approximately at position 260, which indicates where, along chromosome 1, a gene for schizoaffective disorder is most likely to be present.

Maximum likelihood estimation

Likelihood is a probability. The likelihood L of a particular value for a parameter is the probability of obtaining the data if the parameter equaled that value. Using a vertical bar to represent given or conditional upon (see Chapter 5), this definition of L can be written as

$L[\text{value} \mid \text{data}] = \Pr[\text{data} \mid \text{parameter} = \text{value}]$.

Maximum likelihood estimation involves finding the parameter value that has the largest L. Here we show how this is accomplished using Example 20.3, which comes from a study that investigated a population proportion.

EXAMPLE 20.3 Unruly passengers

The tiny wasp, Trichogramma brassicae, parasitizes eggs of the cabbage white butterfly, Pieris brassicae. The wasp rides on a female butterfly (the arrow in the photo at the right points to a small wasp on a butterfly's leg). When the butterfly lays her eggs, the wasp climbs down and parasitizes the freshly laid eggs. Fatouros et al. (2005) tested whether the wasps could distinguish mated female butterflies (with fertilized eggs) from unmated females. They carried out a series of trials in which a single wasp was presented simultaneously with two female butterflies, one of them a virgin female and the other recently mated. Of the 32 wasps that rode on females, 23 chose the mated female, whereas nine chose the unmated female. We can use these data to provide an interval estimate of the population proportion. You learned in Chapter 7 how to estimate a population proportion and to calculate its confidence interval. The probability model is relatively simple for this case, however, so we can use it to demonstrate how to estimate a parameter using maximum likelihood.

Probability model

Maximum likelihood estimation requires a probability model that specifies the probabilities of different outcomes of the data-sampling process depending on the parameter being estimated. In the wasp study (Example 20.3), the outcome measured was the number of wasps that chose the mated female butterfly. The parameter of interest is p, the unknown proportion of wasps in the population that would choose the mated female. Let's assume that each of the n wasps tested represented a single random trial and that separate trials were independent. Under these conditions, the number of wasps choosing the mated female should follow a binomial distribution with the probability of "success" in any one trial equal to p. In this case, the probability that exactly Y wasps choose the mated female depends on p as follows:

$\Pr[Y \text{ choose mated} \mid p] = \binom{n}{Y} p^{Y} (1-p)^{n-Y}$.

This formula for the binomial distribution3 was first introduced in Chapter 7, except that here we have written the probability of Y successes as conditional on the unknown p.
We use this formulation because we need to vary p to see how this affects the calculated probability of obtaining the observed data.

The likelihood formula

We use the symbols L[p | Y choose mated] to indicate "the likelihood of a particular value of p, given that Y wasps chose the mated female." This likelihood is defined as "the probability that Y wasps choose the mated female in n trials, given that the population parameter equals the particular value for p." Thus, the formula for the likelihood is

$L[p \mid Y \text{ choose mated}] = \Pr[Y \text{ choose mated} \mid p] = \binom{n}{Y} p^{Y} (1-p)^{n-Y}$.

To calculate the likelihood of a specific value for p, set Y equal to 23, the observed number of wasps choosing mated females. Also, plug in the total number of trials, n = 32, yielding

$L[p \mid 23 \text{ choose mated}] = \binom{32}{23} p^{23} (1-p)^{9}$.

The likelihood of p = 0.5 is

$L[p = 0.5 \mid 23 \text{ choose mated}] = \binom{32}{23} (0.5)^{23} (1-0.5)^{9} = 0.00653$.

This likelihood represents the support for the possibility that exactly half of the wasps in the population would choose the mated female, given that 23 of 32 sampled wasps did so. We cannot interpret this number in isolation, however, because likelihoods are informative only when compared with the likelihoods of other values for the parameter. It is usually easier to work with the log of the likelihood rather than the likelihood itself.4 The log-likelihood is the natural log of the likelihood. The formula for the log-likelihood of p, given the observed Y, is

$\ln L[p \mid Y \text{ choose mated}] = \ln\binom{n}{Y} + Y \ln[p] + (n-Y) \ln[1-p]$.

The log-likelihood of a value for the population parameter is the natural log of its likelihood.

To understand how to use this formula, plug in the values n = 32 and Y = 23. The log-likelihood of p = 0.5, given the data, is

$\ln L[p = 0.5 \mid 23 \text{ choose mated}] = \ln\binom{32}{23} + 23 \ln[0.5] + 9 \ln[1-0.5] = -5.03125$.

This value is the same as the natural log of L[0.5 | 23 choose mated] = 0.00653 calculated previously (except for rounding errors).

The maximum likelihood estimate

The maximum likelihood estimate of a parameter is the specific value having the highest likelihood, given the data. The value of the parameter that maximizes the likelihood is also the one that maximizes the log-likelihood, so we can work with the log-likelihood to find the maximum likelihood estimate. A straightforward way to find the maximum likelihood estimate is to use the computer to calculate the log-likelihood across the range of possible parameter values and then pick the highest one. This is the approach we use here. Alternatively, we could use calculus to find the maximum. The work of finding the maximum likelihood value for p is much reduced by using a program on the computer rather than your calculator. In either case, start by evaluating the log-likelihood for several values of p over a broad range to see its overall shape. In Table 20.3-1, for example, we calculate the log-likelihood of values of p between 0.1 and 0.9 in increments of 0.1. You can see that the log-likelihood is highest when the proportion p is about 0.7 and that it declines at larger and smaller values.

TABLE 20.3-1 The log-likelihood calculated for a range of values of p using a spreadsheet program on the computer.

Proportion p   Log-likelihood
0.1            −36.758
0.2            −21.876
0.3            −13.752
0.4             −8.523
0.5             −5.031
0.6             −2.846
0.7             −1.890
0.8             −2.468
0.9             −5.997

Next, narrow the search by using a finer sequence of values for p. Table 20.3-2 shows the results for values of p between 0.52 and 0.88 in increments of 0.01. The log-likelihood reaches its maximum when p is 0.72.
This value is therefore the maximum likelihood estimate.

TABLE 20.3-2 The log-likelihood calculated for a narrower range of values for p. The log-likelihood is maximized at p̂ = 0.72. The third column gives the difference between each log-likelihood and the maximum log-likelihood.

Proportion, p    Log-likelihood    Log-likelihood − maximum
0.52             −4.497            −2.634
0.53             −4.248            −2.385
0.54             −4.012            −2.149
0.55             −3.787            −1.924
0.56             −3.575            −1.712
0.57             −3.375            −1.512
0.58             −3.187            −1.324
0.59             −3.010            −1.147
0.60             −2.846            −0.983
0.61             −2.694            −0.831
0.62             −2.554            −0.691
0.63             −2.426            −0.563
0.64             −2.310            −0.447
0.65             −2.207            −0.344
0.66             −2.117            −0.254
0.67             −2.039            −0.176
0.68             −1.976            −0.113
0.69             −1.926            −0.063
0.70             −1.890            −0.027
0.71             −1.869            −0.006
0.72             −1.863            0.000
0.73             −1.873            −0.010
0.74             −1.900            −0.037
0.75             −1.944            −0.081
0.76             −2.007            −0.144
0.77             −2.089            −0.226
0.78             −2.192            −0.329
0.79             −2.318            −0.455
0.80             −2.468            −0.605
0.81             −2.644            −0.781
0.82             −2.848            −0.985
0.83             −3.084            −1.221
0.84             −3.354            −1.491
0.85             −3.663            −1.800
0.86             −4.014            −2.151
0.87             −4.416            −2.553
0.88             −4.873            −3.010

In Figure 20.3-1, we plot the log-likelihood of all possible values of p between 0.4 and 0.9. The resulting curve is called the log-likelihood curve. The maximum5 of the log-likelihood curve occurs at the value p̂ = 0.72, so this is the maximum likelihood estimate of the population proportion. We give this value the symbol p̂ to indicate that 0.72 is our maximum likelihood estimate.6

FIGURE 20.3-1 The log-likelihood curve for p, the estimated proportion of wasps choosing the mated female butterfly. The log-likelihood is maximized at p̂ = 0.72.

Likelihood-based confidence intervals

The log-likelihood curve also allows us to calculate a confidence interval for the population parameter. The range of values for p whose log-likelihood lies within $\chi^2_{\alpha,1}/2$ units of the maximum constitutes the 1 − α likelihood-based confidence interval (Meeker and Escobar 1995). For example, an approximate 95% confidence interval for p is the range of values whose log-likelihood lies within 1.92 units of the maximum, since $\chi^2_{0.05,1}/2 = 3.84/2 = 1.92$. The 95% confidence interval is therefore determined directly from the log-likelihood curve. Figure 20.3-2 shows that the limits of the 95% confidence interval for the proportion p are 0.55 < p < 0.86. A table of calculation results (e.g., Table 20.3-2) can also help to find these limits.

The likelihood-based confidence interval for a population parameter is calculated directly from the log-likelihood curve.

FIGURE 20.3-2 The likelihood-based 95% confidence interval for p. The top horizontal line indicates the highest log-likelihood (−1.863). The line immediately below corresponds to 1.92 units below the maximum. The 95% confidence interval, indicated by the red lines, is the range of values for the parameter whose log-likelihoods fall within 1.92 units of the maximum. This interval ranges from 0.55 to 0.86 for the wasp-butterfly data.

Based on these data, we conclude that the proportion of wasps in the population that would choose the mated female butterfly lies between 0.55 and 0.86 (very similar to the confidence interval we could calculate using the Agresti-Coull method of Section 7.3). This most-plausible range for the population proportion is relatively broad and includes relatively weak preference values (but still greater than 0.5) as well as relatively strong preference values. A larger sample size would be needed to narrow the range further.
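The grid search behind Tables 20.3-1 and 20.3-2 and the confidence-interval rule above take only a few lines of code. The following sketch is ours (in Python rather than a spreadsheet), and the grid spacing of 0.001 is our choice; Python 3.8 or later is assumed for math.comb.

```python
# A minimal sketch of the grid search above, in Python rather than a spreadsheet.
from math import comb, log

n, Y = 32, 23  # number of trials and number of wasps choosing the mated female

def log_likelihood(p):
    # ln L[p | Y choose mated] = ln C(n, Y) + Y ln(p) + (n - Y) ln(1 - p)
    return log(comb(n, Y)) + Y * log(p) + (n - Y) * log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]   # candidate values of p
logL = [log_likelihood(p) for p in grid]
maxL = max(logL)
p_hat = grid[logL.index(maxL)]              # about 0.719; calculus gives Y/n = 23/32

# Likelihood-based 95% CI: all p within 1.92 log-likelihood units of the maximum.
inside = [p for p, ll in zip(grid, logL) if ll >= maxL - 1.92]
print(p_hat, maxL)                          # about 0.719 and -1.863
print(min(inside), max(inside))             # compare the limits in Figure 20.3-2
```

On this finer grid the estimate lands at about 0.719, which rounds to the value p̂ = 0.72 found in Table 20.3-2.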
This confidence-interval method based on the likelihood curve is more accurate than other methods used to generate confidence intervals for maximum likelihood estimates (Meeker and Escobar 1995). It also requires no formula other than the one needed to calculate the log-likelihood curve.

Versatility of maximum likelihood estimation

The great advantage of maximum likelihood estimation is its versatility. It can be applied to almost any situation in which it is possible to write a model describing the probability of different outcomes. Using Example 20.4, we demonstrate a problem less familiar than estimating a proportion, one in which likelihood methods have proved invaluable.

EXAMPLE 20.4 Conservation scoop

Counting elephants is more challenging than you might think, at least when they live in dense forest and feed at night. Eggert et al. (2003) used "capture-recapture" to estimate the total number of forest elephants inhabiting Kakum National Park in Ghana without having to see a single one. The researchers spent about two weeks in the park collecting elephant dung, from which they were able to extract pure elephant DNA. Using five genes, the researchers generated a unique DNA fingerprint for every elephant "encountered" via dung deposits. By using this fingerprint method, they identified 27 elephant individuals over the first seven days of sampling. We call this the first sample, and refer to these 27 elephants as "marked." Over the next eight days, they sampled 74 individuals, of which 15 had already been detected in the first sample. We'll refer to these 15 elephants as "recaptured." Based on the number of recaptures in the second sample, what is the total population size of elephants in the park?

This kind of data can be used to estimate population size because the first sample tells us how many have been marked, and the second sample tells us the proportion of the total population that has been marked. By dividing the number marked by the proportion marked, we can estimate the total number of individuals in the population, N, which is the parameter of interest.

Probability model

The simplest probability model for population size estimation makes the following assumptions:
■ The population of elephants in the park is constant—there were no births, deaths, immigrants, or emigrants while the study was being carried out.
■ Random sampling—the dung of every elephant, marked or unmarked, had an equal and independent chance of being sampled.

Our data are the number of "recaptures" in the second sample—the individuals that had already been marked in the first sample. We know that 15 individuals were recaptured in the data, but, if we could go back and take another "second" sample of 74 individuals under identical conditions, we would probably obtain a different value for the number recaptured. In other words, the number recaptured has a probability distribution that depends on how many individuals there are in the population (N) and on the sample sizes. To use the likelihood approach, we need to know the probability of obtaining the observed number of recaptures for different possible values of the parameter N.

Finding this probability can be one of the biggest challenges of using the likelihood method. Often, though, someone has already investigated similar data and has published the needed formula in the scientific literature. For mark-recapture data, the formula is already available—we found it in Gazey and Staley (1986).
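Before turning to the likelihood calculation, the back-of-envelope estimate described above (divide the number marked by the proportion of the second sample that was marked) is worth checking; this tiny sketch is ours, not part of the published analysis.

```python
# Rough population estimate: number marked / proportion marked.
n1, n2, Y = 27, 74, 15              # marked, second sample size, recaptures
proportion_marked = Y / n2          # about 0.20 of the second sample was marked
print(n1 / proportion_marked)       # about 133 elephants
```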
Use n₁ to represent the number of individuals captured and marked in the first seven days of study. In this case, n₁ = 27. Let n₂ equal the size of the second sample of individuals obtained over the next eight days. In this case, n₂ = 74. The probability of Y recaptures, given the population size N, is7

$$\Pr[\text{number recaptures} = Y \mid N] = \frac{\binom{n_1}{Y}\binom{N-n_1}{n_2-Y}}{\binom{N}{n_2}}.$$

The term on the right of the equal sign represents the proportion of all possible random samples of n₂ elephants that include exactly Y recaptures (previously marked) and n₂ − Y new individuals (not previously marked).

The likelihood formula

From this point on, the steps are the same as those we followed in Example 20.3 when estimating a proportion. Let L[N | Y recaptures] indicate the likelihood of a particular value for N, given that there were Y recaptures. The likelihood is defined as the probability of obtaining Y recaptures if N equals the specified value—namely, Pr[number recaptures = Y | N]. Thus, the formula for the likelihood is

$$L[N \mid Y \text{ recaptures}] = \frac{\binom{n_1}{Y}\binom{N-n_1}{n_2-Y}}{\binom{N}{n_2}}.$$

It is easier to work with the log-likelihood rather than the likelihood itself:

$$\ln L[N \mid Y \text{ recaptures}] = \ln\!\binom{n_1}{Y} + \ln\!\binom{N-n_1}{n_2-Y} - \ln\!\binom{N}{n_2}.$$

To calculate the log-likelihood of a specific value of N, substitute Y = 15 into the preceding equation. Plugging in the fixed sample sizes, n₁ = 27 and n₂ = 74, as well, yields

$$\ln L[N \mid 15 \text{ recaptures}] = \ln\!\binom{27}{15} + \ln\!\binom{N-27}{74-15} - \ln\!\binom{N}{74}.$$

We are now ready to calculate the log-likelihood for specific values of N.

The maximum likelihood estimate

We plugged the formula for the log-likelihood into a computer program and calculated the log-likelihood of all possible values of N between 90 and 250. Try this yourself to see if you obtain the same answers we did. Our resulting log-likelihood curve is shown in Figure 20.4-1. The maximum of the log-likelihood curve occurs at the value N̂ = 133, so this is the maximum likelihood estimate of the elephant population size.

FIGURE 20.4-1 The log-likelihood curve for N, the total number of elephants in Kakum National Park in Ghana. The log-likelihood is maximized at N̂ = 133. The values 104 and 193 are the limits of the likelihood-based 95% confidence interval for N.

We also used the computer to determine the likelihood-based 95% confidence interval for N. This confidence interval includes all parameter values whose log-likelihood is within $\chi^2_{0.05,1}/2 = 3.84/2 = 1.92$ units of the maximum log-likelihood. These limits occurred at the N values 104 and 193, as shown in Figure 20.4-1. Based on these data, the population size of elephants in Kakum National Park, Ghana, is likely to be in the range 104 < N < 193. Such a broad interval estimate is not unusual for capture-recapture methods.

Bias

The maximum likelihood estimate should be relatively close to the true population parameter, compared with other values having lower likelihood. Nevertheless, maximum likelihood estimates are often biased. On average, that is, a maximum likelihood estimate tends to fall to one side of the population parameter that it is intended to estimate. For example, the maximum likelihood method for estimating population size using capture-recapture techniques tends to underestimate population size, on average, even when all the assumptions are met (Krebs 1999). In some cases, corrections are known that will compensate for bias. The bias is usually small, however, compared with the breadth of the 95% confidence interval. Bias diminishes as sample size increases (Edwards 1992).
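The "try this yourself" computation suggested above fits in a few lines. Here is one way to do it (our sketch, not taken from Gazey and Staley 1986); only the standard library is needed.

```python
# A sketch of the mark-recapture grid search behind Figure 20.4-1.
from math import comb, log

n1, n2, Y = 27, 74, 15   # marked in first sample, second sample size, recaptures

def log_likelihood(N):
    # ln L[N | Y recaptures] = ln C(n1, Y) + ln C(N - n1, n2 - Y) - ln C(N, n2)
    return log(comb(n1, Y)) + log(comb(N - n1, n2 - Y)) - log(comb(N, n2))

candidates = range(90, 251)            # N cannot be below n1 + (n2 - Y) = 86
logL = {N: log_likelihood(N) for N in candidates}
N_hat = max(logL, key=logL.get)        # maximum likelihood estimate of N

# Likelihood-based 95% CI: all N within 1.92 log-likelihood units of the maximum.
ci = [N for N in candidates if logL[N] >= logL[N_hat] - 1.92]
print(N_hat, min(ci), max(ci))         # compare Figure 20.4-1: 133, 104, and 193
```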
Log-likelihood ratio test

The log-likelihood ratio test uses likelihood to compare how well two probability models fit the data. In one of the models, the parameter or parameters of interest are constrained to match the value specified by a null hypothesis. The constraint is relaxed in the second probability model, which corresponds to the alternative hypothesis. This second model is fit using parameter values that best match the data. If the likelihood of the second model, in which the parameters best match the data, is significantly higher than the likelihood of the first model, in which the parameters are constrained to match H0, then we reject the null hypothesis. Importantly, one of the hypotheses has to be a special case of the other; for example, a parameter that is allowed to vary in one hypothesis is set to a specific value in the other.

In this section, we introduce the log-likelihood ratio test. We analyze the wasp choice data from Example 20.3 (Fatouros et al. 2005) and test the null hypothesis that the population proportion p is 0.5. We use the binomial distribution to calculate the log-likelihood of p = 0.5, and we compare it with the log-likelihood of the alternative model, in which the parameter p is set to the maximum likelihood estimate.

Likelihood ratio test statistic

The test statistic for the likelihood ratio test, G, is twice the natural log of the ratio of the likelihoods of the two hypotheses. For the simplest case of only a single parameter,

$$G = 2\ln\!\left(\frac{L[\text{maximum likelihood value of parameter} \mid \text{data}]}{L[\text{parameter value under } H_0 \mid \text{data}]}\right).$$

The formula is similar when there is more than one parameter to be estimated. The remarkable feature of the log-likelihood ratio is that, if H0 is true, then G is approximately χ² distributed. This means that we can use the χ² distribution to calculate a P-value and thus decide whether the null hypothesis should be rejected. The approximation is reliable only when the sample sizes are large.8 The degrees of freedom for G equal the difference between the two hypotheses in the number of parameters that require estimation using the data. In the case of only a single parameter, there is one degree of freedom.

Testing a population proportion

In the wasp experiment of Example 20.3, more wasps chose the mated female butterfly (23) than the unmated female (9). Is this by itself evidence that wasps in the population prefer mated female butterflies? We will use the log-likelihood ratio test to answer this question. The hypotheses are as follows.

H0: Wasps choose mated and unmated females with equal probability (p = 0.5).
HA: Wasps prefer one type of female over the other (p ≠ 0.5).

We could use the binomial test (Chapter 7) or the χ² goodness-of-fit test (Chapter 8) to test these hypotheses. The probability models are relatively simple for this case, though, so we will use the example to demonstrate the log-likelihood ratio test. This is a two-tailed test, because a preference for either the mated or the unmated female is possible if the null hypothesis is false.

Our test statistic is the log-likelihood ratio,

$$G = 2\ln\!\left(\frac{L[\hat{p} \mid Y \text{ chose mated female}]}{L[p_0 \mid Y \text{ chose mated female}]}\right),$$

where L[p₀ | Y chose mated female] is the likelihood of p₀, the value of p specified by the null hypothesis, and L[p̂ | Y chose mated female] is the likelihood of p̂, the maximum likelihood estimate of p. It is generally easiest to work with log-likelihoods, so we rewrite the previous formula as

$$G = 2\left(\ln L[\hat{p} \mid Y \text{ chose mated female}] - \ln L[p_0 \mid Y \text{ chose mated female}]\right).$$
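The next paragraphs work through this calculation by hand. The same steps in code look like the following sketch (ours; it assumes SciPy is available for the χ² tail probability).

```python
# A sketch of the log-likelihood ratio test for the wasp data.
from math import comb, log
from scipy.stats import chi2         # assumes SciPy is installed

n, Y, p0 = 32, 23, 0.5

def log_likelihood(p):
    return log(comb(n, Y)) + Y * log(p) + (n - Y) * log(1 - p)

p_hat = Y / n                        # maximum likelihood estimate of p
G = 2 * (log_likelihood(p_hat) - log_likelihood(p0))
P = chi2.sf(G, df=1)                 # upper tail of the chi-square with df = 1
print(G, P)                          # about 6.3 and 0.012
```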
Under the null hypothesis, p₀ = 0.5, and in the data Y = 23 wasps chose the mated butterflies. Plugging these values into the likelihood formula for p, we get

$$\ln L[p_0 \mid 23 \text{ chose mated female}] = \ln\!\left[\binom{32}{23}(0.5)^{23}(1-0.5)^{9}\right] = \ln\!\binom{32}{23} + 23\ln(0.5) + 9\ln(0.5) = -5.031.$$

Under the alternative hypothesis, p is unknown and is estimated using maximum likelihood. We already determined in Section 20.3 that the maximum likelihood estimate is p̂ = 0.72. The corresponding log-likelihood of this p̂ was found to be −1.863 (Figure 20.3-2). We now have what we need to calculate the test statistic:

$$G = 2\left(\ln L[\hat{p} \mid 23 \text{ chose mated female}] - \ln L[p_0 \mid 23 \text{ chose mated female}]\right) = 2[-1.863 - (-5.031)] = 6.336.$$

Under the null hypothesis, the test statistic G is approximately χ² distributed. The difference between the two hypotheses in the number of parameters needing estimation from the data was one. The test statistic G therefore has df = 1. The critical value for the χ² distribution having one degree of freedom and α = 0.05 is available in Statistical Table A: $\chi^2_{0.05,1} = 3.84$. Because G > 3.84, P < 0.05. Using a computer to calculate the exact probability for the χ² distribution, we find that P = 0.012. We reject the null hypothesis based on these data. The wasps indeed prefer mated female butterflies over unmated females.9

20.6 Summary

■ Likelihood measures the level of support provided by data for a particular value of a population parameter. The likelihood of a specified value is the probability of obtaining the observed data if the parameter equaled that value.
■ The maximum likelihood estimate of a parameter is the value having highest support among all possible values. It is the parameter value for which the probability of obtaining the observed data (the likelihood) is highest.
■ Maximum likelihood estimation depends on a probability model that specifies the probabilities of different outcomes from the process that produced the data.
■ The log-likelihood is the natural log of the likelihood. The hypothesis that maximizes the log-likelihood is the same as the one maximizing the likelihood.
■ The log-likelihood curve describes the log-likelihood of a range of values of the parameter. The maximum likelihood estimate can be found from the value that gives the highest point on the curve.
■ The likelihood-based confidence interval for a single parameter is the range of values of the parameter whose log-likelihoods lie within $\chi^2_{\alpha,1}/2$ units of the maximum. The range for an approximate 95% confidence interval includes all values for the parameter whose log-likelihoods are within 3.841/2 = 1.92 units of the maximum.
■ The log-likelihood ratio test uses likelihood to compare the fits of two probability models to the data. In the case of a single parameter, one of the models constrains the parameter to equal the value specified by a null hypothesis. The alternative probability model is fit using the maximum likelihood estimate of the parameter.

Quick Formula Summary

Likelihood

Formula: L[value | data] = Pr[data | parameter = value].

Likelihood-based confidence interval for a single parameter

What is it for? To obtain a confidence interval estimate for a parameter.
What does it assume? Data are randomly sampled from a population. The specific assumptions of the probability model are correct.
Formula: Obtained directly from the log-likelihood curve. To do so, find the range of parameter values whose log-likelihood lies within $\chi^2_{\alpha,1}/2$ units of the maximum.

Log-likelihood ratio test for a single parameter

What is it for?
To compare the fit of two probability models to data using likelihood. In one model, the parameter is fixed at the value specified by the null hypothesis. The other model is fit using the maximum likelihood estimate of the parameter.
What does it assume? Data are randomly sampled from a population. The specific assumptions of the probability model are correct.
Test statistic: G
Distribution under H0: Approximately a χ² distribution with degrees of freedom equal to the difference between the two hypotheses in the number of parameters requiring estimation from the data.
Formula:

$$G = 2\ln\!\left(\frac{L[\text{maximum likelihood value of parameter} \mid \text{data}]}{L[\text{parameter value under } H_0 \mid \text{data}]}\right).$$

PRACTICE PROBLEMS

1. Albino animals are well known, but what about plants? Apiron and Zohary (1961) found a recessive mutation in orchard grass (Dactylis glomerata) that causes a chlorophyll deficiency. Individuals with two deficient copies of the gene lack the green pigment entirely (and die), but heterozygous individuals, which have one normal copy of the gene and one deficient copy, appear to produce normal levels of chlorophyll. Using progeny testing, the authors found that 12 of 47 plants sampled from the Matzleva Valley in Israel were heterozygotes. Assume that the plants were randomly sampled.
a. What probability distribution should we use to calculate the probability of obtaining a specific number of heterozygous individuals in a random sample of size 47, given the population proportion of heterozygotes, p?
b. Write the formula for the likelihood of p, given the data. What does the likelihood measure?
c. Write the formula for the log-likelihood of p.
d. Calculate the log-likelihood of the value p = 0.50, given the data.

2. Refer to Practice Problem 1. The accompanying table lists log-likelihoods of values of p between 0.1 and 0.4, in increments of 0.02.
a. Using only the table calculations, identify the maximum likelihood estimate of p to within 0.02 units.
b. Using these same calculations, generate a likelihood-based 95% confidence interval for p.

Proportion, p    Log-likelihood    Log-likelihood − maximum
0.10             −6.639            −4.615
0.12             −5.238            −3.214
0.14             −4.193            −2.169
0.16             −3.414            −1.390
0.18             −2.844            −0.820
0.20             −2.444            −0.420
0.22             −2.186            −0.162
0.24             −2.051            −0.027
0.26             −2.024            0.000
0.28             −2.094            −0.070
0.30             −2.252            −0.228
0.32             −2.492            −0.468
0.34             −2.809            −0.785
0.36             −3.201            −1.177
0.38             −3.663            −1.639
0.40             −4.195            −2.171

3. Noonan et al. (2006) used DNA sequence data and the method of maximum likelihood to add the extinct Neanderthal species to the tree of humans and other apes (see Figure 20.2-1). The accompanying graph is a likelihood curve10 for the time of the split between humans and Neanderthals, which is more recent than that between humans and the other living apes.
a. What does the likelihood curve measure in this example?
b. With the aid of a ruler, use the likelihood curve to find the maximum likelihood estimate of divergence time.
c. Using this same curve, approximate the likelihood-based 95% confidence interval for divergence time.
d. What does the interval in part (c) measure? Explain.

4. Yashin et al. (2000) compared a sample of 197 centenarians (people 100 years old or older) with a group of 465 younger people (aged between 5 and 80) to examine whether the two groups differed in the frequencies of genetic markers. As part of an attempt to estimate which genetic markers are associated with mortality, the researchers tested for the presence of "hidden heterogeneity," intrinsic differences in mortality rate among individuals within groups.
To accomplish this, they generated a likelihood model for a heterogeneity parameter, σ², and calculated the log-likelihood for all values of σ² between 0 and 0.8, given the data. Their methods are too complicated to describe fully here, but their results are summarized in the accompanying graph. The maximum likelihood estimate for σ² was 0.66. The null and alternative hypotheses are as follows.

H0: Heterogeneity in mortality is absent (σ² = 0).
HA: Heterogeneity in mortality is present (σ² ≠ 0).

a. Using the log-likelihood curve, find the approximate value of the log-likelihood under the null hypothesis that heterogeneity is zero. Assume that the probability model used depends only on the single parameter shown.
b. With the aid of a ruler, use the log-likelihood curve to find the log-likelihood corresponding to the maximum likelihood estimate.
c. Using your results from parts (a) and (b), calculate the test statistic for a log-likelihood ratio test.
d. Under H0, what is the approximate null distribution for the log-likelihood ratio statistic?
e. Using your results from parts (a)−(d), carry out the log-likelihood ratio test. Report your conclusion.

5. "Have you ever stolen something worth more than $10?" Anyone asked this question in a survey might be reluctant to answer truthfully, especially if he or she did not wish to make known a past misbehavior. The respondent might be more willing to tell the truth if the questioner doesn't know which of two questions, randomly chosen, the respondent is answering (Warner 1965). This approach was used to estimate the true fraction of thieves among the third-year biology undergraduate population on a university campus in 2006. A total of 185 students participated. Each student was instructed to flip a coin and conceal the outcome. He or she was to respond with a yes if the outcome was heads. If the outcome was tails, the student was to answer the theft question truthfully with a yes or a no. The result: 113 of the 185 students responded with a yes, whereas the remaining 72 answered no. Assume that students answered truthfully and independently, that the probability of heads was 0.5, and that the sample of students was a random sample. Use likelihood to estimate the fraction of thieves in the student population.
a. Construct a probability tree (Chapter 5) to show that the probability of a student answering yes is (1 + s)/2, where s is the fraction of thieves in the population.
b. If the assumptions are met, the number of yes answers in the survey of n students, Y, should follow a binomial distribution with the probability of a yes equal to (1 + s)/2:

$$\Pr[Y \text{ yeses} \mid s] = \binom{n}{Y}\left(\frac{1+s}{2}\right)^{Y}\left(\frac{1-s}{2}\right)^{n-Y}.$$

Write the formula for the likelihood of s, given the data.
c. Write the formula for the log-likelihood of s, given the data.
d. Calculate the log-likelihood for the hypothesis that the fraction of thieves s is zero.

6. Refer to Practice Problem 5.
a. Using a spreadsheet or other program, calculate the log-likelihood of values of s between 0.0 and 0.5 in increments of 0.01. Using this information, determine the maximum likelihood estimate of the fraction of thieves.
b. Using the same approach as in part (a), calculate the likelihood-based 95% confidence interval for the parameter s.
c. Are there truly thieves among us? Using the values for the likelihood calculated in your answers to Practice Problem 5, use the log-likelihood ratio test to test the null hypothesis that s is zero.

7. A regulatory gene controls the expression of other genes—it turns them on and off.
Many of these targets are themselves regulatory genes, instructing yet other genes to turn on or off. Guelzim et al. (2002) determined the number of regulatory genes controlled by each gene in a sample of genes in yeast. Their data are listed in the following table.

Number of regulatory genes controlled, i    Frequency, fi
0                                           72
1                                           18
2                                           10
3                                           5
4                                           1
5                                           3
Total                                       109

The shape of this frequency distribution suggests that it might be approximated by a probability distribution known as the geometric distribution. Under a geometric distribution, the fraction of genes controlling no regulatory genes (here, i = 0) is p. The fraction controlling exactly one regulatory gene (i = 1) is (1 − p)p. The fraction controlling exactly two regulatory genes (i = 2) is (1 − p)²p, and so on, yielding

$$\Pr[i] = (1-p)^{i} p,$$

where i is 0, 1, 2, or more. Maximum likelihood methods can be used to estimate the parameter p, the fraction controlling no regulatory genes, from data. The log-likelihood formula for p is

$$\ln L[p \mid \text{data}] = \ln[1-p]\left(\sum_i i f_i\right) + n \ln[p],$$

where fi is the frequency of observations corresponding to i = 0, 1, 2, and so on, and n is the total sample size.
a. Using the data in the table and the preceding formula, calculate the log-likelihood of values of p between 0.1 and 0.9 in increments of 0.01. Use a computer. Draw the relationship between the log-likelihood and p with a log-likelihood curve.
b. What is the maximum likelihood estimate p̂, to an accuracy of two decimal places?
c. What is the value of the log-likelihood at the maximum likelihood estimate p̂?
d. Using the formula for the geometric distribution, Pr[i] = (1 − p)ⁱp, calculate the predicted proportion of genes regulating i = 0 to 5 genes based on p̂. Plot these values on a histogram with the observed proportions. Based on the result, do the data appear to follow a geometric distribution?
e. Identify one method you might use to test the null hypothesis that a geometric distribution fits the data.

8. Phylogenetic trees like those in Figure 20.2-1, if dated, permit us to estimate the rate at which new species have formed over time. For example, the following tree indicates the timing of events in the history of the living species of Dubautia, the Hawaiian silverswords (Baldwin and Sanderson 1998), a famously diverse group of plants found only on the Hawaiian islands. Assume that the number of Dubautia species has increased exponentially over time and has suffered no extinctions. In this case, the time interval between successive branching events on the tree provides information about the rate of increase of the number of species, λ (lambda). The earliest four intervals are indicated in red on the tree; with 23 species, there are n = 21 such intervals in total. Let Yi measure the duration of each interval i multiplied by the number of species alive at that time. We'll call these Yi the "waiting times." The n = 21 values of Yi from the above tree are

1.0, 3.0, 0.8, 5.0, 2.4, 2.1, 0.8, 3.6, 5.0, 0.0, 0.0, 0.0, 1.4, 0.0, 1.6, 3.4, 0.0, 0.0, 2.0, 0.0, 4.4.

Under the assumptions given, the Yi values should follow an exponential distribution (Hey 1992), with probability highest at zero and declining smoothly with increasing Y. To illustrate, an exponential distribution is superimposed on the following histogram of the Yi values. Use the waiting-time data to estimate the rate λ. The formula for the log-likelihood of λ, given the Yi, is

$$\ln L[\lambda \mid \text{observed waiting times}] = n\ln[\lambda] - \lambda\sum_i Y_i.$$
a. Using a computer program, calculate the log-likelihood of values of λ between 0.1 and 0.9 in increments of 0.01. Draw the relationship between the log-likelihood and λ with a log-likelihood curve.
b. Find the maximum likelihood estimate of λ to two decimal places. This estimates the number of new species produced per species per million years in Dubautia.
c. Using a similar approach as in part (b), find the likelihood-based 95% confidence interval for λ.

ASSIGNMENT PROBLEMS

9. Huntington disease is an inherited neurodegenerative disease caused by a mutation in the huntingtin gene. When did this mutation arise? García-Planells et al. (2005) found that most Spanish cases of the disease share the same ancestral mutation. Using gene sequence data from patients at genetic loci linked to the mutation, the authors applied a likelihood method similar to that used to date events in the human-ape lineage (Figure 20.2-1). Their calculations produced the following log-likelihood curve for the date of origin of the huntingtin mutation, measured as the number of generations before the present.
a. With the aid of a ruler, use this curve to find the maximum likelihood estimate for the date of origin of the mutation shared by most Spanish cases of Huntington disease.
b. Using this same curve, approximate a 95% confidence interval for the date of origin.
c. What does the interval in part (b) measure? Explain.

10. The Mediterranean shrub Thymelaea hirsuta has five sexual types, the most curious of which is "gender labile." Such individuals change their predominant sex from year to year. Ramadan et al. (1994) found 13 gender-labile individuals in a sample of 68 shrubs from a single habitat in Egypt.
a. What probability distribution would we use to calculate the probability of a specific number of gender-labile individuals in a sample of size 68?
b. Write the formula for the likelihood of p, given the data. What does this likelihood measure?
c. What are your assumptions in part (b)?
d. Write the formula for the log-likelihood of p.
e. Calculate the log-likelihood of the value p = 0.40, given the data.

11. Refer to Assignment Problem 10. The following table lists log-likelihoods for a range of values of p between 0.1 and 0.3, in increments of 0.01.

Proportion, p    Log-likelihood    Log-likelihood − maximum
0.10             −4.652            −2.550
0.11             −4.027            −1.925
0.12             −3.517            −1.415
0.13             −3.105            −1.003
0.14             −2.778            −0.676
0.15             −2.524            −0.422
0.16             −2.336            −0.234
0.17             −2.207            −0.105
0.18             −2.130            −0.028
0.19             −2.102            0.000
0.20             −2.119            −0.017
0.21             −2.176            −0.074
0.22             −2.272            −0.170
0.23             −2.404            −0.302
0.24             −2.570            −0.468
0.25             −2.768            −0.666
0.26             −2.996            −0.894
0.27             −3.254            −1.152
0.28             −3.539            −1.437
0.29             −3.853            −1.751
0.30             −4.192            −2.090

a. Using only the calculations based on these increments, identify the maximum likelihood estimate.
b. Using these same calculations, generate a likelihood-based 95% confidence interval for p.

12. Sacktor et al. (2000) measured the neuropsychological performance of 33 HIV-positive patients undergoing antiretroviral therapy. Hand-use performance improved in 23 of the 33 patients but deteriorated in the remaining 10 patients. The graph below shows the log-likelihood curve for the population proportion p of patients whose hand-use performance improved after antiretroviral therapy.
a. Using only the information in the log-likelihood curve, test the null hypothesis that the probability p is 0.5 using the log-likelihood ratio test.
b. Using the same curve, provide an approximate 95% confidence interval for the proportion of patients improving in the population.

13. Ants are capable of chemically discriminating between nestmates and non-nestmates. Ozaki et al. (2005) discovered sensory cells in the antennae of ants that respond only to the cuticular hydrocarbons (CHCs) of non-nestmates. They showed that 42 of 48 head-attached antenna preparations responded to chemical extracts of CHC compounds obtained from non-nestmates. Use likelihood to estimate the proportion of preparations p responding to non-nestmate CHC extracts.
a. Using a computer program, calculate the log-likelihood of a series of values of p between 0.50 and 0.99 in increments of 0.01. Draw the relationship between the log-likelihood and p with a log-likelihood curve.
b. Find the maximum likelihood estimate for p to two decimal places.
c. Calculate the likelihood-based 95% confidence interval for the parameter p.

14. Although we used the log-likelihood ratio test to analyze the data on the preference of parasitic wasps for female butterflies (Example 20.3), we could have used the binomial test (Chapter 7) or the χ² goodness-of-fit test (Chapter 8). The data are reproduced in the following frequency table.

Preference             Observed frequency
Chose mated female     23
Chose virgin female    9
Total                  32

Analyze the data again, but this time use the G-test, a goodness-of-fit test that we learned about in Chapter 9. The formula for the G-statistic is given in the following equation, where Observedᵢ and Expectedᵢ refer to the observed and expected frequencies in category i, respectively. Under the null hypothesis, G is approximately χ² distributed with k − 1 degrees of freedom, where k is the total number of categories.

$$G = 2\sum_i \text{Observed}_i \ln\!\left[\frac{\text{Observed}_i}{\text{Expected}_i}\right].$$

a. Restate the hypotheses for the parasitic wasp and butterfly data, and provide the expected frequencies under the null hypothesis.
b. Calculate the G-statistic using the formula provided.
c. Compare your result in part (b) with the critical value from the χ² distribution with the appropriate degrees of freedom, and state your conclusion.
d. Compare the value of G that you calculated in part (b) with the value of the log-likelihood ratio statistic that we calculated in Section 20.3. Based on your comparison, what can you infer about the G-test?

15. Dispersal and the movement distance of organisms are sometimes described using a probability distribution known as the geometric distribution (see Practice Problem 7). For example, the following histograms show the number of home ranges separating the locations where individual male and female field voles, Microtus agrestis, were first trapped, and the locations where they were caught in a subsequent trapping period (Sandell et al. 1991). Most individuals didn't move at all, a few moved just one home range away, and a small fraction moved two home ranges. If these frequencies are described by a geometric distribution, then the fraction of observations in the first category (here, i = 0) is given by the parameter p. The fraction in the second category (i = 1) is (1 − p)p. The fraction in the third category (i = 2) is (1 − p)²p, and so on, yielding

$$\Pr[i] = (1-p)^{i} p,$$

where i is 0, 1, 2, or more. Maximum likelihood methods can be used to estimate the parameter p from the data. The log-likelihood formula for p is

$$\ln L[p \mid \text{data}] = \ln[1-p]\left(\sum_i i f_i\right) + n \ln[p],$$

where fi is the frequency of observations corresponding to i = 0, 1, 2, and so on, and n is the total sample size.
a. Forty-eight of 56 male voles stayed put (i = 0), whereas the remaining eight moved one home range. Using a computer program, calculate the log-likelihood of values of p between 0.70 and 0.99 in increments of 0.01. Draw the relationship between the log-likelihood and p with a log-likelihood curve.
b. Find the maximum likelihood estimate p̂ for male voles to two decimal places.
c. Obtain a likelihood-based 95% confidence interval for p.

16. Refer to Assignment Problem 15. Seventy-five of 89 female voles remained where they were first caught (i = 0), 12 moved just one home range away (i = 1), and two moved two home ranges away (i = 2).
a. Repeating the procedures described in Assignment Problem 15, calculate the maximum likelihood estimate of p for females.
b. Similarly, find a likelihood-based 95% confidence interval for p. Does it overlap that for males? (See Assignment Problem 15.)

17. Is the perception of eye-gazing in humans acquired through experience, or is it innate? To test this, Farroni et al. (2002) presented 17 infants (3–5 days old) with paired photographs of faces. The face in one photograph had a direct gaze, whereas the other had an averted gaze (see image). Fifteen of 17 infants spent more time gazing at the face with the direct gaze. Use the log-likelihood ratio test to test the null hypothesis of no preference. Show the steps of your calculations.

18. Life spans of individuals in a population often approximate an exponential distribution, a continuous probability distribution having probability density f(Y) = λe^(−λY), where λ is the mortality rate. To estimate the mortality rate of foraging honey bees, Visscher and Dukas (1997) recorded the entire foraging life span of 33 individual worker bees in a local bee population in a natural setting. The 33 life spans (in hours) are listed as follows and in a histogram.

2.3, 2.3, 3.9, 4.0, 7.1, 9.5, 9.6, 10.8, 12.8, 13.6, 14.6, 18.2, 18.2, 19.0, 20.9, 24.3, 25.8, 25.9, 26.5, 27.1, 30.0, 33.3, 34.8, 34.8, 35.7, 36.9, 41.3, 44.2, 54.2, 55.8, 65.8, 71.1, 84.7

If life span follows an exponential distribution, then the log-likelihood of λ, given the data, is

$$\ln L[\lambda \mid \text{observed life spans}] = n\ln(\lambda) - \lambda\sum_i Y_i,$$

where Yi is the life span of individual i, and n is the sample size.
a. Estimate λ, the mortality rate of bees per hour. Using a computer, find the maximum likelihood estimate of λ to two decimal places.
b. What is the value of the log-likelihood at the maximum likelihood estimate λ̂?
c. What is the likelihood-based 95% confidence interval for λ?

21. Meta-analysis: combining information from multiple studies

Most papers in scientific journals present the outcome of just one experiment or of one observational study. Each study is typically based on a single random sample or on a small number of related samples. Compelling issues in biology, however, deserve more than one study. You will rarely find a topic so unique that only one published study addresses it. The first test of a truly original idea is almost always followed by studies that duplicate it to some extent. Most new studies address a question that has been asked and answered before, often many times. For example, by 1985 at least 118 data sets had been published testing whether the phase of the moon affects human behavior (Rotton and Kelly 1985). At least 24 studies have investigated whether acupuncture can help you stop smoking (White et al. 2006).
Studies on a topic accumulate in the literature, and at some point they must be summarized to yield an overall conclusion. How do we combine the information that we get from multiple studies?

The review article is the traditional answer to this question. An expert in the field assembles the studies published on a topic, thinks about them carefully and (hopefully) fairly, and then writes a review article summarizing the overall conclusions reached. There is much to be said for this approach. A first-rate review article advances a field far beyond a mere summary. Such a review will propose new hypotheses, uncover previously unnoticed relationships, and point to new paths of research. A strong review provides a solid indication of the state of thought and knowledge about a particular topic.

But the traditional review usually lacks a quantitative methodology, which can lead to two problems. First, the conclusions of one reviewer are often partly subjective, perhaps weighing studies that support the author's preferences more heavily than studies with opposing views. An extreme example of this is Linus Pauling's (1986) book How to Live Longer and Feel Better, in which the author cited 30 studies supporting his own idea that taking large daily doses of vitamin C reduces the risk of contracting the common cold. Pauling1 cited no studies opposing this idea, even though a number of such studies had been published (Knipschild 1994). Not all reviews are blatantly biased, but complete objectivity is difficult for any reviewer to achieve when evaluating a diverse set of published studies.

The second problem is that it is extremely difficult to balance multiple studies by intuition alone, without quantitative tools. The combined strength of statistical support and the average magnitude of an effect simply cannot be determined without some calculations. "Vote counting" is a step in the right direction—that is, count the number of published studies that have rejected the null hypothesis of zero effect and compare this count with the number of studies failing to reject the null hypothesis. Yet even this approach ignores important information on the size of each study, the strength of an effect, and the magnitude of our uncertainty.

A set of techniques called meta-analysis provides quantitative tools for the synthesis of information from multiple studies. In this chapter, we give an overview of the approach using examples, and we point out its advantages and limitations. We also indicate how best to report the results of your own studies so that they will be useful to future meta-analyses.

What is meta-analysis?

Meta-analysis is the "analysis of analyses." It is not a single statistical test but rather a set of techniques that allows us to address statistical questions about collections of data obtained from the published literature and other sources. The researcher begins a meta-analysis by gathering all known studies about a particular topic. Typically, these studies report a broad range of results, with some rejecting the null hypothesis and others not rejecting it. Given such a broad range of results, how can we decide what the truth is? Meta-analysis provides ways to combine information from independent studies.

A meta-analysis compiles all known scientific studies estimating or testing an effect and quantitatively combines them to give an overall estimate of the effect.
With meta-analysis, we can address questions like these:
■ How strong is the effect we are studying, on average?
■ Does the collection of studies reject a null hypothesis?
■ How variable is the effect? What factors might explain this variability?
■ Is there any bias resulting from the publication of some studies and not others?
■ Are there enough unpublished studies to change our conclusions?

The quantitative details of meta-analysis lie beyond the scope of this book. If you want to know more about the techniques, have a look at Borenstein et al. (2009).

Why repeat a study?

Why would anyone bother to test an idea that has already been addressed in one or more previous studies? Repetition is necessary in science because errors—both statistical errors and methodological flaws—can be made in individual studies. A single study might yield a significant result when the null hypothesis is true (a Type I error), or it might fail to detect a true effect (a Type II error). Moreover, what is true in one circumstance may not be true in another. For example, one kind of animal might tend to evolve to larger size on islands (think of the three-meter Komodo dragon on the Indonesian islands of Komodo and Flores), whereas another kind of animal may do the opposite (for example, the five-foot pygmy elephants living on Flores, thought to be the original diet of the dragons). Finally, we all make mistakes sometimes, so experimental designs can be flawed or errors can be made in recording data. In short, we need to have more than one study on any phenomenon that we care about.

The power of meta-analysis

One of the most useful aspects of meta-analysis is that it can greatly increase the statistical power of our collective statistical tests. Any one study faces limitations, sometimes severe, on the amount of time and money available to do the research. As a result, the sample size of any given study is limited, yet the power of a study depends on sample size. When we address a scientific question with meta-analysis, we combine the power of all the individual studies. As a result, weak but important effects can sometimes be detected. This power can be particularly important in medical studies, where even a small reduction in mortality rates from an inexpensive and relatively harmless treatment can result in thousands of lives saved.

EXAMPLE 21.2 Aspirin and myocardial infarction

People at risk of stroke or myocardial infarction (a "heart attack") are often advised to take aspirin or another antiplatelet medication. These kinds of medication are known to reduce the risk of future stroke, myocardial infarction, and death (all of which we will refer to as "vascular events"). But how do we know this is true? It turns out that the effect of aspirin and other antiplatelet agents was confirmed by a large meta-analysis conducted by a large collaboration of researchers (Antiplatelet Trialists' Collaboration 1994). They combined information from 142 randomized trials of antiplatelet treatments on patients who had previously had a stroke or similar disease. In total, the trials included more than 70,000 patients. The results of these trials are summarized in Figure 21.2-1.

FIGURE 21.2-1 The vascular event rate for patients on antiplatelet therapy compared with patients in the corresponding control group. Each dot is a separate study, except at the origin, where 28 studies overlap. The one-to-one line marks equal vascular event rates in the two treatments.
Before the meta-analysis, whether aspirin helped prevent vascular events was far from clear. Of the 142 studies reviewed, only 87 showed a better result for patients on antiplatelet therapy than for the control patients. Moreover, only 19 of these 87 showed significantly better results for the treated patients than the control patients at α = 0.05, while 68 had nonsignificant results. Two of the 142 studies even showed a significantly worse rate of vascular events with aspirin treatment! A simple vote-counting procedure does not give overwhelming support for aspirin.

When the results are combined in a meta-analysis, however, the answer is surprisingly robust. It was relatively straightforward to combine the results, because all of the experiments involved the same kind of data: frequencies of control and treated patients who either did or did not have subsequent vascular events. Each of these studies had roughly the same number of treated individuals as control individuals. For display purposes only, we added up the frequencies of patients in each category and produced the contingency table shown in Table 21.2-1.

TABLE 21.2-1 Numbers of patients in each treatment group and outcome category in 142 studies combined.

                                             Treated with             Control (no
                                             antiplatelet therapy     antiplatelet therapy)    Row totals
Vascular event (stroke, infarction, death)   4,183                    5,400                    9,583
No vascular event                            32,353                   31,311                   63,664
Column totals                                36,536                   36,711                   73,247

In total, 14.7% (5400/36,711) of the patients in the control groups had subsequent vascular events, compared with 11.4% (4183/36,536) in the treated group. This difference may seem small, but it corresponds to the control patients having a 30% higher chance of serious problems than the treated patients. Put another way, if just the individuals in the controls of these experiments had been given antiplatelet treatment, there would have been about 1200 fewer strokes, infarctions, and deaths. Since the side effects associated with antiplatelet treatment are low, this result (if statistically supported) would justify using aspirin.

We cannot do a simple odds-ratio (OR) calculation on the combined table, because the data points are not all independent. We can, however, combine the odds ratios for each study using the Mantel-Haenszel confidence interval (see the Quick Formula Summary in Section 21.8). The Mantel-Haenszel procedure is specifically designed to combine information across multiple contingency analyses. Doing so, the researchers found that taking aspirin had a clear beneficial effect (0.71 < OR < 0.75). Without the analysis of the data across all available studies, there would have been little chance of confirming the value of this relatively important treatment.

In fact, the meta-analysis came 10 years after it would have been possible to show an effect by combining across the studies that already existed at the time. It took another six years before the results of this meta-analysis were accepted by the medical community, perhaps in part due to the relative newness of meta-analysis techniques (Hunt 1997). The number of extra deaths and the suffering in the intervening years that could have been avoided by earlier application and acceptance of these meta-analysis techniques is sobering to think about.
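The percentages and the "about 1200 fewer" figure quoted above follow directly from Table 21.2-1. Here is the arithmetic as a short check (our sketch, not part of the original analysis).

```python
# Verifying the arithmetic from Table 21.2-1.
events_treated, n_treated = 4183, 36536
events_control, n_control = 5400, 36711

rate_treated = events_treated / n_treated   # about 0.114 (11.4%)
rate_control = events_control / n_control   # about 0.147 (14.7%)

# Controls suffered roughly 30% more vascular events per patient:
print(rate_control / rate_treated)          # about 1.28

# Events avoided if the controls had experienced the treated group's rate:
print(events_control - rate_treated * n_control)   # about 1200
```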
Meta-analysis can give a balanced view

Studies that find dramatic results are sometimes given more attention in the press and scientific literature than better studies with less "newsworthy" results. As a consequence, we require fair summaries of the breadth of knowledge on a question to give us a balanced view. Meta-analysis attempts to cover all studies fairly.

EXAMPLE 21.3 The Transylvania effect

Many people believe that a full moon can affect human behavior. The word "lunacy" is derived from the Latin luna, for moon, and legends of strange beings, such as werewolves and vampires, have been connected to full moons for centuries. Some studies have shown that abnormal human behavior increases during full moons, as measured by such variables as homicide rates, psychiatric hospital admissions, suicide rates, and crisis calls. This phenomenon is called the "Transylvania effect." Other studies, however, have reported no significant effects. A meta-analysis of these studies (Rotton and Kelly 1985) found no statistically significant change in "lunacy" during the full moon. Moreover, the best estimate of the average effect was that less than 1% of the variation in these events was explained by moon phase. Even if the moon did affect abnormal human behavior, the average effect is so small as to be unimportant. The results are shown in the funnel plot in Figure 21.3-1. (See Interleaf 10 for a discussion of funnel plots.)

FIGURE 21.3-1 A funnel plot of the effect sizes from the meta-analysis of the "Transylvania effect." The horizontal line marks zero correlation between moon phase and unusual behavior.

The literature on the Transylvania effect includes some studies with apparently very strong relationships between moon phase and human behavior. The most extraordinary of these papers have received a great deal of media attention and word of mouth among police and hospital workers. Thus, there is a tendency to believe that the sensational studies are typical of all those performed. This meta-analysis shows that they are just the tips of the sampling iceberg; when all data sets are examined, there is little or no effect of moon phase on aberrant human behavior.

The steps of a meta-analysis

In this section, we describe the process of combining information using meta-analysis. We focus on some of the issues that arise when summarizing data, but we don't provide mathematical details.

Define the question

A key step in meta-analysis is defining the question and the breadth of studies to which it applies. Some meta-analyses apply a question to a very narrow set of studies, all of which are expected to estimate the same true value. Meta-analyses in medical research are usually of this type. For example, in the aspirin/heart attack meta-analysis of Example 21.2, the question was whether aspirin affected the risk of future myocardial events. To answer this question, the authors reviewed studies that had very similar properties. They included only studies done on human participants who had suffered a stroke or heart attack. All of the studies were randomized clinical trials comparing an aspirin treatment with a control treatment, and all measured the future risks to the participants in terms of significant health problems—namely, heart attacks, strokes, or death. Different studies, therefore, estimated very similar effects, and all should give more or less the same answer, except for sampling error. The goal of the meta-analysis was to combine the multiple studies—in effect, to create one large study with more power than any of the single studies. A meta-analysis of such a homogeneous set of studies is called a fixed-effect meta-analysis.
At the other extreme, a meta-analysis might address a question whose answer requires reviewing a broad and heterogeneous mix of studies. Meta-analyses in non-medical areas of biology are often of this type. For example, Kingsolver et al. (2001) conducted a meta-analysis to estimate the average strength of natural selection in the wild. Answering this question required combining studies carried out on different species living in different kinds of environments. The variables measured were not the same from study to study. A meta-analysis of such a heterogeneous set of studies is called a random-effect meta-analysis. In this case, the goal is to obtain an average over the separate effects of all the studies. If the studies in the meta-analysis are similar enough, and if we can standardize their responses in some way, then the average effect may be informative. Example 21.4 describes a meta-analysis of a heterogeneous mix of studies, even though all were carried out on the same species.

EXAMPLE 21.4 Testosterone and aggression

Using a meta-analysis, Book et al. (2001) asked the question, "Are testosterone levels and aggression correlated in humans?" The studies reviewed for this meta-analysis were all done on humans, but the ways aggression was measured were extremely diverse. Some compared the levels of testosterone in prisoners convicted of violent crimes with those of prisoners convicted of property crimes. One correlated the levels of testosterone in university students with their answers to questionnaires that asked them for levels of agreement with statements like "If somebody hits me, I hit back." Others were less prosaic. One measured the levels of aggression in !Kung San males by counting "their scars and sometimes still open wounds in the head region." Another compared drunken Finnish spouse-abusers with drunken Finns drinking quietly in a bar. Another compared members of "rambunctious" fraternities with "responsible" frats.2 In this example, "aggression" was defined in a variety of ways. They tested the null hypothesis: "Testosterone has no average effect on human aggression levels."

The authors wanted to ask a rather broad question covering a variety of types of people and types of aggression, so many studies were relevant. These studies do not repeat one another; each addresses a different specific question, but they all address the same general question. Meta-analyses in biology can be even broader, often including results from a large number of species. This is appropriate if the question is general enough, but it should always be remembered that the different subgroups in the meta-analysis may actually have different responses.

Review the literature

Once the question is defined, the hardest part of a meta-analysis ensues. We must collect all of the available information that pertains to the question at hand. In principle, this is simple, but in practice, collecting even a fraction of the pertinent literature can be a Herculean task. Meta-analysts often take years to collect the information for their reviews, although computerized databases are making the task easier. Most libraries do not carry subscriptions to all journals, so getting copies of some articles can be difficult. More importantly, many good studies are not published in journals at all and therefore are not in computer databases. Sometimes they are reported in the "gray literature"—monographs and in-house publications of various foundations, institutes, and government agencies.
Lots of science is reported only in doctoral dissertations and master's theses. Finally, many studies are not published at all, so an effort has to be made to contact researchers in the field to discover what information exists but is not widely available.

It is crucial to find everything, to minimize the effects of "publication bias." As discussed in Interleaf 10, studies that find large and significant effects are more likely to be published, more likely to appear in "first-rate" journals, and more likely to be referenced in other articles. All of this means that the studies we can find easily are different from those that we cannot so easily find. Some statistical techniques exist to partially account for such biases (see the discussion of funnel plots in Interleaf 10 and the "file-drawer problem" in Section 21.5), but they are not nearly as reliable as a proper literature review.

Meta-analyses contain anywhere from a handful of studies to thousands, depending on the amount of information available, the amount of effort expended by the meta-analysts themselves, and the breadth of the question asked. The more information available for analysis, the more reliable the final results are likely to be.

One key issue that arises when compiling studies is whether to include studies of apparently poor quality. An unfortunate fact of science is that not all studies are done well; in spite of long training, large amounts of money spent on research, and the peer review process, some bad science is still published. What do we do about these studies? We could just ignore them. But this raises a thorny point: we may be more likely to reject a study as "bad" if its conclusions do not agree with our own opinions about the scientific question at hand. Good meta-analyses circumvent this problem by having the methods of each study (alone, separated from the results) rated by independent scientists and then weighting the study's contribution to the meta-analysis by the judges' scores. Others set standards a priori, such as collecting only studies that experimentally and randomly assigned individuals to treatments and controls. The question of how to deal with varying quality of science is not fully resolved, and it represents one of the greatest challenges for the future of meta-analysis.

Compute effect sizes

The core quantity of meta-analysis is the effect size, which measures the strength of the association between the explanatory variable X and the response variable Y. The magnitude is important—for example, we might need to know whether the effect of a drug is large enough to exceed any negative consequences or costs the treatment might have.

The effect size is a standardized measure of the magnitude of the relationship between two variables.

In meta-analysis, we are combining across multiple studies, and usually these studies do not measure their results on the same yardstick. Some may even be measuring completely different things. How do we combine measures of "aggression" that range from the number of scars (a numerical variable) to whether or not someone is imprisoned for murder (a categorical variable)? How do we combine information that comes from a correlation analysis with results from comparisons of means and contingency tables? Fortunately, methods have been developed for putting quantities obtained from different kinds of data onto a common, standardized scale. Most meta-analyses use one of three common measures of effect size.
These are the odds ratio (OR; see Section 9.2), the correlation coefficient (r; see Section 16.1), and the standardized mean difference (SMD; see below). Which measure of effect size is used depends on the variables. The odds ratio compares the frequencies of "success" and "failure" between two groups or treatments. The correlation coefficient typically measures the relationship between two numerical variables, although it can also be generalized to describe many kinds of data (Rosenthal 1991). The standardized mean difference is useful when most studies being combined compare the mean of a numerical response variable between two groups. SMD is defined as

$$\mathrm{SMD} = \frac{\bar{Y}_1 - \bar{Y}_2}{s_{\text{pooled}}},$$

where $s_{\text{pooled}}$ is the square root of the pooled sample variance. In other words, SMD is the difference in the sample means of two groups divided by the pooled estimate of their standard deviation. SMD measures the difference between groups in units of standard deviation.3 Each of these effect-size measures allows the results of multiple studies to be compared with one another. The effect sizes from the different studies included in a meta-analysis must be calculated in the same way. For example, when using SMD, the mean of the control group should be subtracted from the mean of the treatment group in every study. If a positive effect size indicates improvement in one study, then every other study that showed an improvement associated with the treatment ought to have a positive effect size, too. Table 21.4-1 lists results from nine of the 54 studies in the meta-analysis of testosterone and aggression (Book et al. 2001; only nine are shown here to keep the table brief). For each of these nine studies, the original measure of association was either a correlation coefficient (r) or a difference of group means. All measures of association were then converted to the equivalent correlation coefficient r to put them on the same scale. (See Rosenthal 1991 for computational details.) Degrees of freedom (df), reflecting the sample size of each study, are also listed in Table 21.4-1, along with the P-value.

TABLE 21.4-1 A subset of the results of a meta-analysis on the relationship between testosterone and aggression in humans.

Reference | Type of study | Effect size (r) | df | P
Gray et al. (1991) | Aging men (39-70); questionnaires | 0.02 | 1677 | 0.21
Houser (1979) | 20-something males; questionnaires | 0.086 | 3 | 0.45
Olweus et al. (1980) | Swedish male adolescents; self-reports of aggression | 0.24 | 56 | 0.035
Banks and Dabbs (1996) | College students compared with "Americans who belonged to a deviant and delinquent urban subculture" | 0.30 | 61 | 0.008
Harris et al. (1996) | University students (men); questionnaires | 0.36 | 153 | 1.7 × 10⁻⁶
Harris et al. (1996) | University students (women); questionnaires | 0.41 | 149 | 5.8 × 10⁻⁸
Christiansen and Winkler (1992) | !Kung San men; number of scars from fights | 0.06 | 112 | 0.27
Lindman et al. (1992) | Drunken Finns arrested for spousal abuse vs. drunken Finns in bars | 0.18 | 34 | 0.15
Dabbs et al. (1996) | "Rambunctious" frats vs. "responsible" frats | 0.26 | 93 | 0.006

Several of these nine studies rejected the null hypothesis that there is no association between aggression and testosterone, but not all. The funnel plot in Figure 21.4-1 shows how variable the results are among all 54 studies. Some studies show a strong positive relationship between testosterone and aggression, whereas others show no significant effect. One study even shows a significant negative correlation. This variability stems in part from sampling error, but it also likely reflects differences between studies in the types of variables measured and the populations sampled.

FIGURE 21.4-1 A funnel plot of studies comparing human aggression to levels of testosterone. The curves show the approximate boundaries of the critical regions that would reject the null hypothesis in any one study with α = 0.05.
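The conversions used to build a table like Table 21.4-1 are simple to script. The following is a minimal sketch in Python (our own illustration; the book prescribes no software, and the function names and the numbers fed to them are invented). It computes the SMD from two groups' summary statistics and converts a two-sample t-statistic to the equivalent r, one of the standard conversions described by Rosenthal (1991).

    import numpy as np

    def smd(mean1, mean2, s1, s2, n1, n2):
        """Standardized mean difference: (Ybar1 - Ybar2) / s_pooled."""
        # Pooled sample variance, weighting each group by its degrees of freedom
        var_pooled = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
        return (mean1 - mean2) / np.sqrt(var_pooled)

    def t_to_r(t, df):
        """Convert a two-sample t-statistic to the equivalent correlation r."""
        return np.sign(t) * np.sqrt(t**2 / (t**2 + df))

    # Invented treatment and control summaries (mean, SD, n):
    print(round(smd(5.2, 4.6, 1.1, 1.3, 30, 32), 3))
    print(round(t_to_r(2.60, 93), 3))   # about 0.26 for t = 2.6 with df = 93

The second conversion is what allows a study reporting only a t-test to be placed on the same r scale as a study reporting a correlation.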
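For correlation coefficients, one standard way to implement the weighting and confidence interval described above (a sketch of the general idea, not necessarily the exact procedure used by Book et al. 2001) uses Fisher's z-transformation, whose sampling variance is approximately 1/(n − 3). Each study is weighted by the inverse of that variance, and the weighted mean and its confidence limits are then back-transformed to the r scale. This is the fixed-effects version; random-effects methods add a between-study variance component to each weight.

    import numpy as np

    def mean_effect_r(r_values, n_values):
        """Fixed-effects weighted mean of correlations via Fisher's z.
        Returns the back-transformed mean r and its 95% confidence interval."""
        r = np.asarray(r_values, dtype=float)
        n = np.asarray(n_values, dtype=float)
        z = np.arctanh(r)                # Fisher z-transform of each r
        w = n - 3.0                      # weight = 1 / var(z), with var(z) ~ 1/(n - 3)
        z_mean = np.sum(w * z) / np.sum(w)
        se = 1.0 / np.sqrt(np.sum(w))    # SE of the weighted mean on the z scale
        ci = np.tanh([z_mean - 1.96 * se, z_mean + 1.96 * se])
        return np.tanh(z_mean), ci

    # Invented mini meta-analysis of three studies (r values and sample sizes):
    mean_r, ci = mean_effect_r([0.02, 0.24, 0.30], [1679, 58, 63])
    print(mean_r, ci)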
Look for effects of study quality

Not all studies in a meta-analysis are the same. They vary in sample size and in their methodologies. As part of a meta-analysis, we can and should ask whether these differences among studies influence the outcome. Larger published studies (those based on larger sample sizes) tend to give more reliable estimates of effect size, because the effect of publication bias on them is likely to be smaller. Small studies that do not yield a statistically significant outcome are less likely to be published than those yielding a significant result (see Interleaf 10), inflating the average effect size among those small studies that do make it to publication. Large studies are less affected by this problem, because larger sample sizes are more likely to detect an effect if one is present. Large studies also require much more effort and funds than small studies; therefore, researchers are more likely to publish their conclusions whatever the results. One way to address this possibility is to look across all studies included in the meta-analysis for a correlation between sample size and effect size (see the short code sketch below). For example, there is a negative relationship between sample size and effect size in the meta-analysis comparing morphological asymmetry to mating success discussed in Interleaf 10 (Spearman $r_S = -0.30$, df = 138, P = 0.003). The effect size is much smaller for large studies than for small studies. (See the funnel plot in Interleaf 10.) As a result, we should distrust the smaller studies in that meta-analysis. In contrast, in the meta-analysis of the effects of aspirin on vascular events (Example 21.2), there was no detectable relationship between sample size and effect size. In that meta-analysis, we have no reason to think that the smaller studies were influenced by publication bias. Moreover, there are often differences in average effect size between studies of high quality and those of low quality. For example, studies without blinding (Section 14.3) have systematically larger effect sizes than studies that include this method of reducing bias (Jüni et al. 2001). Meta-analyses commonly find differences in average effect size between observational and experimental studies, between studies that did and did not include randomization, and so on. When such differences are found, we should use only the better-quality studies to draw our conclusions.

Look for associations

Another advantage of meta-analysis is that the different studies can be used to examine the effects of methodological or other differences between studies. That is, we are looking for moderator variables—variables that can explain some of the variation in effect size. Scientifically interesting factors can be responsible for the heterogeneity of effect size across studies. Meta-analysis can suggest relationships that were not even addressed in any of the component studies. For example, a meta-analysis of the efficacy of homework showed that homework stimulated learning, despite claims to the contrary in the education literature (Cooper and Valentine 2001). More surprising, homework seemed to have little benefit for elementary schoolchildren but large benefits for high school students. This effect was detected in the meta-analysis only; none of the individual studies being analyzed tested the effect of grade level directly, and no one had predicted it theoretically. A potentially important conclusion was reached by combining information across studies and looking at the effect of a key moderator variable that had not been explicitly addressed in any one study.
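The sample-size check promised above takes one line once each study's sample size and effect size are in hand. The arrays here are invented for illustration; with real data they would come from the compiled meta-analysis table.

    from scipy.stats import spearmanr

    # Hypothetical per-study sample sizes and effect sizes:
    sample_sizes = [12, 25, 40, 80, 160, 640]
    effect_sizes = [0.45, 0.38, 0.30, 0.18, 0.10, 0.05]

    rho, p = spearmanr(sample_sizes, effect_sizes)
    print(rho, p)   # a strong negative rank correlation hints at publication bias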
File-drawer problem

All of the meta-analysis techniques we have described assume that the studies being reviewed are a random sample of all possible studies on that topic. But in one very important way, the studies available for synthesis are not random: they tend to be drawn mainly from the published literature, and this literature usually suffers from publication bias. As mentioned in Interleaf 10, publication bias is the difference in mean effect size between published studies and all studies on the topic. In meta-analysis, the difficulties caused by publication bias are called the file-drawer problem, in reference to the unknown studies sitting unavailable in researchers' file drawers or hidden in obscure journals. The file-drawer problem is the possible bias in estimates and tests caused by publication bias. Meta-analysis can increase power and reduce the Type II error rate—that is, combining across studies makes it easier to reject a false null hypothesis. But if the available data are biased, meta-analysis may also have a higher chance than expected of rejecting a true null hypothesis (a Type I error). A few methods partially address these problems. Funnel plots (see Figure 21.4-1 and Interleaf 10), for example, can give some indication of the bias resulting from small studies. (If you haven't done so already, please read Interleaf 10. We won't repeat that material here.) Another partial solution to the file-drawer problem is to calculate how many missing studies would be needed to change the overall result of the meta-analysis. A standard technique assumes that all of these missing studies failed to reject the null hypothesis of no effect. The method then calculates the number of missing studies required to reach the point at which the null hypothesis is no longer rejected by the meta-analysis. This number is called the fail-safe number.4 If the fail-safe number is small (i.e., roughly the same as the number of published studies included in the initial meta-analysis), then the results of that meta-analysis should be regarded as unreliable. If the fail-safe number is very large (e.g., in the millions), then we can be more confident that the meta-analysis is giving us the right answer—it is simply too unlikely that a million unpublished studies on the subject exist. The fail-safe number calculation assumes that the direction of the effect detected in a study does not influence the likelihood of publication. Unfortunately, studies that detect an effect opposite to that of most published studies might also be lost from publication. For example, studies that detect significant harm to patients taking a drug may be less likely to be published than those that find an advantage to taking the drug. If studies in the opposite direction from that desired also go missing from a meta-analysis database, then the fail-safe number cannot be reliably interpreted.
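One widely used version of this calculation is Rosenthal's, sketched below as an illustration of the idea (the exact method the book has in mind may differ, and the P-values here are invented). It converts each published study's one-tailed P-value to a Z-score, combines them, and solves for the number of hypothetical unpublished studies averaging Z = 0 that would make the combined one-tailed test non-significant.

    import numpy as np
    from scipy.stats import norm

    def fail_safe_number(one_tailed_p, alpha=0.05):
        """Rosenthal's fail-safe N: the number of unpublished Z = 0 studies
        needed to make the combined test non-significant at `alpha`."""
        z = norm.isf(np.asarray(one_tailed_p, dtype=float))  # one-tailed P -> Z
        k = len(z)
        z_alpha = norm.isf(alpha)                            # 1.645 for alpha = 0.05
        return max(0.0, (z.sum() / z_alpha) ** 2 - k)

    # Invented one-tailed P-values from five published studies:
    print(fail_safe_number([0.002, 0.01, 0.04, 0.20, 0.35]))

Comparing the result with the number of studies actually included implements the rule of thumb in the text: a fail-safe number near the number of published studies is a warning sign.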
How to make your paper accessible to meta-analysis

Many published papers do not report enough information for meta-analysts to extract the numbers that they need. As a result, many otherwise relevant papers have to be discarded. This difficulty can be avoided by a few simple changes in the way information is presented. Here are some suggestions.

■ Always give estimates of the sizes of effects and provide their standard errors. Much too often, the size of the effect (e.g., the odds ratio, the correlation coefficient, etc.) is not given, and an author presents only a P-value. Also, give estimates of the mean and standard deviation of the important variables, as both may be needed for some effect-size calculations. It is surprising how often estimates of means and the sizes of effects are not given, even though we care far more about the size of the effect than about the P-value.

■ Give the values of your test statistics and the number of degrees of freedom. The degrees of freedom are essential for most meta-analysis calculations. In particular, the degrees of freedom are needed for calculating a weighted average of effect size across studies. Similarly, the values of the test statistics are needed for some methods used to calculate effect sizes.

■ Make the data accessible. Publish the raw data in the paper or in an online archive, such as datadryad.org. If the data are available for scrutiny by others, the information in the study can be faithfully transcribed into future meta-analyses.

Summary

■ Most scientific questions have been addressed by more than one published study. Meta-analysis is a general name for methods that quantitatively combine information from multiple studies that address the same question.

■ Meta-analysis increases power and decreases the Type II error rate.

■ An effect size is a standardized measure of the results of a study, used for all studies included in a meta-analysis. Common effect sizes include odds ratios, correlation coefficients, and standardized mean differences.

■ The studies easily found in the literature are not a random sample of all studies done. Because of publication bias, the mean effect size in collections of published studies will often be larger than the true mean. Because meta-analysts have more difficulty finding unpublished studies than published studies, the values estimated via meta-analysis can be biased. This difficulty is called the file-drawer problem.

■ One measure of the possible effect of publication bias is the fail-safe number. The fail-safe number is the number of hypothetical, unpublished studies failing to reject the null hypothesis that would need to be added to the meta-analysis to change its results.

■ The confidence intervals of estimates of effect sizes are narrower, usually much narrower, for a meta-analysis than for its component studies, because the meta-analysis uses more information. On the other hand, meta-analysis can give estimates of effect size that are more biased than those from the larger studies it summarizes, because of publication bias.

■ Meta-analysis can also investigate the influence of study methodology and other variables not studied in the original articles.

■ To make your study amenable to meta-analysis, provide exact calculations of effects, standard errors, standard deviations, test statistics, and degrees of freedom.

Quick Formula Summary

Mantel-Haenszel estimate of odds ratios from combined studies

What is it for? To estimate a confidence interval for an odds ratio based on data from multiple studies.

What does it assume? Data in each study are from random samples, there is no publication bias, and the odds ratio in the population is the same for all studies.

Formula:

$$\widehat{\mathrm{OR}}_{\mathrm{MH}} = \frac{\sum_{i=1}^{S} a_i d_i / n_i}{\sum_{i=1}^{S} b_i c_i / n_i},$$

where $a_i$, $b_i$, $c_i$, and $d_i$ are defined for each study i according to the following table, S is the number of studies, and $n_i = a_i + b_i + c_i + d_i$.

Response variable | Treatment (explanatory variable) | Control (explanatory variable)
Success | $a_i$ | $b_i$
Failure | $c_i$ | $d_i$

The confidence interval for a Mantel-Haenszel estimate should be calculated with the aid of a computer.

Mantel-Haenszel test

What is it for?
To test the null hypothesis that the odds ratio estimated in multiple studies equals one.

What does it assume? Data in each study are from random samples, there is no publication bias, and the odds ratio in the population is the same for all studies.

Test statistic: $\chi^2_{\mathrm{MH}}$

Distribution under H0: χ² distribution with one degree of freedom.

Formula:

$$\chi^2_{\mathrm{MH}} = \frac{\left[\,\left|\sum_{i=1}^{S}\left(a_i - \frac{(a_i + b_i)(a_i + c_i)}{n_i}\right)\right| - 0.5\,\right]^2}{\sum_{i=1}^{S}\dfrac{(a_i + b_i)(a_i + c_i)(b_i + d_i)(c_i + d_i)}{n_i^{2}(n_i - 1)}},$$

where the terms are defined as for the Mantel-Haenszel estimate of odds ratios and | | denotes the absolute value of its contents.
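Both Quick Formula Summary entries translate directly into code. The sketch below is our own (the counts are invented, and a vetted statistical package should be preferred in practice, not least because it will also supply the confidence interval). Each study contributes one 2 × 2 table (a, b, c, d) laid out as in the table above.

    import numpy as np
    from scipy.stats import chi2

    def mantel_haenszel(tables):
        """tables: sequence of (a, b, c, d) counts, one 2 x 2 table per study.
        Returns the pooled odds-ratio estimate OR_MH and the P-value of the
        continuity-corrected Mantel-Haenszel chi-squared test (df = 1)."""
        a, b, c, d = (np.array(col, dtype=float) for col in zip(*tables))
        n = a + b + c + d
        or_mh = np.sum(a * d / n) / np.sum(b * c / n)
        expected_a = (a + b) * (a + c) / n
        var_a = (a + b) * (a + c) * (b + d) * (c + d) / (n**2 * (n - 1))
        chi2_mh = (abs(np.sum(a - expected_a)) - 0.5) ** 2 / np.sum(var_a)
        return or_mh, chi2.sf(chi2_mh, df=1)

    # Two invented studies: (success-treatment, success-control,
    #                        failure-treatment, failure-control)
    print(mantel_haenszel([(15, 5, 85, 95), (8, 4, 42, 46)]))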
PRACTICE PROBLEMS

Note: Some of these problems address topics related to publication bias, as discussed in Interleaf 10.

1. Give two reasons why researchers are more likely to publish results when P < 0.05 than when P > 0.05.

2. The accompanying funnel plot shows the results of studies that estimated the heritability of various traits in a large number of species. Heritability is the proportion of the variation among individuals in a population that can be explained by genetic differences among individuals. Because it is a proportion, heritability is a unitless number between zero and one. The methods of estimating heritability, however, allow estimates outside of these limits. The curve on this graph marks the critical values with α = 0.05 for the null hypothesis that the population heritability equals zero. a. Look carefully at this funnel plot. Is there any evidence for publication bias? b. What kinds of decisions made by researchers and editors are likely to be behind the pattern?

3. The meta-analysis on the relationship between testosterone and aggression described in Example 21.4 also included several studies on whether success in certain sports, such as tennis, was correlated with testosterone levels. Discuss the value of including these studies in this particular meta-analysis.

4. The file selection.csv, available online,5 contains the effect sizes and sample sizes from a meta-analysis of studies measuring the strength of natural selection on morphological traits in different wild animal and plant populations (Kingsolver et al. 2001). Make and examine a funnel plot of these results. (You don't need to draw the curve showing statistical significance.)

5. Choose a paper from the scientific literature on a topic that you are interested in. Identify an interesting question that the paper addresses statistically, and attempt to extract from the paper the necessary information for a meta-analysis. What is the sample size? What is the effect size? What test statistic was used, and what was its value?

6. A hypothetical meta-analysis of the effects of a new drug on the reduction in size of ovarian cancer tumors finds that no published studies failed to find a statistically significant effect of the drug, and that the average effect of the drug is small in the largest studies but larger in the smaller studies. The authors found that the mean effect size was large, and they interpreted this as evidence for a large benefit of the drug. Why might you doubt the results of this meta-analysis?

7. Why might meta-analysis increase the Type I error rate, relative to individual studies?

ASSIGNMENT PROBLEMS

8. A meta-analysis of 25 studies testing the effects of a new enzymatic treatment on pityriasis (a common skin disease) finds a significant effect of the treatment. The fail-safe number was calculated to be 5. a. How much confidence should you give the results of this meta-analysis? b. If the fail-safe number had been 1500, how would you feel about the results? Explain your reasoning.

9. In what ways is meta-analysis an improvement over a traditional review of the literature on a research topic?

10. What measure of effect size would most likely be used in each of the following meta-analyses? a. A meta-analysis on the effect of the drug ibuprofen on the frequency of individuals contracting kidney disease. b. A meta-analysis of whether excluding parasites from a population affects mean growth rate. c. A meta-analysis of the association between human height and life span. d. A meta-analysis comparing the survival of parasitized and unparasitized birds. e. A meta-analysis comparing the difference in body size of males and females across species.

11. A meta-analysis of the effects of predators on their prey populations collected the results of 45 studies (Salo et al. 2007). The effect size used to describe these studies was Hedges' g, similar to the SMD. As one part of the meta-analysis, the researchers discovered that predators that had been artificially introduced by humans ("alien predators") had larger effects on prey than did native predators. None of the 45 individual studies had compared alien and native predators. a. What kind of variable is the classification of predators as alien or native? (Answer using the specific language of meta-analysis.) b. Australia has had a disproportionately large number of alien species introduced, with great ecological damage. As a result, most of the studies compared in this meta-analysis of alien predators were done in Australia. Comment on how this should affect our understanding of the results. c. None of the original studies compared alien and native predators. Explain how meta-analysis made it possible to answer a novel question based on these studies, even though none of the individual studies addressed this question.

12. In most meta-analyses, some studies find answers that are opposite to those found by the average of all studies. Give two reasons for this phenomenon.

Footnotes

Chapter 1 1. "The diagnosis of high-rise syndrome is not difficult. Typically, the cat is found outdoors, several stories below, and a nearby window or patio door is open." (Ruben 2006). 2. In biology, a "blood sample" or a "tissue sample" might refer to a substance taken from a single individual. In statistics, we reserve the word sample to refer to a subset of individuals drawn from a population. 3. Other than by the removal from the population of those individuals already selected, which prevents them from being sampled again. 4. Methods are available for more complicated sampling designs incorporating non-random sampling, but we don't discuss them in this book. 5. For example, one is available at www.random.org. 6. We biologists are generally happier to find such flaws in other researchers' data than in our own. 7. The absolute frequency is the number of times that a value is observed. The relative frequency is the proportion of individuals that have that value. 8. Beak depth is the height of the beak near its base. 9. Quoted in J. R. Newman, The World of Mathematics (New York: Simon & Schuster, 1956). Chapter 2 1. Serotonin is a neurotransmitter in most animals, including humans. Some antidepressant drugs improve feelings of well-being by manipulating serotonin levels. 2. If you are in doubt, load your graphic file into a colorblindness simulator such as Vischeck (vischeck.com). 3.
This pattern is a remarkably general one in nature, found in many types of organisms. Typically, only a few species are common, whereas most species are rare. 4. The nomenclature of skew seems backward to many people. Focus on the sharp tail of the distribution extending to the left in the third distribution in Figure 2.2-4. We say it is skewed left (or has a negative skew) because it seems to have a “skewer” sticking out to the left toward negative numbers, like the skewer through a shish kebab. 5. The two distinct body-size classes in this salmon population correspond to two age groups. 6. Malaria is a common cause of death in humans, but avian forms of the disease are even more prevalent in many bird species. For example, many native bird species in Hawaii are threatened with extinction after the inadvertent introduction of mosquitoes and avian malaria by humans in the 19th century. 7. Mysteriously, oxygen levels in the blood of highland Ethiopian men are just as high as those in men living at sea level (data not shown), despite their similar concentrations of hemoglobin. The physiological mechanism behind this feat is not yet known. 8. Vaccination rates dropped below 85% in the years after a 1998 publication that appeared to link the measles, mumps, and rubella vaccine to an increased risk of autism (e.g., see Gilmour et al. 2011). This controversial conclusion was refuted in subsequent reviews (e.g., Canadian Paediatric Society 2007). The original article has since been retracted by most of its authors and by the medical journal that published it. In 2010, the General Medical Council of the U.K. found the senior author of the research guilty of ethical breaches, dishonesty, and conflict of interest and banned him from practicing medicine. 9. Data taken from http://en.wikipedia.org/wiki/2010_Atlantic_hurricane_season and similar articles for other years. Chapter 3 1. See films of these snakes flying at http://www.flyingsnake.org/video/video.html. 2. We have adopted the simple convention of using uppercase letters (e.g., Y) when referring to both variable names and data, and prefer to distinguish the two by context. This is a departure from mathematical convention, which reserves uppercase exclusively for random variables. 3. We could have averaged the absolute values of the deviations instead, to yield the mean absolute deviation. Averaging the square of the deviations is more common because the result, the variance, has many more useful mathematical properties. 4. The reason is that the sample mean is itself calculated using each data point. Therefore, the measurements in the sample are slightly closer on average to the sample mean than they are to the true population mean. This causes a bias that is corrected by dividing by n−1 instead of by n. 5. Don’t be surprised if your computer program gives slightly different values from ours for the quartiles and the interquartile range. The method given here is simple to calculate, but it does not give the most accurate estimates of the population quantities. Several improved methods are available (Hyndman and Fan 1996). 6. Some computer programs extend whiskers all the way to the most extreme values on each end and do not indicate extreme values with isolated dots. There is no universally agreed-upon method for drawing whiskers. 7. Mutations in the same gene in humans cause the loss of hair, teeth, and sweat glands. 8. When listing descriptive statistics in tables, put the same quantity calculated on different groups into one column. 
Numbers stacked in a single column are easier to compare than numbers placed side by side in a row. Exchange the columns and rows in Table 3.3-1, and you'll see what we mean. 9. The "hat" in $\hat{p}$ is used to indicate an estimate of the true proportion p. 10. Rigor mortis is the muscular stiffening that occurs after death. It is caused by linkages forming between actin and myosin in muscle when muscle glycogen is depleted, pH drops, and the level of ATP falls below a critical level. 11. In monogamous vole species, single males and females form stable mating pairs. In promiscuous voles, no stable pairs form and voles might mate with multiple partners. Chapter 4 1. We will often use Greek letters (like μ and σ) to describe parameters and roman letters (e.g., $\bar{Y}$ and s) for their estimates. In addition, it is common to put a circumflex, or "hat" (ˆ), over a variable name to show it is an estimate (e.g., $\hat{p}$ is an estimate for p). 2. The photo at the beginning of this chapter is a crystal of pure DNA. 3. We used the largest known transcript of each gene in release 35 of the human genome (http://www.ensembl.org/Homo_sapiens). Accessed Dec 2, 2005. 4. The longest human gene, with nearly 100,000 nucleotides, encodes the gigantic protein titin, which is expressed in heart and skeletal muscle. The protein was named for the titans of Greek mythology, giants who ruled the earth until overthrown by the Olympians. Some mutations in the titin gene cause heart muscle disease and muscular dystrophy. 5. All genes were listed in a file, one gene per line, with lines numbered from 1 to 20,290. We then used a computer program to generate 100 random integers between the values 1 and 20,290, without allowing duplicates. Each random number was then used to draw a gene according to its line number in the list of genes. 6. Notice how the shape of the sampling distribution resembles that of a normal distribution, or "bell curve," more so than the distribution of gene lengths themselves. Most of the methods presented in this book rely on the normal distribution to approximate the sampling distribution of an estimate. 7. Redrawn from a figure at http://www.pnl.gov/env/Helpful2.html. Accessed January 15, 2006. 8. See video at http://rsbl.royalsocietypublishing.org/content/suppl/2009/06/03/rsbl.2009.0311.DC1/rsbl20090311supp3.m Accessed February 3, 2014. 9. http://www.itl.nist.gov/div898/handbook/prc/section1/prc14.htm. Accessed February 3, 2014. 10. We'll keep the specific example fictional, to avoid singling out one set of authors unfairly, but this is based on examples in the literature. 11. The term pseudoreplication was coined by Hurlbert (1984) in an extremely readable exposé of the shocking ubiquity of pseudoreplication in biology. Chapter 5 1. Believe it or not, the word "probability" has other definitions, even within statistics. In one alternative, "probability" refers to a subjective state of belief by the researcher about the truth. For example, "There's a 95% probability that I turned the gas off before we left for vacation." In this book, though, we define probability only as a proportion. 2. Don't panic! We won't ask you to carry out this integration in this book. 3. This excess of boys is a highly repeatable pattern, measured over tens of millions of births. More boys than girls are born. The fraction of males is even higher at conception than at birth, but this fraction declines during pregnancy because male fetuses die at a higher rate than female fetuses.
It is thought that sperm bearing a Y chromosome might swim faster—and thus reach the egg sooner—than X-bearing sperm, which would account for the excess of boys. 4. Answers: Pr[at least one girl] = 0.738; Pr[at least one boy] = 0.762; Pr[both same sex] = 0.500. 5. Wasps, like ants and bees, have a very different mechanism than humans for determining the sex of their offspring. All a female has to do to determine the sex of an egg at the time of laying is to control whether or not she fertilizes it with the sperm she has stored. If she fertilizes it, it becomes a female. If not, it's a male. 6. The value to her of sons has risen in the second case because there are now plenty of unrelated females to mate with. 7. This theorem is named after its discoverer, the Reverend Thomas Bayes, an 18th-century English Presbyterian minister. 8. Actually, later studies showed that smokers were about 23 times more likely than nonsmokers to get lung cancer, but for the purpose of this problem, we'll use the numbers given in this study. Chapter 6 1. In this book the P-value is denoted by an uppercase, italicized P, which stems from the word "probability." Don't confuse the P-value with the lowercase p, used here to indicate the proportion of right-handed toads in the population. 2. At the same time, let's not get carried away. A P-value of 0.051 is not much different from a P-value of 0.049. The boundary of 0.05 should be seen as a guide to interpretation, not as a clear boundary between truth and fiction. 3. In a few cases, the most powerful test method and the best method for calculating a confidence interval for the parameter use different approximations or slightly different sampling distributions, so they would not yield the identical answer in every instance. 4. Globe and Mail, November 15, 2005. Chapter 7 1. For example, if we are measuring the fraction of hurricanes that hit Florida in a given year, then we might call a hurricane hit a success and a miss a failure. We are clearly not cheering for the hurricane; the categories are for convenience only. The terms have their origins in gambling, where success and failure correspond more appropriately to winning and losing, respectively. 2. This term is also called the binomial coefficient. 3. Some simple multiplication tricks will keep your calculator from maxing out. For example, $\frac{20!}{18!\,2!}$ can be rewritten as $\frac{20 \times 19 \times 18!}{18!\,2!}$, which simplifies to $\frac{20 \times 19}{2!}$, because 18! cancels out. 4. New, recessive mutations are expressed only in males if they occur on the X chromosome, because they are not masked by the old allele at a companion chromosome. As a result, recessive mutations that benefit males are more likely to be seen by natural selection if they are on the X chromosome than if they are on other chromosomes. 5. Some computer programs for the binomial test give a slightly different value for P for these same data, because they calculate the probability of extreme results at both tails using a different method. 6. A few textbooks use (n − 1) rather than n in the denominator of the formula for $\mathrm{SE}_{\hat{p}}$. Using (n − 1) creates an estimate of the standard error that is less biased, but using n gives an estimate that is closer to $\sigma_{\hat{p}}$, on average. 7. See http://www.straightdope.com/columns/read/374/did-john-wayne-die-of-cancer-caused-by-aradioactive-movie-set. 8. A clade is a group of species all descended from the same ancestor. For example, all the rodents form a clade. Sister clades are two clades that are each other's closest relatives.
For example, rodents and rabbits are each other's closest relatives, so they are sister clades. 9. See the video at http://viscog.beckman.uiuc.edu/grafs/demos/15.html. 10. http://www.counton.org/thesum/issue-07/issue-07-page-05.htm. The cause is probably not the butter, since writing a "B" on the top of a piece of toast also yields an excess of outcomes with the B-side down. 11. http://www.youtube.com/watch?v=k-Fp7flAWMA&list=SP45865A763BAB32CA&index=5. 12. Fisher was a lifelong smoker who consulted with the Tobacco Manufacturers' Standing Committee at the time he made this claim. Chapter 8 1. In each step of the simulation, assign each of 350 births with equal probability to the 365 days of 1999, count the number of births that fall on each of the seven days of the week, and then calculate χ² as the discrepancy between the simulated frequencies and the frequencies expected under the proportional model. Repeat this procedure a vast number of times, each time recording the χ² value. This procedure yields a close approximation to the sampling distribution of the χ² statistic under H0, provided that the number of repetitions is very large. 2. The number of degrees of freedom for the χ² goodness-of-fit test is calculated as the number of categories minus the number of constraints imposed on the expected frequencies. In a goodness-of-fit test, the sum of the expected frequencies is constrained to be the same as the total number of observations; we always lose a degree of freedom because of this constraint. Additional constraints are imposed if we use the data to generate other numbers needed to calculate the expected frequencies, as Examples 8.5 and 8.6 will show. The logic behind this calculation is that every constraint causes the expected frequencies to be that much more similar to the observed frequencies, and we must adjust the degrees of freedom to compensate. 3. This discrepancy is largely due to scheduled C-sections and induced labor, but these do not explain the effect completely (Ventura et al. 2001). 4. These restrictions are conservative, because the χ² goodness-of-fit test has been shown to work well even with smaller expected values. An alternative rule of thumb is that the average expected value should be at least five (Roscoe and Byars 1971). 5. We used release 35 of the human genome, available on the ENSEMBL website in December 2005 (http://www.ensembl.org/Homo_sapiens). 6. We couldn't resist: the answer is P = 2.64 × 10⁻²⁸⁴. 7. This is not always true. Many statistical packages for computers cheat and use an approximation. 8. Why do we lose another degree of freedom? Using the data to estimate a parameter of the probability model of the null hypothesis (here, the binomial distribution) unfairly improves the fit of the model to the data. We compensate for this by removing one degree of freedom from the test statistic for every parameter we estimate. If we estimate too many parameters, then the number of degrees of freedom would drop to zero and we couldn't do the test. 9. Poisson is famous for saying, "Life is good for only two things, doing mathematics and teaching mathematics," an opinion no doubt shared by most readers of this book. 10. Different mutations in the same gene cause red hair and freckles in humans, white fur in Florida beach mice, white skin in lizards of the White Sands of New Mexico, and coat color variation in dogs and horses. DNA sequences of woolly mammoths have even revealed variation in this gene. 11.
Or if they were especially prone to standing behind their horses . . . 12. This newer study also found that the number of injuries sustained increased with increasing numbers of stories fallen, unlike the study reported in Example 1.2. 13. Indian Statistical Congress, Sankhyā, ca. 1938. 14. "A month in the laboratory can save an hour in the library."—Westheimer's Discovery Chapter 9 1. In medical studies such as this one, the convention is to calculate the odds of the outcome "diseased" or "died" rather than of the outcome "cured" or "survived." When calculating an odds ratio, the convention is also to put the odds for the treatment group in the numerator and the odds for the control group in the denominator. 2. This formula will not work when a, b, c, or d is zero. One approximate solution in this case is to add 1/2 to each of the four cells in the 2 × 2 table. 3. Z is the critical value for a standard normal distribution. The standard normal distribution is discussed in greater detail in Chapter 10. 4. Toxoplasma is known to affect the behavior of rats and mice. Infected rats lose their fear of cats, and in fact may become attracted to cat smells (Vyas et al. 2007). In this way, the parasite has a higher probability of reaching its final host, the cat. 5. In this section, we calculated the expected frequencies for every combination of row and column by assuming that the data are a random sample from the study population. However, the same calculations work for study designs in which the researcher has fixed the number of individuals in each group for one of the two variables, such as the number of women in each of the two aspirin treatments in Example 9.2, provided that each group contains a random sample of subjects. 6. Many parasites modify the behavior of their hosts, with the result that their chances of making it to their next host are increased. Liver flukes make their ant hosts move to the top of grass blades, where they are more likely to be eaten by grazing cows, the host in which the flukes can reproduce. Wire worms make their cricket hosts find and jump into water, where the crickets drown but the worms complete their life cycle. 7. Recall from Section 8.2 that df = (Number of categories) − 1 − (Number of parameters estimated from the data). In a contingency table, the number of categories is r × c. The number of parameters estimated from the data is (r − 1) + (c − 1), reflecting the number of row and column totals needed to generate the expected cell frequencies. After some algebra, this calculation leads to df = (r − 1)(c − 1). 8. His method for doing this was ingenious. It is difficult for humans to distinguish reliably whether cows are in heat, but bulls are very good at it. So, the researcher harnessed paint sponges onto the undersides of the bulls, which each night marked the cows that had been the object of the bull's affections. 9. It has been speculated that, by drinking the blood of cows in heat, the bats minimize their intake of hormones, which are in higher concentration in the blood of non-estrous cows and may act as a birth control pill for bats, preventing reproduction. 10. This beautiful fish is shown on the opening page of this chapter. 11. Almost all of the mortality occurred during the descents. 12. The original paper (Huey and Eguskitza 2000) did not make this mistake. 13. We're not making this up. See http://www.youtube.com/watch?v=q3gxpIJ-f2M, which would look very cute if you didn't know what was going on. Chapter 10 1. "Singleton" means that the baby was not a twin, a triplet, etc. 2. "Standard normal" may sound redundant and repetitive (like "bunny rabbit" or "vim and vigor"), but there are an infinite number of normal distributions, depending on what mean and variance are put into the formula. The standard normal distribution is just one of these. 3. For example, the Excel function NORMDIST(Z,0,1,TRUE) will return the probability under a standard normal distribution of getting a value less than Z. Also see the introduction to Statistical Table B. 4. Not to be confused with "everyday ordinary pervert." You don't often find a jargon term that seems to be both redundant and self-contradictory. 5. NASA has had infamous difficulty in converting between English and metric units. In 1999, the Mars Climate Orbiter missed Mars and flew off into its own independent orbit around the sun because some components of the navigational system used English units and others used metric. Oops. 6. This may be the only time it can be said that 100 babies are less noisy than 10. 7. The proportion specified by the null hypothesis is what's important, not the proportion in the data. 8. One house in Kansas had more than 2000 brown recluse spiders (Sandidge 2003). 9. Hence the old medical adage: "The common cold will go away in a week with proper treatment, but it will take seven days if left untreated." 10. "Placebo" is from the Latin for "I will please." In Chaucer's time, this word was used for a sycophant or flatterer who would say whatever would please the listener rather than the truth. The word was then adopted by 19th-century physicians to refer to a treatment given to please the patient rather than to cure him or her. Slowly, doctors began to realize that in fact placebos had a therapeutic effect that had to be controlled for in medical trials. Chapter 11 1. The statistic t is called Student's t after the nom de plume of the man who first discovered its properties, "Student." (He was more clever at statistics than pseudonyms.) "Student" in reality was William Gosset, an employee of the Guinness Brewing Company in Dublin. Gosset used a pseudonym because Guinness prohibited its employees from publishing, following the unauthorized release of some brewing secrets a few years earlier by another employee. In his honor, Guinness was appointed the official beverage of this book. 2. Data provided by Sam Cotton and Kevin Fowler, University College, London. 3. How did common wisdom get such a basic value wrong for so long? A hint comes from the fact that 98.6°F corresponds exactly to 37°C. The original measures of human body temperature were done on the Celsius scale and rounded to the nearest degree. The estimated mean value of body temperature, 98.25°F, is 36.8°C, which when rounded off gives 37°C. Later, this rounded figure of 37°C was converted back into Fahrenheit, yielding the erroneous 98.6°F. 4. After a rather large number of permits were applied for and granted. 5. https://soundcloud.com/new-scientist/male-koala-bellowWS2_ Chapter 12 1. We will use the term "treatments" in a broad sense to refer to different states or conditions experienced by subjects, not just to formal experimental treatments. 2. We are assuming that you have persuaded an enlightened forest company to go along with this experiment. It is more likely that the forest company has already clear-cut some areas and not others, and you must compare the two treatments after the fact with an observational study.
Both the experimental approach and the observational design yield a measure of the association between clear-cutting and salamander number, but only the experimental study can tell us whether clear-cutting is the cause of any difference (Chapter 14). The study design issues are the same, however: whether to choose forest plots randomly from each treatment (the two-sample design) or to randomly choose forest plots that straddle both treatments, making it possible to compare cut and uncut sites that are side by side (the paired design). 3. A "weighted average" may count each group differently. In this case, each group is weighted by its degrees of freedom, so that the group with the larger sample size (i.e., with more information available) contributes proportionately more to the calculated average. 4. More generally, the null hypothesized value for the difference between the two population means can be any number: H0: (μ1 − μ2) = (μ1 − μ2)0. In this case, we would calculate the test statistic as $t = \frac{(\bar{Y}_1 - \bar{Y}_2) - (\mu_1 - \mu_2)_0}{\mathrm{SE}_{\bar{Y}_1 - \bar{Y}_2}}$ instead. 5. The results of this folly would be χ² = 7.2, df = 1, and P = 0.0071. Thus, we would reject the null hypothesis of no association. 6. Larger follow-up studies have failed to find any difference between mothers and fathers in the degree to which their babies resemble them (Brédart and French 1999). 7. When you're reading scientific reports, interpret error bars on a graph with care. Sometimes error bars are used to show one standard error above and below the estimate, sometimes they show two standard errors, sometimes they show a confidence interval, and sometimes they just show standard deviations. Figure 12.6-1 shows confidence intervals. 8. http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm 9. It has been suggested that death rates can be influenced by the proximity to big events. For example, the death rate dropped a bit leading up to the new millennium and increased a bit afterward. 10. Vampire bats can even run. See http://www.nature.com/nature/journal/v434/n7Q31/extref/434292as2.mov. Chapter 13 1. Two other options are simulation and bootstrapping, which we discuss in Chapter 19. 2. To find the normal quantile, order the observations from smallest to largest and assign each a rank, called i. The smallest data point has i = 1, the next smallest has i = 2, and so on up to the highest data point, where i = n. The estimated proportion of the distribution lying below an observation ranked i is i/(n + 1). The corresponding normal quantile is the standard normal deviate Z having an area under the standard normal curve below it equal to i/(n + 1). For example, if n is 99, the approximate fraction of the probability density lying below i = 95 is 95/(99 + 1) = 0.95. The corresponding normal quantile is 1.64, which has 0.95 of the area under the normal curve below it (and 0.05 above it; see Statistical Table B). 3. Other mathematical operations can also serve as valid transformations as long as there is a one-to-one correspondence between the data on its original scale and on the transformed scale (that is, it must be possible to transform back to get the original value without ambiguity). 4. In other words, if X = sin[Y], then Y = arcsin[X]. On many calculators, arcsine is given as sin⁻¹. It is also known as the angular transformation and is measured in radians. 5. You will sometimes see the square-root transformation with 0 or 1 instead of 1/2. 6.
The geometric mean is calculated by multiplying all data points together and then taking the nth root of the resulting product. 7. These methods are also sometimes called distribution-free. 8. For example, in a number of insect species, male seminal fluid contains chemicals that reduce the tendency of the female to mate again with other males. However, the chemicals also reduce female survival, so females have evolved mechanisms to counter these chemicals. 9. Some statistical packages on the computer use an equivalent test, the Wilcoxon rank-sum test, instead of the Mann–Whitney U-test. The Wilcoxon rank-sum test statistic, W, is calculated differently from U, but both are based on the rank sums and give equivalent results. Don't confuse the Wilcoxon rank-sum test with the Wilcoxon signed-rank test, which is for paired data. 10. But they are disgusting. 11. The null distribution for the U-statistic is not the same when there are ties as when there are no ties, if the ties are between members of different groups. In this case the test is conservative, meaning that using the critical values in Statistical Table E yields a Type I error rate lower than the stated value. Ties also reduce the power of the test, but the effect is not great when there are only a few ties. Corrections for ties exist but are tedious to compute by hand. 12. We can use χ² as our test statistic when we test association between two categorical variables. We can use the correlation coefficient when testing association between two numerical variables. 13. In statistics jargon, this process is called "sampling without replacement." Once a value is sampled, it is removed from the pool, so the same value cannot be chosen again for the same permuted sample. 14. Females lay just one egg per day, and so the parasitism-first behavior would have the effect of shortening the total laying period in her own nest and reducing the total span of time for all her eggs to hatch. This would get the young quickly out of the nest, where they are vulnerable to predation. 15. Male infanticide has strong conservation implications, too. Most lion hunting in Africa is targeted toward the trophy males, and killing these males may provoke a takeover of their pride by other males, resulting in infanticide. 16. Sorry, we're not making this up. 17. The data are also available at whitlockschluter.zoology.ubc.ca. 18. In this question and in some that follow, the sample sizes are often too small to provide powerful tests and are presented as exercises only. 19. The data are also available at whitlockschluter.zoology.ubc.ca. 20. See movie at www.biomedcentral.com/content/download/supplementary/1472-6785-11-30-s8.mpeg. 21. And advance the health of billions of mosquitoes. 22. The actress Angelina Jolie, who is a carrier, chose to have a preventive double mastectomy to reduce the odds. 23. Extrapolated from summary statistics in the original paper. Chapter 14 1. The Random.org website at http://random.org/sequences will also do this. 2. What do you do if, by chance, the first four of eight units are all assigned treatment A and the last four are assigned treatment B, yielding the arrangement AAAABBBB? Some biologists might randomize again to ensure the interspersion of treatments, but that is not strictly legitimate. If the first four units are different somehow from the last four, apart from treatment, then blocking (Section 14.4) should be considered as a remedy. 3.
This result probably overestimates the effects of a lack of blinding, because the experiments without blinding also tended to have confounding problems, such as a lack of randomization (Bebarta et al. 2003). 4. The margin of error is approximately twice the standard error of the difference between sample means (2SE), or $2\sqrt{\sigma^2\left(\frac{1}{n} + \frac{1}{n}\right)} = 2\sqrt{2\sigma^2/n} = \sqrt{8\sigma^2/n}$. Solving for n gives the rule in the text. 5. Other than a possible social dilemma. Chapter 15 1. In the 1998 experiment, participants' eyes had been exposed to low levels of light while their knees were being illuminated. 2. If several samples of the same size are taken from a population, the variance among sample means $\sigma^2_{\bar{Y}}$ is $\sigma^2/n$, where σ² is the variance within populations and n is the number of subjects in each sample (sample size). Remember from Chapter 4 that the standard deviation of the sampling distribution for $\bar{Y}$ is $\sigma/\sqrt{n}$. This is the standard error of the mean, and squaring it gives the variance, $\sigma^2/n$. 3. Unfortunately, analysis of variance has its own special jargon. You need to learn it so that the terms are familiar when you see them used in published articles and in the output of statistical packages on the computer. 4. Continuing the logic of footnote 2, if the null hypothesis is true and we have several samples of size n, then the variance among sample means $\sigma^2_{\bar{Y}}$ equals $\sigma^2/n$, where σ² is the variance within populations. Rewriting, $n\sigma^2_{\bar{Y}} = \sigma^2$. Under the null hypothesis, MS_groups provides an estimate of $n\sigma^2_{\bar{Y}}$, and MS_error is an estimate of σ². In this case, MS_groups will on average be the same as MS_error, and the F-ratio MS_groups/MS_error should equal 1, except by chance. 5. The n_i in the second term comes from the fact that we're adding up over all the n_i individuals in group i. 6. The test statistics and null distributions are basically the same. The F-ratio for two means (having one degree of freedom for groups and "df" degrees of freedom for error) is the same as the square of the two-sample t-statistic having "df" degrees of freedom. 7. For this reason, they are also called a priori comparisons. 8. Testing all pairs of means is not the only kind of unplanned comparison. For example, the Scheffé method tests any linear combination of means, but it is too conservative when applied only to pairs of means. Three other methods to test all pairs of means—Duncan's multiple-range test, Fisher's least-significant-difference test, and the Newman–Keuls test—provide less protection than the Tukey–Kramer test against inflated Type I error rates. The Tukey–Kramer method is also called "Tukey's honestly significant difference (HSD)" test. Unplanned comparisons are also called "post hoc tests," "a posteriori tests," or simply "multiple comparison tests." 9. One or more groups with intermediate means might be assigned two symbols if the Tukey–Kramer results were ambiguous. For example, if it had turned out that the partial-shade mean was not significantly different from the means of either the deep-shade or no-shade groups, yet the deep-shade and no-shade means were different from each other, then partial shade would have been assigned both symbols "a" and "b" to indicate this ambiguity. 10. Some authors refer to fixed-effects ANOVA as "Model 1" and random-effects ANOVA as "Model 2." 11. Fixed-effects ANOVA also has two levels of variation, but only variation within groups is random. Differences among fixed groups are not random. 12.
Methods based on likelihood (Chapter 20) are required if the number of measurements is not identical in every group (see Pinheiro and Bates 2000). Also, $s_A^2$ can sometimes be negative by chance, even though $\sigma_A^2$ can be only zero or positive. When this happens, $s_A^2$ is set to zero. 13. The repeatability is also called the "intraclass correlation." 14. Spaniards have nicknamed the Egyptian vulture "moñiguero," which politely translates as "dung-eater." 15. The repeatability of a trait among the offspring of males born from different females helps estimate the "heritability" of that trait, the fraction of variation in the trait in the population that is genetic rather than environmental. Differences among males indicate a genetic component, because they were randomly mated and their offspring were raised in a common (lab) environment. Chapter 16 1. Modified from Jerison (2006). 2. The linear correlation coefficient is usually called simply the "correlation coefficient." It is also called Pearson's correlation coefficient after Karl Pearson, who first defined it. Pearson was one of the founders of the modern field of statistics (see Interleaf 1). 3. Be careful: Don't mix up the Greek letter ρ with the roman letter p, which we use to represent proportion (Chapter 7). 4. We provide a shortcut formula for this quantity in the Quick Formula Summary (Section 16.8 at the end of the chapter) to help when you are doing calculations by hand. 5. Another way to write the equation for the correlation coefficient is $r = \frac{\mathrm{Covariance}(X,Y)}{s_X s_Y}$, where the covariance of X and Y is in the numerator, and the standard deviations of X and Y are in the denominator. The covariance of X and Y is a measure of their relationship, analogous to the variance of a single variable. Its formula is provided in the Quick Formula Summary (Section 16.8). 6. Testing the more general null hypothesis H0: ρ = ρ0, where ρ0 is a number other than zero, requires a different method that makes use of Fisher's z-transformation. We do not present it here (see, e.g., Sokal and Rohlf 2008). 7. There are n independent data points when calculating a correlation coefficient, but there are two fewer degrees of freedom because we have to use two summaries of the data, $\bar{X}$ and $\bar{Y}$, when we calculate r. 8. Different computer programs might yield slightly different P-values for the same data, depending on the approximation used to compute it. 9. According to the authors of the report, one eyewitness claimed to have seen the trick performed and had taken a photograph. Examination of the photograph showed only a boy balancing on the end of a long pole. Chapter 17 1. Modified from Prugnolle et al. (2005). 2. Another way to write the formula for the slope is $b = \frac{\mathrm{Covariance}(X,Y)}{s_X^2}$. The term in the numerator is the covariance between X and Y, and the term in the denominator is the variance in X. The formula for the covariance can be found in the Quick Formula Summary of the correlation chapter (Section 16.8). 3. To draw the line by hand, plot two points given by the regression equation and use a ruler to connect them. Two convenient points are the intercept (the predicted Y at X = 0) and the mean, because the regression line always passes through the point $(\bar{X}, \bar{Y})$. Don't extend the actual line to pass through the Y-intercept at X = 0 if this point lies beyond the range of X in the data (as in Figure 17.1-4). 4.
4. The degrees of freedom are n − 2 rather than n − 1, because we couldn't calculate the predicted values $\hat{Y}_i$ without first calculating two other quantities using the data—namely, the slope and the intercept of the regression line.
5. $R^2$ is sometimes written as $r^2$. In regression with only one explanatory variable, the $R^2$ value is the same as the square of the correlation coefficient, r.
6. Regression toward the mean happens even when the overall mean changes from the first to the second measurement.
7. Regression was named for this property by Galton (1886), who studied heights of parents and their grown children. He noticed that tall fathers tended to have sons shorter than their fathers on average, while the reverse was true for short fathers. Galton was concerned at first that this tendency would eventually eliminate variation in height in the population and prevent evolution. He later realized that, in each generation, very tall (and very short) individuals are still present, but they are not necessarily the offspring of the tallest (or shortest) parents.
8. Not all dose-response data are binary. For example, the response variable Y is often the fraction of individuals in a unit that die, where a unit refers to a randomly sampled group of subjects, such as the members of a family, a petri dish, an aquarium, or a forest plot. Such data might be analyzed with ordinary linear regression after making a suitable transformation.
9. Log-odds(Y) = ln[Y/(1 − Y)]. See Section 9.2, which used the symbol p in place of Y. Log-odds(Y) is also known as the logit function.
10. Independence is a perilous assumption when comparing species (or other taxa), because species are related to one another to varying degrees in the phylogeny (see Interleaf 11).
11. The authors of the study used a method to correct for non-independence introduced by phylogeny (see Interleaf 11), but their prediction was similar.
12. A reply to this study by an art historian (Rich 2011) pointed out that what may have been changing was head size, not portion size. Apparently artists at that time used size as an indication of importance, so they depicted things like Jesus's head in the Last Supper as larger than life. However, later artists drew head sizes more realistically.
Chapter 18
1. To analyze categorical data with a linear model, your computer package converts the categories to numerical variables that indicate the groups to which every subject belongs. This behind-the-scenes trick allows categorical variables to be analyzed in the same way as numerical variables.
2. You would provide the real variable names from your data spreadsheet when generating the model statement on the computer. For example, if the treatment variable was named "LightTreatmentPosition" in your data set, then this is what you should use in the model statement instead of TREATMENT. Some packages might not require you to type the CONSTANT term in the model statement, for the sake of brevity, but a constant is nevertheless included in the analysis.
3. In single-factor ANOVA, the predicted value for every data point under the null model is just the grand mean, $\bar{Y}$. The predicted value for an observation under the alternative model is the mean for its group, $\bar{Y}_i$.
4. At least two data points are needed for each combination of two categorical variables before an interaction between the variables can be fitted.
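Footnotes 5 and 9 of Chapter 17 can likewise be confirmed in R; a minimal sketch with made-up data (our own numbers):
# Footnote 5: with one explanatory variable, R^2 equals the squared correlation
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.1, 1.9, 3.4, 3.8, 5.2, 5.8)
summary(lm(y ~ x))$r.squared  # R^2 from the regression output
cor(x, y)^2                   # identical: square of the correlation coefficient
# Footnote 9: the log-odds (logit) transformation
logodds <- function(Y) log(Y / (1 - Y))
logodds(0.75)                 # ln(0.75 / 0.25) = ln(3), about 1.10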
5. There is a difference of opinion on the appropriate method for testing a main effect. Under the method we use here (called "Type 3"), the improvement in fit when a main effect is added to the model is measured against a null model that includes all other terms in the model, including any interactions. Some statisticians recommend an alternative method ("Type 1") in which the improvement in fit for a main effect is measured against a null model that includes only those terms that appear before it in the model statement, with interactions always appearing last. In this case the order in which terms appear in your model statement might affect the outcome of F-tests for main effects. Different statistics packages on the computer adopt one or the other of these approaches as their default without necessarily highlighting this fact or making it obvious from the output (e.g., JMP uses Type 3, whereas R uses Type 1). None of this affects the fit of the full model to the data or the test of interaction effects.
Chapter 19
1. We can't say P = 0, because we might find a more extreme value from the null distribution if we ran more simulations.
2. The term supposedly derives from the adventures of Baron Munchausen, who found himself at the bottom of a hole in the ground, lost until he had the idea of picking himself up by his own bootstraps. (Rumored to be from The Travels and Surprising Adventures of Baron Munchausen Illustrated with Thirty-seven Curious Engravings from the Baron's Own Designs and Five Illustrations, by G. Cruikshank.) Efron (1979) mentions that he was tempted to call this method the "shotgun" because it could "blow the head off any problem if the statistician can stand the resulting mess."
3. The most common use of the bootstrap in biology is to calculate the uncertainty of estimates of phylogenies—the evolutionary relationships of a sample of species (see Practice Problem 6 for an example).
4. In statistics jargon, this is called "sampling with replacement."
5. A common mistake is to calculate the standard error of the bootstrap estimates by dividing the standard deviation by the square root of the number of bootstrap replicates, by analogy with the standard error of the mean. This results in a quantity that is much too small to be the bootstrap standard error.
6. The exception is when the estimate itself is biased. The bootstrap can be used to estimate and correct for bias, but we don't present the method here.
7. Consult the excellent book An Introduction to the Bootstrap, by Efron and Tibshirani (1993), for more details and other options.
8. The 95% confidence interval for mean asymmetry is similarly broad, but it doesn't overlap zero.
9. In this question and in ones that follow, the sample sizes and numbers of bootstrap replicates are often too small to provide accurate bootstrap calculations and are presented as exercises only. This question calls for the 80% confidence interval only to prevent you from having to roll the die too many times. In general, results based on bootstraps use the same levels of confidence as more traditional statistics.
Chapter 20
1. Huxley was known as "Darwin's bulldog" for his vigorous support of Darwin's theory of evolution.
2. The arbitrary convention in gene-mapping studies is to use the base 10 logarithm rather than the natural log. The resulting quantity is called the LOD score.
3. Remember that $\binom{n}{Y}$ is called "n choose Y" and is shorthand for $\frac{n!}{Y!\,(n-Y)!}$. The symbol n! represents "n factorial."
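Looking back at Chapter 19, footnotes 4 and 5 translate into only a few lines of R. A minimal sketch with made-up data (the sample and the number of replicates are our own choices):
# Bootstrap standard error of a sample median
y <- c(3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 9.9)  # made-up sample
B <- 10000                                 # number of bootstrap replicates
boot_medians <- replicate(B, median(sample(y, replace = TRUE)))  # resampling with replacement
sd(boot_medians)              # the bootstrap SE is the SD of the replicate estimates
# sd(boot_medians) / sqrt(B)  # the mistake warned against in footnote 5; much too small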
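Footnote 3 of Chapter 20 can also be tried directly in R, which provides both the binomial coefficient and its logarithm (useful for the very large values that logs help tame):
# "n choose Y" directly and on the log scale
choose(32, 23)        # 28048800 ways of getting 23 successes in 32 trials
lchoose(32, 23)       # its natural log, safe even when choose() would overflow
exp(lchoose(32, 23))  # back-transforming recovers the original value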
4. One reason is that some likelihoods can be so small as to push the lower limits of your calculator and even your computer. Using logs also makes it easier to evaluate quantities such as $\binom{32}{23}$, which might push the upper limits of computer memory. With logs, multiplication becomes addition: $\ln[A \times B] = \ln[A] + \ln[B]$, and powers become multiples: $\ln[A^B] = B \ln[A]$.
5. Log-likelihood curves based on simple probability models usually have only one peak, but there can be more than one peak in more complex models. It is important to check the full range of possible values for the parameter to be sure you find the highest peak.
6. Notice that the maximum likelihood estimate of $\hat{p} = 0.72$ is the same as the conventional estimate for the proportion, Y/n = 23/32 = 0.72. This is no coincidence: Y/n is the formula for the maximum likelihood estimate of a population proportion. This shortcut could have spared us work, except that we wanted to show the general approach for finding a maximum likelihood estimate.
7. This probability distribution is called the hypergeometric distribution, which gives the probability of a given number of individuals of a particular type from a sample of known size and a population of known size, assuming that the individuals are sampled without replacement.
8. For small samples, simulation can be used to find the null distribution of G (see Chapter 19).
9. The researchers later determined that the chemical cue that the wasps use to distinguish mated from unmated females is benzyl cyanide, which the male butterfly passes to the female during mating. The compound is an "anti-aphrodisiac," rendering the mated female less attractive to other male butterflies (Fatouros et al. 2005).
10. The variable on the vertical axis is the log-likelihood after adding a constant to each log-likelihood value, so that the maximum is zero. This doesn't change any of the ensuing calculations.
Chapter 21
1. Linus Pauling is the only person to have won two unshared Nobel Prizes, Chemistry in 1954 and Peace in 1962, so his behavior does not represent that of a crank.
2. One of the "rambunctious" frat houses was described as "only standing because it was made out of steel and concrete." At one of the "responsible" frats, they "talked a lot about computers and calculus."
3. The SMD is often called Cohen's d (Borenstein et al. 2009). A similar measure, Hedges' g, includes a correction for small sample size.
4. Calculation details of the fail-safe number, also called the "file-drawer method," can be found on p. 405 of Cooper and Hedges (1994).
5. whitlockschluter.zoology.ubc.ca
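Footnote 6 of Chapter 20 above can be reproduced in a few lines of R; a minimal sketch using the 23-out-of-32 proportion from that footnote (the grid of candidate values is our own choice):
# Binomial log-likelihood over a grid of candidate proportions p
Y <- 23; n <- 32
p <- seq(0.01, 0.99, by = 0.001)
loglik <- dbinom(Y, size = n, prob = p, log = TRUE)  # log = TRUE avoids very small likelihoods
p[which.max(loglik)]  # about 0.719, matching the shortcut Y/n = 23/32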
Statistical tables
This appendix gives probabilities and critical values for a few of the most commonly used probability distributions. More can be found in such references as Rohlf and Sokal (1994), Statistical Tables.
Statistical Table A: The χ2 distribution
Statistical Table B: The standard normal (Z) distribution
Statistical Table C: Student's t-distribution
Statistical Table D: The F-distribution
Statistical Table E: Mann-Whitney U-distribution
Statistical Table F: Tukey-Kramer q-distribution
Statistical Table G: Critical values for the Spearman's rank correlation
Using statistical tables
Statistical tables are used to calculate probabilities and to put bounds on the P-value for a test statistic. They are also used to find the critical values for calculating confidence intervals. With easy access to computers becoming more common, statistical tables are becoming less necessary. One advantage of using computer programs is that they provide precise P-values, whereas statistical tables often only bracket a P-value (such as "0.025 < P < 0.05"). Moreover, the tables in this book only give critical values corresponding to the most commonly used significance levels, such as 0.05 and 0.01. Statistical tables are useful when computers are unavailable, or when putting bounds on P is satisfactory. For some of the simpler distributions, we give instructions for calculating more exact probabilities in both the free statistical package R and the spreadsheet program Excel, on the introductory page of each table.
Some probability distributions depend on the degrees of freedom. Examples include the χ2 distribution, the Student's t-distribution, and the F-distribution. In such cases, the statistical tables in this book provide quantities for many but not all values of the degrees of freedom. What should you do when the probability distribution corresponding to a specific number of degrees of freedom is missing from the tables? For example, you might be looking for critical values of the t-distribution having 149 degrees of freedom, but Statistical Table C provides such quantities only for t-distributions having 140 and 160 degrees of freedom. In this case, there are two approaches to obtaining the missing critical values, as follows:
1. The conservative approach: Find the two critical values in the table corresponding to distributions having degrees of freedom just above and just below the desired degrees of freedom. Use the critical value that makes your confidence interval widest, or that makes it most difficult to reject the null hypothesis of your test.
2. Interpolation: Find the two critical values in the table that correspond to the degrees of freedom just above and just below the desired number of degrees of freedom. Use linear interpolation to estimate the critical value for your desired degrees of freedom.
For example, let's say that you want to find the critical value for the F-distribution having five degrees of freedom in the numerator and 132 degrees of freedom in the denominator, $F_{0.05(1),5,132}$. Statistical Table D only gives $F_{0.05(1),5,100} = 2.31$ and $F_{0.05(1),5,200} = 2.26$. Because 132 lies between 100 and 200, the critical value must lie between 2.31 and 2.26. The following figure shows how to find the critical value by linear interpolation. The desired critical value (C) is obtained as $C = 2.31 + (2.26 - 2.31)\left(\frac{132 - 100}{200 - 100}\right) = 2.294$. The general formula for calculating a critical value from interpolation is $C = C_{\mathrm{low}} + (C_{\mathrm{high}} - C_{\mathrm{low}})\left(\frac{df - df_{\mathrm{low}}}{df_{\mathrm{high}} - df_{\mathrm{low}}}\right)$, where $df_{\mathrm{low}}$ and $df_{\mathrm{high}}$ are the degrees of freedom just below and just above df, the desired number of degrees of freedom, and $C_{\mathrm{low}}$ and $C_{\mathrm{high}}$ are their corresponding critical values.
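Because this interpolation comes up for several of the tables, it may help to see it as code. A minimal R sketch of the general formula above (the function name is ours):
# Linear interpolation of a critical value between two tabled degrees of freedom
interp_crit <- function(df, df_low, df_high, C_low, C_high) {
  C_low + (C_high - C_low) * (df - df_low) / (df_high - df_low)
}
interp_crit(132, 100, 200, 2.31, 2.26)  # 2.294, as in the example above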
Statistical Table A: The χ2 distribution
This table gives critical values of the χ2 distribution. The critical value $\chi^2_{\alpha,df}$ defines the area α in the right tail of the χ2 distribution with df degrees of freedom. To find a critical value in the table, select the desired α along the top row and the number of degrees of freedom df in the far left column. For example, if df = 5 and α = 0.05, then the critical value is 11.07; that is, the probability of a value greater than or equal to 11.07 is 0.05.
The critical value for a χ2 distribution having df = 5 when α = 0.05. The area in red indicates the tail probability corresponding to 0.05, or 5%. The boundary of the red section is χ2 = 11.07. The probability of a value greater than or equal to 11.07 is 0.05.
Exact probabilities under the right tail of the χ2 distribution for any value of χ2 can be obtained using one of many computer programs. For example, in the free statistical package R, to find the P-value corresponding to a χ2 value of 11.07 with 5 df, enter the following text into the R console:
1 - pchisq(df = 5, q = 11.07)
More generally, replace 5 and 11.07 with the appropriate df and χ2 calculated for your data. The common spreadsheet program Excel can calculate exact P-values for χ2 as well. In a cell, write
= 1 - CHISQ.DIST(11.07, 5, TRUE)
where the first number in the parentheses is the observed χ2 and the second number is the df.
α df 0.999 0.995 0.99 0.975 0.95 0.05 0.025 0.01 0.005 0.001
1 0.0000016 0.000039 0.00016 0.00098 0.00393 3.84 5.02 6.63 7.88 10.83 2 0.002 0.01 0.02 0.05 0.10 5.99 7.38 9.21 10.60 13.82 3 0.02 0.07 0.11 0.22 0.35 7.81 9.35 11.34 12.84 16.27 4 0.09 0.21 0.30 0.48 0.71 9.49 11.14 13.28 14.86 18.47 5 0.21 0.41 0.55 0.83 1.15 11.07 12.83 15.09 16.75 20.52 6 0.38 0.68 0.87 1.24 1.64 12.59 14.45 16.81 18.55 22.46 7 0.60 0.99 1.24 1.69 2.17 14.07 16.01 18.48 20.28 24.32 8 0.86 1.34 1.65 2.18 2.73 15.51 17.53 20.09 21.95 26.12 9 1.15 1.73 2.09 2.70 3.33 16.92 19.02 21.67 23.59 27.88 10 1.48 2.16 2.56 3.25 3.94 18.31 20.48 23.21 25.19 29.59 11 1.83 2.60 3.05 3.82 4.57 19.68 21.92 24.72 26.76 31.26 12 2.21 3.07 3.57 4.40 5.23 21.03 23.34 26.22 28.30 32.91 13 2.62 3.57 4.11 5.01 5.89 22.36 24.74 27.69 29.82 34.53 14 3.04 4.07 4.66 5.63 6.57 23.68 26.12 29.14 31.32 36.12 15 3.48 4.60 5.23 6.26 7.26 25.00 27.49 30.58 32.80 37.70 16 3.94 5.14 5.81 6.91 7.96 26.30 28.85 32.00 34.27 39.25 17 4.42 5.70 6.41 7.56 8.67 27.59 30.19 33.41 35.72 40.79 18 4.90 6.26 7.01 8.23 9.39 28.87 31.53 34.81 37.16 42.31 19 5.41 6.84 7.63 8.91 10.12 30.14 32.85 36.19 38.58 43.82 20 5.92 7.43 8.26 9.59 10.85 31.41 34.17 37.57 40.00 45.31 21 6.45 8.03 8.90 10.28 11.59 32.67 35.48 38.93 41.40 46.80 22 6.98 8.64 9.54 10.98 12.34 33.92 36.78 40.29 42.80 48.27 23 7.53 9.26 10.20 11.69 13.09 35.17 38.08 41.64 44.18 49.73 24 8.08 9.89 10.86 12.40 13.85 36.42 39.36 42.98 45.56 51.18 25 8.65 10.52 11.52 13.12 14.61 37.65 40.65 44.31 46.93 52.62 26 9.22 11.16 12.20 13.84 15.38 38.89 41.92 45.64 48.29 54.05 27 9.80 11.81 12.88 14.57 16.15 40.11 43.19 46.96 49.64 55.48 28 10.39 12.46 13.56 15.31 16.93 41.34 44.46 48.28 50.99 56.89 29 10.99 13.12 14.26 16.05 17.71 42.56 45.72 49.59 52.34 58.30 30 11.59 13.79 14.95 16.79 18.49 43.77 46.98 50.89 53.67 59.70 31 12.20 14.46 15.66 17.54 19.28 44.99 48.23 52.19 55.00 61.10 32 12.81 15.13 16.36 18.29 20.07 46.19 49.48 53.49 56.33 62.49 33 13.43 15.82 17.07 19.05 20.87 47.40 50.73 54.78 57.65 63.87 34 14.06 16.50 17.79 19.81 21.66 48.60 51.97 56.06 58.96 65.25 35 14.69 17.19 18.51 20.57 22.47 49.80 53.20 57.34 60.27 66.62 36 15.32 17.89 19.23 21.34 23.27 51.00 54.44 58.62 61.58 67.99 37 15.97 18.59 19.96 22.11 24.07 52.19 55.67 59.89 62.88 69.35 38 16.61 19.29 20.69 22.88 24.88 53.38 56.90 61.16 64.18 70.70 39 17.26 20.00 21.43 23.65 25.70 54.57 58.12 62.43 65.48 72.05 40 17.92 20.71 22.16 24.43 26.51 55.76 59.34 63.69 66.77 73.40 41 18.58 21.42 22.91 25.21 27.33 56.94 60.56 64.95 68.05 74.74 42 19.24 22.14 23.65 26.00 28.14 58.12 61.78 66.21 69.34 76.08 43 19.91 22.86 24.40 26.79 28.96 59.30 62.99 67.46 70.62 77.42 44 20.58 23.58 25.15 27.57 29.79 60.48 64.20 68.71 71.89 78.75 45 21.25 24.31 25.90 28.37 30.61 61.66 65.41 69.96 73.17 80.08 46 21.93 25.04 26.66 29.16 31.44 62.83 66.62 71.20
74.44 81.40 47 22.61 25.77 27.42 29.96 32.27 64.00 67.82 72.44 75.70 82.72 48 23.29 26.51 28.18 30.75 33.10 65.17 69.02 73.68 76.97 84.04 49 23.98 27.25 28.94 31.55 33.93 66.34 70.22 74.92 78.23 85.35 50 24.67 27.99 29.71 32.36 34.76 67.50 71.42 76.15 79.49 86.66 51 25.37 28.73 30.48 33.16 35.60 68.67 72.62 77.39 80.75 87.97 52 26.07 29.48 31.25 33.97 36.44 69.83 73.81 78.62 82.00 89.27 53 26.76 30.23 32.02 34.78 37.28 70.99 75.00 79.84 83.25 90.57 54 27.47 30.98 32.79 35.59 38.12 72.15 76.19 81.07 84.50 91.87 55 28.17 31.73 33.57 36.40 38.96 73.31 77.38 82.29 85.75 93.17 56 28.88 32.49 34.35 37.21 39.80 74.47 78.57 83.51 86.99 94.46 57 29.59 33.25 35.13 38.03 40.65 75.62 79.75 84.73 88.24 95.75 58 30.30 34.01 35.91 38.84 41.49 76.78 80.94 85.95 89.48 97.04 59 31.02 34.77 36.70 39.66 42.34 77.93 82.12 87.17 90.72 98.32 60 31.74 35.53 37.48 40.48 43.19 79.08 83.30 88.38 91.95 99.61 61 32.46 36.30 38.27 41.30 44.04 80.23 84.48 89.59 93.19 100.89 62 33.18 37.07 39.06 42.13 44.89 81.38 85.65 90.80 94.42 102.17 63 33.91 37.84 39.86 42.95 45.74 82.53 86.83 92.01 95.65 103.44 64 34.63 38.61 40.65 43.78 46.59 83.68 88.00 93.22 96.88 104.72 65 35.36 39.38 41.44 44.60 47.45 84.82 89.18 94.42 98.11 105.99 66 36.09 40.16 42.24 45.43 48.31 85.96 90.35 95.63 99.33 107.26 67 36.83 40.94 43.04 46.26 49.16 87.11 91.52 96.83 100.55 108.53 68 37.56 41.71 43.84 47.09 50.02 88.25 92.69 98.03 101.78 109.79 69 38.30 42.49 44.64 47.92 50.88 89.39 93.86 99.23 103.00 111.06 70 39.04 43.28 45.44 48.76 51.74 90.53 95.02 100.43 104.21 112.32 71 39.78 44.06 46.25 49.59 52.60 91.67 96.19 101.62 105.43 113.58 72 40.52 44.84 47.05 50.43 53.46 92.81 97.35 102.82 106.65 114.84 73 41.26 45.63 47.86 51.26 54.33 93.95 98.52 104.01 107.86 116.09 74 42.01 46.42 48.67 52.10 55.19 95.08 99.68 105.20 109.07 117.35 75 42.76 47.21 49.48 52.94 56.05 96.22 100.84 106.39 110.29 118.60 76 43.51 48.00 50.29 53.78 56.92 97.35 102.00 107.58 111.50 119.85 77 44.26 48.79 51.10 54.62 57.79 98.48 103.16 108.77 112.70 121.10 78 45.01 49.58 51.91 55.47 58.65 99.62 104.32 109.96 113.91 122.35 79 45.76 50.38 52.72 56.31 59.52 100.75 105.47 111.14 115.12 123.59 80 46.52 51.17 53.54 57.15 60.39 101.88 106.63 112.33 116.32 124.84 81 47.28 51.97 54.36 58.00 61.26 103.01 107.78 113.51 117.52 126.08 82 48.04 52.77 55.17 58.84 62.13 104.14 108.94 114.69 118.73 127.32 83 48.80 53.57 55.99 59.69 63.00 105.27 110.09 115.88 119.93 128.56 84 49.56 54.37 56.81 60.54 63.88 106.39 111.24 117.06 121.13 129.80 85 50.32 55.17 57.63 61.39 64.75 107.52 112.39 118.24 122.32 131.04 86 51.08 55.97 58.46 62.24 65.62 108.65 113.54 119.41 123.52 132.28 87 51.85 56.78 59.28 63.09 66.50 109.77 114.69 120.59 124.72 133.51 88 52.62 57.58 60.10 63.94 67.37 110.90 115.84 121.77 125.91 134.75 89 53.39 58.39 60.93 64.79 68.25 112.02 116.99 122.94 127.11 135.98 90 54.16 59.20 61.75 65.65 69.13 113.15 118.14 124.12 128.30 137.21 91 54.93 60.00 62.58 66.50 70.00 114.27 119.28 125.29 129.49 138.44 92 55.70 60.81 63.41 67.36 70.88 115.39 120.43 126.46 130.68 139.67 93 56.47 61.63 64.24 68.21 71.76 116.51 121.57 127.63 131.87 140.89 94 57.25 62.44 65.07 69.07 72.64 117.63 122.72 128.80 133.06 142.12 95 58.02 63.25 65.90 69.92 73.52 118.75 123.86 129.97 134.25 143.34 96 58.80 64.06 66.73 70.78 74.40 119.87 125.00 131.14 135.43 144.57 97 59.58 64.88 67.56 71.64 75.28 120.99 126.14 132.31 136.62 145.79 98 60.36 65.69 68.40 72.50 76.16 122.11 127.28 133.48 137.80 147.01 99 61.14 66.51 69.23 73.36 77.05 123.23 128.42 134.64 138.99 148.23 100 61.92 67.33 70.06 74.22 77.93 124.34 129.56 135.81 
140.17 149.45
Statistical Table B: The standard normal (Z) distribution
This table gives probabilities under the right tail of the standard normal distribution. To determine the probability of sampling a value greater than or equal to a specific value of Z, find the first two digits of Z (called a and b) in the far left column of the table, and then find the last digit (c) in the top row of the table. For example, to find the probability of sampling a value greater than or equal to Z = 1.96, use a.b = 1.9 and c = 6. This probability is 0.025, or 2.5%.
An example of a probability under the standard normal curve: the probability of sampling a value greater than or equal to the value 1.96 is 0.025, or 2.5%.
Exact probabilities under the right tail of the standard normal distribution can be obtained in any one of various computer packages. In R, the probability under the standard normal distribution of obtaining a value of Z greater than 1.96 is found using
1 - pnorm(q = 1.96)
Replace 1.96 with your desired value of Z. In Excel, this probability can be calculated by entering the following into a spreadsheet cell:
= 1 - NORM.DIST(1.96, 0, 1, TRUE)
First two digits of Z (a.b), down the left column; second digit after decimal (c), across the top: 0 1 2 3 4 5 6 7 8
0.0 0.50000 0.49601 0.49202 0.48803 0.48405 0.48006 0.47608 0.47210 0.46812 0.1 0.46017 0.45620 0.45224 0.44828 0.44433 0.44038 0.43644 0.43251 0.42858 0.2 0.42074 0.41683 0.41294 0.40905 0.40517 0.40129 0.39743 0.39358 0.38974 0.3 0.38209 0.37828 0.37448 0.37070 0.36693 0.36317 0.35942 0.35569 0.35197 0.4 0.34458 0.34090 0.33724 0.33360 0.32997 0.32636 0.32276 0.31918 0.31561 0.5 0.30854 0.30503 0.30153 0.29806 0.29460 0.29116 0.28774 0.28434 0.28096 0.6 0.27425 0.27093 0.26763 0.26435 0.26109 0.25785 0.25463 0.25143 0.24825 0.7 0.24196 0.23885 0.23576 0.23270 0.22965 0.22663 0.22363 0.22065 0.21770 0.8 0.21186 0.20897 0.20611 0.20327 0.20045 0.19766 0.19489 0.19215 0.18943 0.9 0.18406 0.18141 0.17879 0.17619 0.17361 0.17106 0.16853 0.16602 0.16354 1.0 0.15866 0.15625 0.15386 0.15151 0.14917 0.14686 0.14457 0.14231 0.14007 1.1 0.13567 0.13350 0.13136 0.12924 0.12714 0.12507 0.12302 0.12100 0.11900 1.2 0.11507 0.11314 0.11123 0.10935 0.10749 0.10565 0.10383 0.10204 0.10027 1.3 0.09680 0.09510 0.09342 0.09176 0.09012 0.08851 0.08691 0.08534 0.08379 1.4 0.08076 0.07927 0.07780 0.07636 0.07493 0.07353 0.07215 0.07078 0.06944 1.5 0.06681 0.06552 0.06426 0.06301 0.06178 0.06057 0.05938 0.05821 0.05705 1.6 0.05480 0.05370 0.05262 0.05155 0.05050 0.04947 0.04846 0.04746 0.04648 1.7 0.04457 0.04363 0.04272 0.04182 0.04093 0.04006 0.03920 0.03836 0.03754 1.8 0.03593 0.03515 0.03438 0.03362 0.03288 0.03216 0.03144 0.03074 0.03005 1.9 0.02872 0.02807 0.02743 0.02680 0.02619 0.02559 0.02500 0.02442 0.02385 2.0 0.02275 0.02222 0.02169 0.02118 0.02068 0.02018 0.01970 0.01923 0.01876 2.1 0.01786 0.01743 0.01700 0.01659 0.01618 0.01578 0.01539 0.01500 0.01463 2.2 0.01390 0.01355 0.01321 0.01287 0.01255 0.01222 0.01191 0.01160 0.01130 2.3 0.01072 0.01044 0.01017 0.00990 0.00964 0.00939 0.00914 0.00889 0.00866 2.4 0.00820 0.00798 0.00776 0.00755 0.00734 0.00714 0.00695 0.00676 0.00657 2.5 0.00621 0.00604 0.00587 0.00570 0.00554 0.00539 0.00523 0.00508 0.00494 2.6 0.00466 0.00453 0.00440 0.00427 0.00415 0.00402 0.00391 0.00379 0.00368 2.7 0.00347 0.00336 0.00326 0.00317 0.00307 0.00298 0.00289 0.00280 0.00272 2.8 0.00256 0.00248 0.00240 0.00233 0.00226 0.00219 0.00212 0.00205 0.00199 2.9 0.00187 0.00181 0.00175 0.00169 0.00164 0.00159 0.00154 0.00149 0.00144 3.0 0.00135 0.00131 0.00126 0.00122
0.00118 0.00114 0.00111 0.00107 0.00104 3.1 0.00097 0.00094 0.00090 0.00087 0.00084 0.00082 0.00079 0.00076 0.00074 3.2 0.00069 0.00066 0.00064 0.00062 0.00060 0.00058 0.00056 0.00054 0.00052 3.3 0.00048 0.00047 0.00045 0.00043 0.00042 0.00040 0.00039 0.00038 0.00036 3.4 0.00034 0.00032 0.00031 0.00030 0.00029 0.00028 0.00027 0.00026 0.00025 3.5 0.00023 0.00022 0.00022 0.00021 0.00020 0.00019 0.00019 0.00018 0.00017 3.6 0.00016 0.00015 0.00015 0.00014 0.00014 0.00013 0.00013 0.00012 0.00012 3.7 0.00011 0.00010 0.00010 0.00010 0.00009 0.00009 0.00008 0.00008 0.00008 3.8 0.00007 0.00007 0.00007 0.00006 0.00006 0.00006 0.00006 0.00005 0.00005 3.9 0.00005 0.00005 0.00004 0.00004 0.00004 0.00004 0.00004 0.00004 0.00003 4.0 0.00003 0.00003 0.00003 0.00003 0.00003 0.00003 0.00002 0.00002 0.00002
Statistical Table C: Student's t-distribution
This table gives critical values of the t-distribution. The two-sided or two-tailed critical value $t_{\alpha(2),df}$ defines the combined area α under both tails of the t-distribution having df degrees of freedom. To find a critical value in the table, select the desired value of α(2) along the top and the number of degrees of freedom in the far left column. For example, if df = 5 and α(2) = 0.05, the critical value is 2.57; that is, the probability of a t-value greater than or equal to 2.57, or less than or equal to −2.57, is 0.05. The left panel of the following figure illustrates the critical value for this example.
Critical values of the t-distribution having five degrees of freedom when α = 0.05. The left panel shows the two-tailed case. The area in red shows the combined tail probabilities of 0.05, with 2.5% of the probability in each tail. The boundaries of the areas in red are −2.57 and 2.57. The right panel shows the one-tailed case. The area in red indicates the right tail of the distribution corresponding to 5% of the total probability. The boundary of the red section is 2.02.
The one-tailed critical value $t_{\alpha(1),df}$ defines the area α under the right tail of the t-distribution having df degrees of freedom. To find a critical value in the table, select the desired value of α(1) along the top and the number of degrees of freedom in the far left column. For example, if df = 5 and α(1) = 0.05, the critical value is 2.02; that is, the probability of a value greater than or equal to 2.02 is 0.05. The right panel of the preceding figure illustrates the critical value for this example.
To calculate an exact P-value in R for a two-sided test when t = 2.57 and df = 5, use the command
2 * (1 - pt(q = abs(2.57), df = 5))
Replace 2.57 and 5 with your observed values. The "abs" refers to absolute value and is included in the case of a negative t-value.
In Excel, for the same calculation, use
= 2 * (1 - T.DIST(ABS(2.57), 5, TRUE))
df α(2): 0.2 0.10 0.05 0.02 0.01 0.001 0.0001 α(1): 0.1 0.05 0.025 0.01 0.005 0.0005 0.00005
1 3.08 6.31 12.71 31.82 63.66 636.62 6366.20 2 1.89 2.92 4.30 6.96 9.92 31.60 99.99 3 1.64 2.35 3.18 4.54 5.84 12.92 28.00 4 1.53 2.13 2.78 3.75 4.60 8.61 15.54 5 1.48 2.02 2.57 3.36 4.03 6.87 11.18 6 1.44 1.94 2.45 3.14 3.71 5.96 9.08 7 1.41 1.89 2.36 3.00 3.50 5.41 7.88 8 1.40 1.86 2.31 2.90 3.36 5.04 7.12 9 1.38 1.83 2.26 2.82 3.25 4.78 6.59 10 1.37 1.81 2.23 2.76 3.17 4.59 6.21 11 1.36 1.80 2.20 2.72 3.11 4.44 5.92 12 1.36 1.78 2.18 2.68 3.05 4.32 5.69 13 1.35 1.77 2.16 2.65 3.01 4.22 5.51 14 1.35 1.76 2.14 2.62 2.98 4.14 5.36 15 1.34 1.75 2.13 2.60 2.95 4.07 5.24 16 1.34 1.75 2.12 2.58 2.92 4.01 5.13 17 1.33 1.74 2.11 2.57 2.90 3.97 5.04 18 1.33 1.73 2.10 2.55 2.88 3.92 4.97 19 1.33 1.73 2.09 2.54 2.86 3.88 4.90 20 1.33 1.72 2.09 2.53 2.85 3.85 4.84 21 1.32 1.72 2.08 2.52 2.83 3.82 4.78 22 1.32 1.72 2.07 2.51 2.82 3.79 4.74 23 1.32 1.71 2.07 2.50 2.81 3.77 4.69 24 1.32 1.71 2.06 2.49 2.80 3.75 4.65 25 1.32 1.71 2.06 2.49 2.79 3.73 4.62 26 1.31 1.71 2.06 2.48 2.78 3.71 4.59 27 1.31 1.70 2.05 2.47 2.77 3.69 4.56 28 1.31 1.70 2.05 2.47 2.76 3.67 4.53 29 1.31 1.70 2.05 2.46 2.76 3.66 4.51 30 1.31 1.70 2.04 2.46 2.75 3.65 4.48 31 1.31 1.70 2.04 2.45 2.74 3.63 4.46 32 1.31 1.69 2.04 2.45 2.74 3.62 4.44 33 1.31 1.69 2.03 2.44 2.73 3.61 4.42 34 1.31 1.69 2.03 2.44 2.73 3.60 4.41 35 1.31 1.69 2.03 2.44 2.72 3.59 4.39 36 1.31 1.69 2.03 2.43 2.72 3.58 4.37 37 1.30 1.69 2.03 2.43 2.72 3.57 4.36 38 1.30 1.69 2.02 2.43 2.71 3.57 4.35 39 1.30 1.68 2.02 2.43 2.71 3.56 4.33 40 1.30 1.68 2.02 2.42 2.70 3.55 4.32 41 1.30 1.68 2.02 2.42 2.70 3.54 4.31 42 1.30 1.68 2.02 2.42 2.70 3.54 4.30 43 1.30 1.68 2.02 2.42 2.70 3.53 4.29 44 1.30 1.68 2.02 2.41 2.69 3.53 4.28 45 1.30 1.68 2.01 2.41 2.69 3.52 4.27 46 1.30 1.68 2.01 2.41 2.69 3.51 4.26 47 1.30 1.68 2.01 2.41 2.68 3.51 4.25 48 1.30 1.68 2.01 2.41 2.68 3.51 4.24 49 1.30 1.68 2.01 2.40 2.68 3.50 4.24 50 1.30 1.68 2.01 2.40 2.68 3.50 4.23 51 1.30 1.68 2.01 2.40 2.68 3.49 4.22 52 1.30 1.67 2.01 2.40 2.67 3.49 4.21 53 1.30 1.67 2.01 2.40 2.67 3.48 4.21 54 1.30 1.67 2.00 2.40 2.67 3.48 4.20 55 1.30 1.67 2.00 2.40 2.67 3.48 4.20 56 1.30 1.67 2.00 2.39 2.67 3.47 4.19 57 1.30 1.67 2.00 2.39 2.66 3.47 4.18 58 1.30 1.67 2.00 2.39 2.66 3.47 4.18 59 1.30 1.67 2.00 2.39 2.66 3.46 4.17 60 1.30 1.67 2.00 2.39 2.66 3.46 4.17 61 1.30 1.67 2.00 2.39 2.66 3.46 4.16 62 1.30 1.67 2.00 2.39 2.66 3.45 4.16 63 1.30 1.67 2.00 2.39 2.66 3.45 4.15 64 1.29 1.67 2.00 2.39 2.65 3.45 4.15 65 1.29 1.67 2.00 2.39 2.65 3.45 4.15 66 1.29 1.67 2.00 2.38 2.65 3.44 4.14 67 1.29 1.67 2.00 2.38 2.65 3.44 4.14 68 1.29 1.67 2.00 2.38 2.65 3.44 4.13 69 1.29 1.67 1.99 2.38 2.65 3.44 4.13 70 1.29 1.67 1.99 2.38 2.65 3.44 4.13 71 1.29 1.67 1.99 2.38 2.65 3.43 4.12 72 1.29 1.67 1.99 2.38 2.65 3.43 4.12 73 1.29 1.67 1.99 2.38 2.64 3.43 4.12 74 1.29 1.67 1.99 2.38 2.64 3.43 4.11 75 1.29 1.67 1.99 2.38 2.64 3.43 4.11 76 1.29 1.67 1.99 2.38 2.64 3.42 4.11 77 1.29 1.66 1.99 2.38 2.64 3.42 4.10 78 1.29 1.66 1.99 2.38 2.64 3.42 4.10 79 1.29 1.66 1.99 2.37 2.64 3.42 4.10 80 1.29 1.66 1.99 2.37 2.64 3.42 4.10 81 1.29 1.66 1.99 2.37 2.64 3.41 4.09 82 1.29 1.66 1.99 2.37 2.64 3.41 4.09 83 1.29 1.66 1.99 2.37 2.64 3.41 4.09 84 1.29 1.66 1.99 2.37 2.64 3.41 4.09 85 1.29 1.66 1.99 2.37 2.63 3.41 4.08 86 1.29 1.66 1.99 2.37 2.63 3.41 4.08 87 1.29 1.66 1.99 2.37 2.63 3.41 4.08 88 1.29 1.66 1.99 2.37 2.63 3.40 4.08 89 1.29 1.66 1.99 2.37 2.63 3.40
4.07 90 1.29 1.66 1.99 2.37 2.63 3.40 4.07 100 1.29 1.66 1.98 2.36 2.63 3.39 4.05 120 1.29 1.66 1.98 2.36 2.62 3.37 4.03 140 1.29 1.66 1.98 2.35 2.61 3.36 4.01 160 1.29 1.65 1.97 2.35 2.61 3.35 3.99 180 1.29 1.65 1.97 2.35 2.60 3.35 3.98 200 1.29 1.65 1.97 2.35 2.60 3.34 3.97 400 1.28 1.65 1.97 2.34 2.59 3.32 3.93 1000 1.28 1.65 1.96 2.33 2.58 3.30 3.91
Statistical Table D: The F-distribution
These tables give critical values of the F-distribution for α(1) = 0.05, α(1) = 0.025, and α(1) = 0.01. The critical value $F_{\alpha(1),df_1,df_2}$ defines the area α under the right tail of the F-distribution having df1 and df2 degrees of freedom. To find a critical value in the table, select the numerator degrees of freedom (df1) listed across the top row and the denominator degrees of freedom (df2) given in the first column. For example, if df1 = 9, df2 = 12, and α = 0.05, then the critical value is 2.80; that is, the probability of a value greater than or equal to 2.80 is 0.05.
The critical value for the F-distribution having nine and 12 degrees of freedom when α = 0.05. The area in red indicates the tail probability corresponding to 0.05, or 5%. The boundary of the red section is 2.80.
Exact probabilities under the right tail of the F-distribution can be calculated in the R statistical package. For example, to find the P-value corresponding to F = 2.80 with degrees of freedom 9 (for the numerator) and 12 (for the denominator), use the command
1 - pf(q = 2.80, df1 = 9, df2 = 12)
To use this command, replace 2.80, 9, and 12 with your own values. In Excel, enter the following in a cell for the same calculation:
= 1 - F.DIST(2.80, 9, 12, TRUE)
Critical value of F, α(1) = 0.05, α(2) = 0.10 Denominator df Numerator df 1 2 3 4 5 6 7 8 9 10 1 161.45 199.50 215.71 224.58 230.16 233.99 236.77 238.88 240.54 241.88 2 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.38 19.40 3 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45 18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38 20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32 22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30 23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27 24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24 26 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22 27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20 28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19 29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18 30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 50 4.03 3.18 2.79 2.56 2.40 2.29 2.20 2.13 2.07 2.03 60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 70 3.98 3.13 2.74 2.50 2.35 2.23
2.14 2.07 2.02 1.97 80 3.96 3.11 2.72 2.49 2.33 2.21 2.13 2.06 2.00 1.95 90 3.95 3.10 2.71 2.47 2.32 2.20 2.11 2.04 1.99 1.94 100 3.94 3.09 2.70 2.46 2.31 2.19 2.10 2.03 1.97 1.93 200 3.89 3.04 2.65 2.42 2.26 2.14 2.06 1.98 1.93 1.88 400 3.86 3.02 2.63 2.39 2.24 2.12 2.03 1.96 1.90 1.85 Critical value of F, α(1) = 0.05, α(2) = 0.10 Denominator df Numerator df 12 15 20 30 40 60 100 200 400 1000 1 243.91 245.95 248.01 250.10 251.14 252.20 253.04 253.68 254.00 254.19 2 19.41 19.43 19.45 19.46 19.47 19.48 19.49 19.49 19.49 19.49 3 8.74 8.70 8.66 8.62 8.59 8.57 8.55 8.54 8.53 8.53 4 5.91 5.86 5.80 5.75 5.72 5.69 5.66 5.65 5.64 5.63 5 4.68 4.62 4.56 4.50 4.46 4.43 4.41 4.39 4.38 4.37 6 4.00 3.94 3.87 3.81 3.77 3.74 3.71 3.69 3.68 3.67 7 3.57 3.51 3.44 3.38 3.34 3.30 3.27 3.25 3.24 3.23 8 3.28 3.22 3.15 3.08 3.04 3.01 2.97 2.95 2.94 2.93 9 3.07 3.01 2.94 2.86 2.83 2.79 2.76 2.73 2.72 2.71 10 2.91 2.85 2.77 2.70 2.66 2.62 2.59 2.56 2.55 2.54 11 2.79 2.72 2.65 2.57 2.53 2.49 2.46 2.43 2.42 2.41 12 2.69 2.62 2.54 2.47 2.43 2.38 2.35 2.32 2.31 2.30 13 2.60 2.53 2.46 2.38 2.34 2.30 2.26 2.23 2.22 2.21 14 2.53 2.46 2.39 2.31 2.27 2.22 2.19 2.16 2.15 2.14 15 2.48 2.40 2.33 2.25 2.20 2.16 2.12 2.10 2.08 2.07 16 2.42 2.35 2.28 2.19 2.15 2.11 2.07 2.04 2.02 2.02 17 2.38 2.31 2.23 2.15 2.10 2.06 2.02 1.99 1.98 1.97 18 2.34 2.27 2.19 2.11 2.06 2.02 1.98 1.95 1.93 1.92 19 2.31 2.23 2.16 2.07 2.03 1.98 1.94 1.91 1.89 1.88 20 2.28 2.20 2.12 2.04 1.99 1.95 1.91 1.88 1.86 1.85 21 2.25 2.18 2.10 2.01 1.96 1.92 1.88 1.84 1.83 1.82 22 2.23 2.15 2.07 1.98 1.94 1.89 1.85 1.82 1.80 1.79 23 2.20 2.13 2.05 1.96 1.91 1.86 1.82 1.79 1.77 1.76 24 2.18 2.11 2.03 1.94 1.89 1.84 1.80 1.77 1.75 1.74 25 2.16 2.09 2.01 1.92 1.87 1.82 1.78 1.75 1.73 1.72 26 2.15 2.07 1.99 1.90 1.85 1.80 1.76 1.73 1.71 1.70 27 2.13 2.06 1.97 1.88 1.84 1.79 1.74 1.71 1.69 1.68 28 2.12 2.04 1.96 1.87 1.82 1.77 1.73 1.69 1.67 1.66 29 2.10 2.03 1.94 1.85 1.81 1.75 1.71 1.67 1.66 1.65 30 2.09 2.01 1.93 1.84 1.79 1.74 1.70 1.66 1.64 1.63 40 2.00 1.92 1.84 1.74 1.69 1.64 1.59 1.55 1.53 1.52 50 1.95 1.87 1.78 1.69 1.63 1.58 1.52 1.48 1.46 1.45 60 1.92 1.84 1.75 1.65 1.59 1.53 1.48 1.44 1.41 1.40 70 1.89 1.81 1.72 1.62 1.57 1.50 1.45 1.40 1.38 1.36 80 1.88 1.79 1.70 1.60 1.54 1.48 1.43 1.38 1.35 1.34 90 1.86 1.78 1.69 1.59 1.53 1.46 1.41 1.36 1.33 1.31 100 1.85 1.77 1.68 1.57 1.52 1.45 1.39 1.34 1.31 1.30 200 1.80 1.72 1.62 1.52 1.46 1.39 1.32 1.26 1.23 1.21 400 1.78 1.69 1.60 1.49 1.42 1.35 1.28 1.22 1.18 1.15 Critical value of F, α(1) = 0.025, α(2) = 0.05 Denominator df Numerator df 1 2 3 4 5 6 7 8 9 10 1 647.79 799.5 864.16 899.58 921.85 937.11 948.22 956.66 963.28 968.63 2 38.51 39.00 39.17 39.25 39.30 39.33 39.36 39.37 39.39 39.40 3 17.44 16.04 15.44 15.10 14.88 14.73 14.62 14.54 14.47 14.42 4 12.22 10.65 9.98 9.60 9.36 9.20 9.07 8.98 8.90 8.84 5 10.01 8.43 7.76 7.39 7.15 6.98 6.85 6.76 6.68 6.62 6 8.81 7.26 6.60 6.23 5.99 5.82 5.70 5.60 5.52 5.46 7 8.07 6.54 5.89 5.52 5.29 5.12 4.99 4.90 4.82 4.76 8 7.57 6.06 5.42 5.05 4.82 4.65 4.53 4.43 4.36 4.30 9 7.21 5.71 5.08 4.72 4.48 4.32 4.20 4.10 4.03 3.96 10 6.94 5.46 4.83 4.47 4.24 4.07 3.95 3.85 3.78 3.72 11 6.72 5.26 4.63 4.28 4.04 3.88 3.76 3.66 3.59 3.53 12 6.55 5.10 4.47 4.12 3.89 3.73 3.61 3.51 3.44 3.37 13 6.41 4.97 4.35 4.00 3.77 3.60 3.48 3.39 3.31 3.25 14 6.30 4.86 4.24 3.89 3.66 3.50 3.38 3.29 3.21 3.15 15 6.20 4.77 4.15 3.80 3.58 3.41 3.29 3.20 3.12 3.06 16 6.12 4.69 4.08 3.73 3.50 3.34 3.22 3.12 3.05 2.99 17 6.04 4.62 4.01 3.66 3.44 3.28 3.16 3.06 2.98 2.92 18 5.98 4.56 3.95 3.61 3.38 3.22 
3.10 3.01 2.93 2.87 19 5.92 4.51 3.90 3.56 3.33 3.17 3.05 2.96 2.88 2.82 20 5.87 4.46 3.86 3.51 3.29 3.13 3.01 2.91 2.84 2.77 21 5.83 4.42 3.82 3.48 3.25 3.09 2.97 2.87 2.80 2.73 22 5.79 4.38 3.78 3.44 3.22 3.05 2.93 2.84 2.76 2.70 23 5.75 4.35 3.75 3.41 3.18 3.02 2.90 2.81 2.73 2.67 24 5.72 4.32 3.72 3.38 3.15 2.99 2.87 2.78 2.70 2.64 25 5.69 4.29 3.69 3.35 3.13 2.97 2.85 2.75 2.68 2.61 26 5.66 4.27 3.67 3.33 3.10 2.94 2.82 2.73 2.65 2.59 27 5.63 4.24 3.65 3.31 3.08 2.92 2.80 2.71 2.63 2.57 28 5.61 4.22 3.63 3.29 3.06 2.90 2.78 2.69 2.61 2.55 29 5.59 4.20 3.61 3.27 3.04 2.88 2.76 2.67 2.59 2.53 30 5.57 4.18 3.59 3.25 3.03 2.87 2.75 2.65 2.57 2.51 40 5.42 4.05 3.46 3.13 2.90 2.74 2.62 2.53 2.45 2.39 50 5.34 3.97 3.39 3.05 2.83 2.67 2.55 2.46 2.38 2.32 60 5.29 3.93 3.34 3.01 2.79 2.63 2.51 2.41 2.33 2.27 70 5.25 3.89 3.31 2.97 2.75 2.59 2.47 2.38 2.30 2.24 80 5.22 3.86 3.28 2.95 2.73 2.57 2.45 2.35 2.28 2.21 90 5.20 3.84 3.26 2.93 2.71 2.55 2.43 2.34 2.26 2.19 100 5.18 3.83 3.25 2.92 2.70 2.54 2.42 2.32 2.24 2.18 200 5.10 3.76 3.18 2.85 2.63 2.47 2.35 2.26 2.18 2.11 400 5.06 3.72 3.15 2.82 2.60 2.44 2.32 2.22 2.15 2.08 Critical value of F, α(1) = 0.025, α(2) = 0.05 Denominator df Numerator df 12 15 20 30 40 60 100 200 400 1 976.71 984.87 993.1 1001.41 1005.60 1009.80 1013.17 1015.71 1016.98 2 39.41 39.43 39.45 39.46 39.47 39.48 39.49 39.49 39.5 3 14.34 14.25 14.17 14.08 14.04 13.99 13.96 13.93 13.92 4 8.75 8.66 8.56 8.46 8.41 8.36 8.32 8.29 8.27 5 6.52 6.43 6.33 6.23 6.18 6.12 6.08 6.05 6.03 6 5.37 5.27 5.17 5.07 5.01 4.96 4.92 4.88 4.87 7 4.67 4.57 4.47 4.36 4.31 4.25 4.21 4.18 4.16 8 4.20 4.10 4.00 3.89 3.84 3.78 3.74 3.70 3.69 9 3.87 3.77 3.67 3.56 3.51 3.45 3.40 3.37 3.35 10 3.62 3.52 3.42 3.31 3.26 3.20 3.15 3.12 3.10 11 3.43 3.33 3.23 3.12 3.06 3.00 2.96 2.92 2.90 12 3.28 3.18 3.07 2.96 2.91 2.85 2.80 2.76 2.74 13 3.15 3.05 2.95 2.84 2.78 2.72 2.67 2.63 2.61 14 3.05 2.95 2.84 2.73 2.67 2.61 2.56 2.53 2.51 15 2.96 2.86 2.76 2.64 2.59 2.52 2.47 2.44 2.42 16 2.89 2.79 2.68 2.57 2.51 2.45 2.40 2.36 2.34 17 2.82 2.72 2.62 2.50 2.44 2.38 2.33 2.29 2.27 18 2.77 2.67 2.56 2.44 2.38 2.32 2.27 2.23 2.21 19 2.72 2.62 2.51 2.39 2.33 2.27 2.22 2.18 2.15 20 2.68 2.57 2.46 2.35 2.29 2.22 2.17 2.13 2.11 21 2.64 2.53 2.42 2.31 2.25 2.18 2.13 2.09 2.06 22 2.60 2.50 2.39 2.27 2.21 2.14 2.09 2.05 2.03 23 2.57 2.47 2.36 2.24 2.18 2.11 2.06 2.01 1.99 24 2.54 2.44 2.33 2.21 2.15 2.08 2.02 1.98 1.96 25 2.51 2.41 2.30 2.18 2.12 2.05 2.00 1.95 1.93 26 2.49 2.39 2.28 2.16 2.09 2.03 1.97 1.92 1.90 27 2.47 2.36 2.25 2.13 2.07 2.00 1.94 1.90 1.88 28 2.45 2.34 2.23 2.11 2.05 1.98 1.92 1.88 1.85 29 2.43 2.32 2.21 2.09 2.03 1.96 1.90 1.86 1.83 30 2.41 2.31 2.20 2.07 2.01 1.94 1.88 1.84 1.81 40 2.29 2.18 2.07 1.94 1.88 1.80 1.74 1.69 1.66 50 2.22 2.11 1.99 1.87 1.80 1.72 1.66 1.60 1.57 60 2.17 2.06 1.94 1.82 1.74 1.67 1.60 1.54 1.51 70 2.14 2.03 1.91 1.78 1.71 1.63 1.56 1.50 1.47 80 2.11 2.00 1.88 1.75 1.68 1.60 1.53 1.47 1.43 90 2.09 1.98 1.86 1.73 1.66 1.58 1.50 1.44 1.41 100 2.08 1.97 1.85 1.71 1.64 1.56 1.48 1.42 1.39 200 2.01 1.90 1.78 1.64 1.56 1.47 1.39 1.32 1.28 400 1.98 1.87 1.74 1.60 1.52 1.43 1.35 1.27 1.22 Critical value of F, α(1) = 0.01, α(2) = 0.02 Denominator df Numerator df 1 2 3 4 5 6 7 8 9 10 1 4052 4999 5403 5624 5763 5859 5928 5981 6022 6055 2 98.50 99.00 99.17 99.25 99.30 99.33 99.36 99.37 99.39 99.40 3 34.12 30.82 29.46 28.71 28.24 27.91 27.67 27.49 27.35 27.23 4 21.20 18.00 16.69 15.98 15.52 15.21 14.98 14.80 14.66 14.55 5 16.26 13.27 12.06 11.39 10.97 10.67 10.46 10.29 10.16 10.05 6 13.75 
10.92 9.78 9.15 8.75 8.47 8.26 8.10 7.98 7.87 7 12.25 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62 8 11.26 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81 9 10.56 8.02 6.99 6.42 6.06 5.80 5.61 5.47 5.35 5.26 10 10.04 7.56 6.55 5.99 5.64 5.39 5.20 5.06 4.94 4.85 11 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54 12 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39 4.30 13 9.07 6.70 5.74 5.21 4.86 4.62 4.44 4.30 4.19 4.10 14 8.86 6.51 5.56 5.04 4.69 4.46 4.28 4.14 4.03 3.94 15 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.89 3.80 16 8.53 6.23 5.29 4.77 4.44 4.20 4.03 3.89 3.78 3.69 17 8.40 6.11 5.18 4.67 4.34 4.10 3.93 3.79 3.68 3.59 18 8.29 6.01 5.09 4.58 4.25 4.01 3.84 3.71 3.60 3.51 19 8.18 5.93 5.01 4.50 4.17 3.94 3.77 3.63 3.52 3.43 20 8.10 5.85 4.94 4.43 4.10 3.87 3.70 3.56 3.46 3.37 21 8.02 5.78 4.87 4.37 4.04 3.81 3.64 3.51 3.40 3.31 22 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.35 3.26 23 7.88 5.66 4.76 4.26 3.94 3.71 3.54 3.41 3.30 3.21 24 7.82 5.61 4.72 4.22 3.90 3.67 3.50 3.36 3.26 3.17 25 7.77 5.57 4.68 4.18 3.85 3.63 3.46 3.32 3.22 3.13 26 7.72 5.53 4.64 4.14 3.82 3.59 3.42 3.29 3.18 3.09 27 7.68 5.49 4.60 4.11 3.78 3.56 3.39 3.26 3.15 3.06 28 7.64 5.45 4.57 4.07 3.75 3.53 3.36 3.23 3.12 3.03 29 7.60 5.42 4.54 4.04 3.73 3.50 3.33 3.20 3.09 3.00 30 7.56 5.39 4.51 4.02 3.70 3.47 3.30 3.17 3.07 2.98 40 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.89 2.80 50 7.17 5.06 4.20 3.72 3.41 3.19 3.02 2.89 2.78 2.70 60 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72 2.63 70 7.01 4.92 4.07 3.60 3.29 3.07 2.91 2.78 2.67 2.59 80 6.96 4.88 4.04 3.56 3.26 3.04 2.87 2.74 2.64 2.55 90 6.93 4.85 4.01 3.53 3.23 3.01 2.84 2.72 2.61 2.52 100 6.90 4.82 3.98 3.51 3.21 2.99 2.82 2.69 2.59 2.50 200 6.76 4.71 3.88 3.41 3.11 2.89 2.73 2.60 2.50 2.41 400 6.70 4.66 3.83 3.37 3.06 2.85 2.68 2.56 2.45 2.37 Critical value of F, α(1) = 0.01, α(2) = 0.02 Denominator df 12 Numerator df 15 20 30 40 60 100 200 400 1000 1 6106 6157 6208 6260 6286 6313 6334 6350 6357 6362 2 99.42 99.43 99.45 99.47 99.47 99.48 99.49 99.49 99.50 99.50 3 27.05 26.87 26.69 26.50 26.41 26.32 26.24 26.18 26.15 26.14 4 14.37 14.20 14.02 13.84 13.75 13.65 13.58 13.52 13.49 13.47 5 9.89 9.72 9.55 9.38 9.29 9.20 9.13 9.08 9.05 9.03 6 7.72 7.56 7.40 7.23 7.14 7.06 6.99 6.93 6.91 6.89 7 6.47 6.31 6.16 5.99 5.91 5.82 5.75 5.70 5.68 5.66 8 5.67 5.52 5.36 5.20 5.12 5.03 4.96 4.91 4.89 4.87 9 5.11 4.96 4.81 4.65 4.57 4.48 4.41 4.36 4.34 4.32 10 4.71 4.56 4.41 4.25 4.17 4.08 4.01 3.96 3.94 3.92 11 4.40 4.25 4.10 3.94 3.86 3.78 3.71 3.66 3.63 3.61 12 4.16 4.01 3.86 3.70 3.62 3.54 3.47 3.41 3.39 3.37 13 3.96 3.82 3.66 3.51 3.43 3.34 3.27 3.22 3.19 3.18 14 3.80 3.66 3.51 3.35 3.27 3.18 3.11 3.06 3.03 3.02 15 3.67 3.52 3.37 3.21 3.13 3.05 2.98 2.92 2.90 2.88 16 3.55 3.41 3.26 3.10 3.02 2.93 2.86 2.81 2.78 2.76 17 3.46 3.31 3.16 3.00 2.92 2.83 2.76 2.71 2.68 2.66 18 3.37 3.23 3.08 2.92 2.84 2.75 2.68 2.62 2.59 2.58 19 3.30 3.15 3.00 2.84 2.76 2.67 2.60 2.55 2.52 2.50 20 3.23 3.09 2.94 2.78 2.69 2.61 2.54 2.48 2.45 2.43 21 3.17 3.03 2.88 2.72 2.64 2.55 2.48 2.42 2.39 2.37 22 3.12 2.98 2.83 2.67 2.58 2.50 2.42 2.36 2.34 2.32 23 3.07 2.93 2.78 2.62 2.54 2.45 2.37 2.32 2.29 2.27 24 3.03 2.89 2.74 2.58 2.49 2.40 2.33 2.27 2.24 2.22 25 2.99 2.85 2.70 2.54 2.45 2.36 2.29 2.23 2.20 2.18 26 2.96 2.81 2.66 2.50 2.42 2.33 2.25 2.19 2.16 2.14 27 2.93 2.78 2.63 2.47 2.38 2.29 2.22 2.16 2.13 2.11 28 2.90 2.75 2.60 2.44 2.35 2.26 2.19 2.13 2.10 2.08 29 2.87 2.73 2.57 2.41 2.33 2.23 2.16 2.10 2.07 2.05 30 2.84 2.70 2.55 2.39 2.30 2.21 2.13 2.07 2.04 2.02 40 2.66 2.52 2.37 2.20 2.11 2.02 
1.94 1.87 1.84 1.82 50 2.56 2.42 2.27 2.10 2.01 1.91 1.82 1.76 1.72 1.70 60 2.50 2.35 2.20 2.03 1.94 1.84 1.75 1.68 1.64 1.62 70 2.45 2.31 2.15 1.98 1.89 1.78 1.70 1.62 1.58 1.56 80 2.42 2.27 2.12 1.94 1.85 1.75 1.65 1.58 1.54 1.51 90 2.39 2.24 2.09 1.92 1.82 1.72 1.62 1.55 1.50 1.48 100 2.37 2.22 2.07 1.89 1.80 1.69 1.60 1.52 1.47 1.45 200 2.27 2.13 1.97 1.79 1.69 1.58 1.48 1.39 1.34 1.30 400 2.23 2.08 1.92 1.75 1.64 1.53 1.42 1.32 1.26 1.2
Statistical Table E: Mann-Whitney U-distribution
These tables give the two-tailed critical values of the U-distribution for α = 0.05 and α = 0.01. U is the larger of U1 and U2. The critical value $U_{\alpha(2),n_1,n_2}$ defines the area α under the right tail of the U-distribution corresponding to sample sizes n1 and n2. To find a critical value in the table, select n1 from the top row and n2 from the far left column. A "—" means that it is not possible to reject a null hypothesis with that α and those sample sizes. For larger sample sizes, use the Z-approximation from Chapter 13. For example, if n1 = 5, n2 = 7, and α = 0.05, then the critical value is 30; that is, the probability of a value greater than or equal to 30 is 0.05 or less (it may be less than 0.05 because the U-distribution is discrete, and no critical value may correspond exactly to 0.05). The following figure illustrates the critical value for this example.
The two-tailed critical value of the Mann-Whitney U-distribution when n1 = 5, n2 = 7, and α = 0.05. The area in red indicates the tail probability corresponding to 0.05, or 5%. The boundary of the red section is 30. Thus, the probability of a value greater than or equal to 30 is 0.05 or less. (The distribution gives the two-tailed value, because it measures the probability that either U1 or U2 is greater than $U_{0.05(2),n_1,n_2}$.)
α(2) = 0.05 n1 n2 3 4 5 6 7 8 9 10 11 12 13 14 15
3 — — 15 17 20 22 25 27 30 32 35 37 40 4 — 16 19 22 25 28 32 35 38 41 44 47 50 5 15 19 23 27 30 34 38 42 46 49 53 57 61 6 17 22 27 31 36 40 44 49 53 58 62 67 71 7 20 25 30 36 41 46 51 56 61 66 71 76 81 8 22 28 34 40 46 51 57 63 69 74 80 86 91 9 25 32 38 44 51 57 64 70 76 82 89 95 101 10 27 35 42 49 56 63 70 77 84 91 97 104 111 11 30 38 46 53 61 69 76 84 91 99 106 114 121 12 32 41 49 58 66 74 82 91 99 107 115 123 131 13 35 44 53 62 71 80 89 97 106 115 124 132 141 14 37 47 57 67 76 86 95 104 114 123 132 141 151 15 40 50 61 71 81 91 101 111 121 131 141 151 161
α(2) = 0.01 n1 n2 3 4 5 6 7 8 9 10 11 12 13 14 15
3 — — — — — — 27 30 33 35 38 41 43 4 — — — 24 28 31 35 38 42 45 49 52 55 5 — — 25 29 34 38 42 46 50 54 58 63 67 6 — 24 29 34 39 44 49 54 59 63 68 73 78 7 — 28 34 39 45 50 56 61 67 72 78 83 89 8 — 31 38 44 50 57 63 69 75 81 87 94 100 9 27 35 42 49 56 63 70 77 83 90 97 104 111 10 30 38 46 54 61 69 77 84 92 99 106 114 121 11 33 42 50 59 67 75 83 92 100 108 116 124 132 12 35 45 54 63 72 81 90 99 108 117 125 134 143 13 38 49 58 68 78 87 97 106 116 125 135 144 153 14 41 52 63 73 83 94 104 114 124 134 144 154 164 15 43 55 67 78 89 100 111 121 132 143 153 164 174
Statistical Table F: Tukey-Kramer q-distribution
This table gives the critical values of the Tukey-Kramer q-distribution¹ for α = 0.05. The critical value $q_{0.05,k,N-k}$ defines the area 0.05 under the right tail of the q-distribution having k groups and N − k degrees of freedom, where N is the total sample size of all groups combined. The probability of a q-value greater than or equal to $q_{0.05,k,N-k}$ is 0.05, or 5%.
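The values in the table below can also be reproduced in R. The built-in qtukey() returns the conventional studentized range, which (per the footnote following Table G) is √2 times the q used here; a minimal sketch of the conversion:
# Critical q for k groups and N - k error degrees of freedom
qtukey(0.95, nmeans = 2, df = 10) / sqrt(2)  # 2.23, the first entry in the table
qtukey(0.95, nmeans = 5, df = 20) / sqrt(2)  # 2.99, the k = 5, df = 20 entry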
Number of groups (k) dferror 2 3 4 5 6 7 8 9 10 11 12 13 14 15
10 2.23 2.74 3.06 3.29 3.47 3.62 3.75 3.86 3.96 4.05 4.12 4.20 4.26 4.32 11 2.20 2.70 3.01 3.23 3.41 3.56 3.68 3.79 3.88 3.96 4.04 4.11 4.17 4.23 12 2.18 2.67 2.97 3.19 3.36 3.50 3.62 3.72 3.81 3.90 3.97 4.04 4.10 4.16 13 2.16 2.64 2.94 3.15 3.32 3.45 3.57 3.67 3.76 3.84 3.91 3.98 4.04 4.09 14 2.14 2.62 2.91 3.12 3.28 3.41 3.53 3.63 3.72 3.79 3.86 3.93 3.99 4.04 15 2.13 2.60 2.88 3.09 3.25 3.38 3.49 3.59 3.68 3.75 3.82 3.88 3.94 3.99 16 2.12 2.58 2.86 3.06 3.22 3.35 3.46 3.56 3.64 3.72 3.78 3.85 3.90 3.95 17 2.11 2.57 2.84 3.04 3.20 3.33 3.44 3.53 3.61 3.69 3.75 3.81 3.87 3.92 18 2.10 2.55 2.83 3.02 3.18 3.30 3.41 3.50 3.59 3.66 3.72 3.78 3.84 3.89 19 2.09 2.54 2.81 3.01 3.16 3.16 3.39 3.48 3.56 3.63 3.70 3.76 3.81 3.86 20 2.09 2.53 2.80 2.99 3.14 3.27 3.37 3.46 3.54 3.61 3.68 3.73 3.79 3.84 21 2.08 2.52 2.79 2.98 3.13 3.25 3.35 3.44 3.52 3.59 3.66 3.71 3.77 3.82 22 2.07 2.51 2.78 2.97 3.12 3.24 3.34 3.43 3.51 3.57 3.64 3.69 3.75 3.80 23 2.07 2.50 2.77 2.96 3.10 3.22 3.32 3.41 3.49 3.56 3.62 3.68 3.73 3.78 24 2.06 2.50 2.76 2.95 3.09 3.21 3.31 3.40 3.48 3.54 3.61 3.66 3.71 3.76 25 2.06 2.49 2.75 2.94 3.08 3.20 3.30 3.39 3.46 3.53 3.59 3.65 3.70 3.75 26 2.06 2.48 2.74 2.93 3.07 3.19 3.29 3.38 3.45 3.52 3.58 3.63 3.68 3.73 27 2.05 2.48 2.74 2.92 3.06 3.18 3.28 3.36 3.44 3.51 3.57 3.62 3.67 3.72 28 2.05 2.47 2.73 2.91 3.06 3.17 3.27 3.35 3.43 3.50 3.56 3.61 3.66 3.71 29 2.05 2.47 2.72 2.91 3.05 3.16 3.26 3.35 3.42 3.49 3.55 3.60 3.65 3.70 30 2.04 2.46 2.72 2.90 3.04 3.16 3.25 3.34 3.41 3.48 3.54 3.59 3.64 3.68 31 2.04 2.46 2.71 2.89 3.04 3.15 3.25 3.33 3.40 3.47 3.53 3.58 3.63 3.68 32 2.04 2.46 2.71 2.89 3.03 3.14 3.24 3.32 3.40 3.46 3.52 3.57 3.62 3.67 33 2.03 2.45 2.70 2.88 3.02 3.14 3.23 3.32 3.39 3.45 3.51 3.56 3.61 3.66 34 2.03 2.45 2.70 2.88 3.02 3.13 3.23 3.31 3.38 3.45 3.50 3.56 3.60 3.65 35 2.03 2.45 2.70 2.88 3.01 3.13 3.22 3.30 3.37 3.44 3.50 3.55 3.60 3.64 36 2.03 2.44 2.69 2.87 3.01 3.12 3.22 3.30 3.37 3.43 3.49 3.54 3.59 3.64 37 2.03 2.44 2.69 2.87 3.00 3.12 3.21 3.29 3.36 3.43 3.48 3.54 3.58 3.63 38 2.02 2.44 2.69 2.86 3.00 3.11 3.21 3.29 3.36 3.42 3.48 3.53 3.58 3.62 39 2.02 2.44 2.68 2.86 3.00 3.11 3.20 3.28 3.35 3.42 3.47 3.52 3.57 3.62 40 2.02 2.43 2.68 2.86 2.99 3.10 3.20 3.28 3.35 3.41 3.47 3.52 3.57 3.61 41 2.02 2.43 2.68 2.85 2.99 3.10 3.19 3.27 3.34 3.41 3.46 3.51 3.56 3.60 42 2.02 2.43 2.67 2.85 2.99 3.10 3.19 3.27 3.34 3.40 3.46 3.51 3.56 3.60 43 2.02 2.43 2.67 2.85 2.98 3.09 3.18 3.26 3.33 3.40 3.45 3.50 3.55 3.59 44 2.02 2.43 2.67 2.84 2.98 3.09 3.18 3.26 3.33 3.39 3.45 3.50 3.55 3.59 45 2.01 2.42 2.67 2.84 2.98 3.09 3.18 3.26 3.33 3.39 3.44 3.50 3.54 3.59 46 2.01 2.42 2.67 2.84 2.97 3.08 3.17 3.25 3.32 3.39 3.44 3.49 3.54 3.58 47 2.01 2.42 2.66 2.84 2.97 3.08 3.17 3.25 3.32 3.38 3.44 3.49 3.53 3.58 48 2.01 2.42 2.66 2.83 2.97 3.08 3.17 3.25 3.32 3.38 3.43 3.48 3.53 3.57 49 2.01 2.42 2.66 2.83 2.97 3.07 3.17 3.24 3.31 3.37 3.43 3.48 3.53 3.57 50 2.01 2.42 2.66 2.83 2.96 3.07 3.16 3.24 3.31 3.37 3.43 3.48 3.52 3.57
Statistical Table G: Critical values for the Spearman's rank correlation
This table gives two-tailed critical values of the Spearman rank correlation under the null hypothesis that the population correlation is zero. The critical value rS(α,n) defines the combined area α under both tails of the null distribution, where n is the sample size. The probability of a value greater than or equal to rS(α,n) or less than or equal to −rS(α,n) is α.
These critical values were obtained using the SuppDists package (Wheeler 2009) implemented in the statistical computer package R, according to the methods of Kendall and Smith (1939).
n α = 0.05 α = 0.01 5 0.900 — 6 0.943 1.000 7 0.821 0.929 8 0.762 0.881 9 0.700 0.833 10 0.648 0.782 11 0.618 0.755 12 0.587 0.720 13 0.560 0.692 14 0.538 0.670 15 0.521 0.645 16 0.503 0.626 17 0.485 0.610 18 0.472 0.593 19 0.458 0.578 20 0.447 0.564 21 0.434 0.550 22 0.425 0.539 23 0.415 0.528 24 0.406 0.516 25 0.398 0.506 26 0.389 0.497 27 0.382 0.487 28 0.375 0.479 29 0.368 0.471 30 0.362 0.464 31 0.356 0.456 32 0.350 0.449 33 0.345 0.443 34 0.339 0.436 35 0.334 0.430 36 0.329 0.424 37 0.325 0.419 38 0.320 0.413 39 0.316 0.408 40 0.312 0.403 41 0.308 0.398 42 0.305 0.393 43 0.301 0.389 44 0.297 0.385 45 0.294 0.380 46 0.291 0.376 47 0.288 0.372 48 0.285 0.369 49 0.282 0.365 50 0.279 0.361 51 0.276 0.358 52 0.273 0.354 53 0.271 0.351 54 0.268 0.348 55 0.266 0.345 56 0.263 0.342 57 0.261 0.339 58 0.259 0.336 59 0.256 0.333 60 0.254 0.330 61 0.252 0.327 62 0.250 0.325 63 0.248 0.322 64 0.246 0.320 65 0.244 0.317 66 0.242 0.315 67 0.241 0.313 68 0.239 0.310 69 0.237 0.308 70 0.235 0.306 71 0.234 0.304 72 0.232 0.302 73 0.230 0.300 74 0.229 0.298 75 0.227 0.296 76 0.226 0.294 77 0.224 0.292 78 0.223 0.290 79 0.221 0.288 80 0.220 0.286 81 0.219 0.285 82 0.217 0.283 83 0.216 0.281 84 0.215 0.280 85 0.213 0.278 86 0.212 0.276 87 0.211 0.275 88 0.210 0.273 89 0.208 0.272 90 0.207 0.270 91 0.206 0.269 92 0.205 0.267 93 0.204 0.266 94 0.203 0.264 95 0.202 0.263 96 0.201 0.262 97 0.200 0.260 98 0.199 0.259 99 0.198 0.258 100 0.197 0.257
1. Some other statistics books use a slightly different formula to calculate q (multiply our q by √2 to get the alternative version of the test statistic), which also leads to different critical values in statistical tables for the Tukey-Kramer distribution (multiply our critical values here by √2 to convert). Our formula is the same as that used in the computer statistics package R.
Literature cited
ABC News. 2006. Dwarfs better known than US justices: poll. http://www.abc.net.au/news/2006-0815/dwarfs-better-known-than-us-justices-poll/1239126. Accessed February 10, 2013.
Abd-El-Al, A. M., A. M. Bayoumy, and E. A. Abou Salem. 1997. A study on Demodex folliculorum in rosacea. Journal of the Egyptian Society of Parasitology 27: 183–195.
Adolph, S. C., and J. S. Hardin. 2007. Estimating phenotypic correlations: correcting for bias due to intraindividual variability. Functional Ecology 21: 178–184.
Agresti, A. 2002. Categorical Data Analysis. Hoboken, NJ: Wiley.
Agresti, A., and B. A. Coull. 1998. Approximate is better than "exact" for interval estimation of binomial proportions. American Statistician 52: 119–126.
Altman, D. G., and J. M. Bland. 1998. Generalisation and extrapolation. British Medical Journal 317: 409–410.
Alvarez, G., F. C. Ceballos, and C. Quinteiro. 2009. The role of inbreeding in the extinction of a European royal dynasty. PLoS ONE 4(4): e5174.
American Society of Microbiology. 2005. Women Better at Hand Hygiene Habits, Hands Down. Press release, September 25. http://www.asm.org/index.php/component/content/article/92-news-room/pressreleases/1831-women-better-at-hand-hygiene-habits-hands-down. Accessed December 3, 2013.
Anderson, R. N. 2001. Deaths: Leading Causes for 1999. National vital statistics reports, vol. 49, no. 11. Hyattsville, MD: National Center for Health Statistics.
Don’t put all your eggs in one nest: spread them and cut time at risk. American Naturalist 180: 354 –363. Andrade, M. C. B. 1996. Sexual selection for male sacrifice in the Australian redback spider. Science 271: 70 –72. Anstey, M. L., S. M. Rogers, S. R. Ott, M. Burrows, and S. J. Simpson. 2009. Serotonin mediates behavioral gregarization underlying swarm formation in desert locusts. Science 323: 627– 630. Antiplatelet Trialists’ Collaboration. 1994. Collaborative overview of randomized trials of antiplatelet therapy—I: Prevention of death, myocardial infarction, and stroke by prolonged antiplatelet therapy in various categories of patients. British Medical Journal 308: 81–106. Aparicio, S., et al. 2002. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297: 1301–1310. Apiron, D., and D. Zohary. 1961. Chlorophyll lethal in natural populations of the orchard grass (Dactylis glomerata L.). A case of balanced polymorphism in plants. Genetics 46: 393 –399. Arnqvist, G., M. Edvardsson, U. Friberg, and T. Nilsson. 2000. Sexual conflict promotes speciation in insects. Proceedings of the National Academy of Sciences (USA) 97: 10460 –10464. Attwood, A. S., N. E. Scott-Samuel, G. Stothart, and M. R. Munafò. 2012. Glass shape influences consumption rate for alcoholic beverages. PLoS ONE 7: e43007. Avon, N. 2009. Are men or women more likely to be hit by lightning? http://www.popsci.com/scitech/article/2009-09/are-men-or-women-more-likely-be-hit-lightning. Accessed February 24, 2013. Balazs, A. B., et al. 2011. Antibody-based protection against HIV infection by vectored immunoprophylaxis. Nature 481: 81–84. Baldwin, B. G., and M. J. Sanderson. 1998. Age and rate of diversification of the Hawaiian silversword alliance (Compositae). Proceedings of the National Academy of Sciences (USA) 95: 9402–9406. Banks, T., and J. M. Dabbs Jr. 1996. Salivary testosterone and cortisol in a delinquent and violent urban subculture. Journal of Social Psychology 136: 49–56. Barber, V. A., G. P. Juday, and B. P. Finney. 2000. Reduced growth of Alaskan white spruce in the twentieth century from temperature-induced drought stress. Nature 405: 668 – 673. Barker-Plotkin, A., D. Foster, A. Lezberg, and W. Lyford. 2006. Lyford mapped tree plot. http://harvardforest.fas.harvard.edu/data/p03/hf032/HF032-data.html. Accessed May 26, 2006. Barnes, A. I., S. Wigby, J. M. Boone, L. Partridge, and T. Chapman. 2008. Feeding, fecundity and lifespan in female Drosophila melanogaster. Proceedings of the Royal Society of London, Series B: Biological Sciences 275: 1675–1683. Barnes, I., A. Duda, O. G. Pybus, and M. G. Thomas. 2011. Ancient urbanization predicts genetic resistance to tuberculosis. Evolution 65: 842–848. Barss, P. 1984. Injuries due to falling coconuts. Journal of Trauma 24: 990 –991. Bass, M. S., et al. 2010. Global conservation significance of Ecuador’s Yasuní National Park. PLoS ONE 5: e8767. Beall, C. M., et al. 2002. An Ethiopian pattern of human adaptation to high-altitude hypoxia. Proceedings of the National Academy of Sciences (USA) 99: 17215–17218. Beath, D. D. 1996. Pollination of Amorphophallus johnsonii (Araceae) by carrion beetles (Phaeochrous amplus) in a Ghanaian rain forest. Jou