
Statistics II For Dummies
Ebook, 766 pages, 6 hours


Rating: 3.5 out of 5 stars




About this ebook

Continue your statistics journey with this all-encompassing reference 

Completed Statistics through standard deviations, confidence intervals, and hypothesis testing? Then you’re ready for the next step: Statistics II. And there’s no better way to tackle this challenging subject than with Statistics II For Dummies! Get a brief overview of Statistics I in case you need to brush up on earlier topics, and then dive into a full explanation of all Statistics II concepts, including multiple regression, analysis of variance (ANOVA), Chi-square tests, nonparametric procedures, and analyzing large data sets. By the end of the book, you’ll know how to use all the statistics tools together to create a great story about your data.

For each Statistics II technique in the book, you get an overview of when and why it’s used, how to know when you need it, step-by-step directions on how to do it, and tips and tricks for working through the solution. You also find:  

  • What makes each technique distinct and what the results say 
  • How to apply techniques in real life 
  • An interpretation of the computer output for data analysis purposes 
  • Instructions for using Minitab to work through many of the calculations 
  • Practice with a lot of examples 

With Statistics II For Dummies, you will find even more techniques to analyze a set of data. Get a head start on your Statistics II class, or use this in conjunction with your textbook to help you thrive in statistics! 

Release date: Oct 12, 2021

Reviews for Statistics II For Dummies

Rating: 3.5 out of 5 stars

16 ratings, 1 review


  • Rating: 3 out of 5 stars
    I bought Statistics for Dummies to help with the statistical portion of my Master's thesis. Somehow, I had managed to get through college and grad school without taking a statistics course. Unfortunately, this book was almost no help with learning statistics at all. The reason: it isn't intended to help you do statistics; it is intended to help you interpret them. It does a very good job at its real purpose: helping you make sense of the statistics bandied about in the new media.

    Journalists tend to report on relative risk because it is easy to state and can sound impressive. For example: Say one person per billion in the population at large typically experiences having their brains blow out the back of their head when they sneeze. Now say that two people per billion have that happen when they are filling up their cars with premium fuel, but there is no difference in people who fill up their cars with regular. That means you are 100% more likely to sneeze and blow out the back of your head while filling your car with premium. So you should never use premium fuel! Right?

    What journalists would ignore in the previous fallacious scenario is that your actual risk is only two in a billion. But a 100% increase in risk sounds a lot more interesting and scary, doesn't it? Sigh.

    The book is very readable and even humorous at times. Humor is a major accomplishment in a subject as dry as this one. One of the most important lessons it teaches is to distrust relative risk comparisons.

Book preview

Statistics II For Dummies - Deborah J. Rumsey


So you’ve gone through some of the basics of statistics. Means, medians, and standard deviations all ring a bell. You know about surveys and experiments and the basic ideas of correlation and simple regression. You’ve studied probability, margin of error, and a few hypothesis tests and confidence intervals. Are you ready to load your statistical toolbox with a new level of tools? Statistics II For Dummies, 2nd Edition, picks up right where Statistics For Dummies, 2nd Edition, (John Wiley & Sons) leaves off and keeps you moving along the road of statistical ideas and techniques in a positive, step-by-step way.

The focus of Statistics II For Dummies, 2nd Edition, is on finding more ways of analyzing data. I provide step-by-step instructions for using techniques such as multiple regression, nonlinear regression, one-way and two-way analysis of variance (ANOVA), and Chi-square tests, and I give you some practice with big data sets, which are all the rage right now. Using these new techniques, you estimate, investigate, correlate, and congregate even more variables based on the information at hand, and you see how to put the tools together to create a great story about your data (nonfiction, I hope!).

About This Book

This book is designed for those who have completed the basic concepts of statistics through confidence intervals and hypothesis testing (found in Statistics For Dummies, 2nd Edition) and are ready to plow ahead to get through the final part of Stats I, or to tackle Stats II. However, I do pepper in some brief overviews of Stats I as needed, just to remind you of what was covered and to make sure you’re up to speed. For each new technique, you get an overview of when and why it’s used, how to know when you need it, step-by-step directions on how to apply it, and tips and tricks from a seasoned data analyst (yours truly). Because it’s very important to be able to know which method to use when, I emphasize what makes each technique distinct and what the results tell you. You will also see many applications of the techniques used in real life.

I also include interpretation of computer output for data analysis purposes. I show you how to use the software to get the results, but I focus more on how to interpret the results found in the output, because you’re more likely to be interpreting this kind of information than doing the programming specifically. Because the equations and calculations can get too involved if you are solving them by hand, you often use a computer to get your results. I include instructions for using Minitab to conduct many of the calculations in this book. Most statistics teachers who cover these topics use this approach as well. (What a relief!)

This book is different from the other Stats II books in many ways. Notably, this book features the following:

Full explanations of Stats II concepts. Many statistics textbooks squeeze all the Stats II topics at the very end of their Stats I coverage; as a result, these topics tend to get condensed and presented as if they’re optional. But no worries; I take the time to clearly and fully explain all the information you need to survive and thrive.

Dissection of computer output. Throughout the book, I present many examples that use statistical software to analyze the data. In each case, I present the computer output and explain how I got it and what it means.

An extensive number of examples. I include plenty of examples to cover the many different types of problems you’ll face. Some examples are short, and some are quite extensive and include multiple variables.

Lots of tips, strategies, and warnings. I share with you some trade secrets, based on my experience teaching and supporting students and grading their papers.

Understandable language. I try to keep things conversational to help you understand, remember, and put into practice statistical definitions, techniques, and processes.

Clear and concise, step-by-step procedures. In most chapters, you can find steps that intuitively explain how to work through Stats II problems — and remember how to do it on your own later on.

Throughout this book, I’ve used several conventions that I want you to be aware of:

I indicate multiplication by using a times sign, indicated by a lowered asterisk *.

I indicate the null and alternative hypotheses as Ho (for the null hypothesis) and Ha (for the alternative hypothesis).

The statistical software package I use and display throughout the book is Minitab 18, but I simply refer to it as Minitab.

Whenever I introduce a new term, I italicize it.

Keywords and numbered steps appear in boldface.

At times I get into some of the more technical details of formulas and procedures for those individuals who may need to know about them — or just really want to get the full story. These minutiae are marked with a Technical Stuff icon. I also include sidebars along with the essential text, usually in the form of a real-life statistics example or some bonus information you may find interesting. You can feel free to skip those icons and sidebars because you won’t miss any of the main information you need (but by reading them, you may just be able to impress your stats professor with your above-and-beyond knowledge of Stats II!).

Foolish Assumptions

Because this book deals with Stats II, I assume you have one previous course in introductory statistics under your belt (or at least have read Statistics For Dummies, 2nd Edition), with topics taking you up through the Central Limit Theorem and perhaps an introduction to confidence intervals and hypothesis tests (although I review these concepts briefly in Chapter 4). Prior experience with simple linear regression isn’t necessary. Only college algebra is needed for the math details. Some experience using statistical software is also a plus, but not required.

As a student, you may be covering these topics in one of two ways: either at the tail end of your Stats I course (perhaps in a hurried way, but in some way nonetheless); or through a two-course sequence in statistics in which the topics in this book are the focus of the second course. If so, this book provides you the information you need to do well in those courses.

You may simply be interested in Stats II from an everyday point of view, or perhaps you want to add to your understanding of studies and statistical results presented in the media. If this sounds like you, you can find plenty of real-world examples and applications of these statistical techniques in action, as well as cautions for interpreting them.

Icons Used in This Book

I use icons in this book to draw your attention to certain text features that occur on a regular basis. Think of the icons as road signs that you encounter on a trip. Some signs tell you about shortcuts, and others offer more information that you may need; some signs alert you to possible warnings, while others leave you with something to remember.

Computer Output When you see this icon, it means I’m explaining how to carry out that particular data analysis using Minitab. I also explain the information you get in the computer output so you can interpret your results.

Remember I use this icon to reinforce certain ideas that are critical for success in Stats II, such as things I think are important to review as you prepare for an exam.

Technical Stuff When you see this icon, you can skip over the information if you don’t want to get into the nitty-gritty details. These details exist mainly for people who have a special interest or obligation to know more about the technical aspects of certain statistical issues.

Tip This icon points to helpful hints, ideas, or shortcuts that you can use to save time; it also includes alternative ways to think about a particular concept.

Warning I use warning icons to help you stay away from common misconceptions and pitfalls you may face when dealing with ideas and techniques related to Stats II.

Beyond the Book

In addition to all the great content included in the book itself, you can find even more content online. Check out this book’s online Cheat Sheet on dummies.com. It covers the major formulas needed for Statistics II. You can access it by going to www.dummies.com and then typing Statistics II For Dummies Cheat Sheet into the search bar.

I’ve also included two major data sets that are analyzed in Chapters 20 and 21, so you can follow along with me or do your own analysis (not required!). Go to www.dummies.com/go/statisticsIIfd2e to access these files.

Where to Go from Here

This book is written in a nonlinear way, so you can start anywhere and still understand what’s happening. However, I can make some recommendations if you want some direction on where to start.

If you’re thoroughly familiar with the ideas of hypothesis testing and simple linear regression, start with Chapter 5 (multiple regression). Use Chapter 1 if you need a reference for the jargon that statisticians use in Stats II.

If you’ve covered all topics up through the various types of regression (simple, multiple, nonlinear, and logistic) or a subset of those as your professor deemed important, proceed to Chapter 10, the basics of analysis of variance (ANOVA).

Chapter 15 is the place to begin if you want to tackle categorical (qualitative) variables before hitting the quantitative stuff. You can work with the Chi-square test there.

Nonparametric statistics are presented starting in Chapter 17. Start there if you want the full details on the most common nonparametric procedures, used when you do not necessarily have an assumed distribution (for example, a normal).

If you want to see a bunch of Stats II ideas put into practice right off the bat, head to Chapter 19 where I discuss a multi-stage approach to analyzing a big data set, or Chapter 21, where you look into a big data set on refrigerators and see how it’s analyzed in a multi-stage approach.

Part 1

Tackling Data Analysis and Model-Building Basics


Understand why data analysis is both a science and an art.

Make sure you use the right type of analysis for the job.

Work with the normal and binomial distributions.

Reacquaint yourself with confidence intervals and hypothesis tests.

Chapter 1

Beyond Number Crunching: The Art and Science of Data Analysis


  • Realizing your role as a data analyst
  • Avoiding statistical faux pas
  • Delving into the jargon of Stats II

Because you’re reading this book, you’re likely familiar with the basics of statistics and you’re ready to take it up a notch. That next level involves using what you know, picking up a few more tools and techniques, and finally putting it all to use to help you answer more realistic questions by using real data. In statistical terms, you’re ready to enter the world of the data analyst.

In this chapter, you review the terms involved in statistics as they pertain to data analysis at the Stats II level. You get a glimpse of the impact that your results can have by seeing what these analysis techniques can do. You also gain insight into some of the common misuses of data analysis and their effects.

Data Analysis: Looking before You Crunch

It used to be that statisticians were the only ones who really analyzed data because the only computer programs available were very complicated to use, requiring a great deal of knowledge about statistics to set up and carry out analyses. The calculations were tedious and at times unpredictable, and they required a thorough understanding of the theories and methods behind the calculations to get correct and reliable answers.

Today, anyone who wants to analyze data can do it easily. Many user-friendly statistical software packages are made expressly for that purpose — Microsoft Excel, Minitab, and SAS are just a few. Free online programs are available, too, such as R, which helps you do just what it says — crunch your numbers and get an answer.

Each software package has its own pros and cons (and its own users and protesters). My software of choice and the one I reference throughout this book is Minitab, because it’s very easy to use, the results are precise, and the software’s loaded with all the data-analysis techniques used in Stats II. Although a site license for Minitab isn’t cheap, the student version is available for rent for only a few bucks a semester.

Remember The most important idea when applying statistical techniques to analyze data is to know what’s going on behind the number crunching so you (not the computer) are in control of the analysis. That’s why knowledge of Stats II is so critical.

Warning Many people don’t realize that statistical software can’t tell you when and when not to use a certain statistical technique. You have to determine that on your own. As a result, people think they’re doing their analyses correctly, but they can end up making all kinds of mistakes. In the following sections, I give examples of some situations in which innocent data analyses can go wrong and why it’s important to spot and avoid these mistakes before you start crunching numbers.

Bottom line: Today’s software packages really are too good to be true if you don’t have a clear and thorough understanding of the Stats II that’s beneath the surface.

Nothing (not even a straight line) lasts forever

Bill Prediction is a statistics student who is studying the effect of study time on a student’s exam score. Bill collects data on statistics students and uses his trusty software package to predict exam scores based on study time. His computer comes up with the equation y = 10x + 30, where y represents the test score you get if you study for a certain number of hours (x). Notice that this model is the equation of a straight line with a y-intercept of 30 and a slope of 10.

So using this model, Bill predicts that if you don’t study at all, you’ll get a 30 on the exam (plugging x = 0 into the equation and solving for y; this point represents the y-intercept of the line). He also predicts, using this model, that if you study for 5 hours, you’ll get an exam score of y = 10 * 5 + 30 = 80. So, the point (5, 80) is also on this line.

But then Bill goes a little crazy and wonders what would happen if you studied for 40 hours (because it always seems that long when he’s studying). The computer tells him that if he studies for 40 hours, his test score is predicted to be 10 * 40 + 30 = 430 points. Wow, that’s a lot of points! Problem is, the exam only goes up to a total of 100 points. Bill wonders where his computer went wrong.

But Bill puts the blame in the wrong place. He needs to remember that there are limits on the values of x that make sense in this equation. For example, because x is the amount of study time, x can never be a number less than zero. If you plug a negative number in for x, say x = –10, you get y = 10 * (–10) + 30 = –70, which makes no sense. However, the equation itself doesn’t know that, nor does the computer that found it. The computer simply graphs the line you give it, assuming it’ll go on forever in both the positive and negative directions.

Warning After you get a statistical equation or model, you need to specify for what values the equation applies. Equations don’t know when they work and when they don’t; it’s up to the data analyst to determine that. This idea is the same for applying the results of any data analysis that you do.
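To make the idea concrete, here is a minimal Python sketch of Bill's line with its sensible range of x spelled out. The function name and the upper limit of 7 hours are assumptions made for illustration (7 hours is simply where the line reaches the 100-point maximum); in a real analysis, the valid range would come from the study-time values actually observed in the data.

```python
def predict_exam_score(hours, min_hours=0.0, max_hours=7.0):
    """Predict an exam score from study time with the line y = 10x + 30.

    The limits min_hours and max_hours are hypothetical: they mark the
    range of x over which the analyst is willing to trust the line.
    """
    if not (min_hours <= hours <= max_hours):
        raise ValueError(
            f"{hours} hours is outside the range "
            f"[{min_hours}, {max_hours}] where the model applies"
        )
    return 10 * hours + 30


print(predict_exam_score(0))  # the y-intercept: 30
print(predict_exam_score(5))  # the point (5, 80)

# Asking about 40 hours or -10 hours now fails loudly instead of
# quietly returning 430 or -70.
for bad_hours in (40, -10):
    try:
        predict_exam_score(bad_hours)
    except ValueError as err:
        print("Refused:", err)
```

The equation still doesn't know when it works; the range check is the analyst's judgment written down where the computer can enforce it.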

Data snooping isn’t cool

Warning Statisticians have come up with a saying that you may have heard: Figures don’t lie. Liars figure. Make sure that you find out about all the analyses that were performed on a data set, not just the ones reported as being statistically significant.

Suppose Bill Prediction (from the previous section) decides to try to predict scores on a biology exam based on study time, but this time his model doesn’t fit. Not one to give in, Bill insists there must be some other factors that predict biology exam scores besides study time, and he sets out to find them.

Bill measures everything from soup to nuts. His set of 20 possible variables includes study time, GPA, previous experience in statistics, math grades in high school, and whether you chew gum during the exam. After his multitude of various correlation analyses, the variables that Bill finds to be related to exam score are study time, math grades in high school, GPA, and gum chewing during the exam. It turns out that this particular model fits pretty well (by criteria I discuss in Chapter 6 on multiple linear regression models).

But here’s the problem: By looking at all possible correlations between his 20 variables and the exam score, Bill is actually doing 20 separate statistical analyses. Under typical conditions that I describe in Chapter 4, each statistical analysis has a 5 percent chance of being wrong just by chance. I bet you can guess which one of Bill’s correlations likely came out wrong in this case. And hopefully, he didn’t rely on a stick of gum to boost his grade in biology.
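A quick calculation shows how fast those 5 percent chances pile up. The Python sketch below assumes that the 20 analyses are independent of one another and that none of the variables is truly related to the exam score; both are simplifying assumptions, and the function name is made up for illustration.

```python
def prob_at_least_one_false_alarm(num_tests, alpha=0.05):
    """Chance that at least one of num_tests independent tests comes out
    'significant' purely by chance, when each test has error rate alpha."""
    return 1 - (1 - alpha) ** num_tests


num_tests = 20
print("Expected number of false alarms:", num_tests * 0.05)
print("Chance of at least one false alarm:",
      round(prob_at_least_one_false_alarm(num_tests), 3))
```

Under those assumptions, Bill should expect about one spurious "finding" among his 20 correlations, and the chance of getting at least one is roughly 64 percent.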

Warning Looking at data until you find something in it is called data snooping. Data snooping results in giving the researcher his five minutes of fame but then leads him to lose all credibility because no one can repeat his results.

No (data) fishing allowed

Some folks just don’t take no for an answer, and when it comes to analyzing data, that can lead to trouble.

Sue Gonnafindit is a determined researcher. She believes that her horse can count by stomping his foot. (For example, she says 2 and her horse stomps twice.) Sue collects data on her horse for four weeks, recording the percentage of time the horse gets the counting right. She runs the appropriate statistical analysis on her data and is shocked to find no significant difference between her horse’s results and those you would get simply by guessing.

Determined to prove her results are real, Sue looks for other types of analyses that exist and plugs her data into anything and everything she can find (never mind that those analyses are inappropriate to use in her situation). Using the famous hunt-and-peck method, at some point she eventually stumbles upon a significant result. However, the result is bogus because she tried so many analyses that weren’t appropriate and ignored the results of the appropriate analysis because it didn’t tell her what she wanted to hear.

Funny thing, too. When Sue went on a late-night TV program to show the world her incredible horse, someone in the audience noticed that whenever the horse got to the correct number of stomps, Sue would interrupt him and say “Good job!” and the horse quit stomping. He didn’t know how to count; all he knew to do was to quit stomping when she said, “Good job!”

Warning Redoing analyses in different ways in order to try to get the results you want is called data fishing, and folks in the stats biz consider it to be a major no-no. (However, people unfortunately do it all too often to verify their strongly held beliefs.) By using the wrong data analysis for the sake of getting the results you desire, you mislead your audience into thinking that your hypothesis is actually correct when it may not be.

Getting the Big Picture: An Overview of Stats II

Stats II is an extension of Stats I (introductory statistics), so the jargon follows suit and the techniques build on what you already know. In this section, you get an introduction to the terminology you use in Stats II along with a broad overview of the techniques that statisticians use to analyze data and find the story behind it. (If you’re still unsure about some of the terms from Stats I, you can consult your Stats I textbook or see my other book, Statistics For Dummies, 2nd Edition [Wiley], for a complete rundown.)

Population parameter

Remember A parameter is a number that summarizes the population, which is the entire group you’re interested in investigating. Examples of parameters include the mean of a population, the median of a population, or the proportion of the population that falls into a certain category.

Suppose you want to determine the average length of a cellphone call among teenagers (ages 13–18). You’re not interested in making any comparisons; you just want to make a good guesstimate of the average time. So you want to estimate a population parameter (such as the mean or average). The population is all cellphone users between the ages of 13 and 18 years old. The parameter is the average length of a phone call this population makes.

Sample statistic

Typically you can’t determine population parameters exactly; you can only estimate them. But all is not lost; by taking a representative sample (a well-chosen subset of individuals) from the population and studying it, you can come up with a good estimate of the population parameter. A sample statistic is a single number that summarizes that subset.

For example, in the cellphone scenario from the previous section, you select a sample of teenagers and measure the duration of their cellphone calls over a period of time (or look at their cellphone records if you can gain access legally). You take the average of the cellphone call duration. For example, the average duration of 100 cellphone calls may be 12.2 minutes — this average is a statistic. This particular statistic is called the sample mean because it’s the average value from your sample data.

Many different statistics are available to study different characteristics of a sample, such as the proportion, the median, and standard deviation.

Confidence interval

A confidence interval is a range of likely values for a population parameter. A confidence interval is based on a sample and the statistics that come from that sample. The main reason you want to provide a range of likely values rather than a single number is that sample results vary.

For example, suppose you want to estimate the percentage of people who eat chocolate. According to the Simmons Market Research Bureau, 78 percent of adults reported eating chocolate, and of those, 18 percent admitted eating sweets frequently. What’s missing in these results? These numbers are only from a single sample of people, and those sample results are guaranteed to vary from sample to sample. You need some measure of how much you can expect those results to move if you were to repeat the study.

This expected variation in your statistic from sample to sample is measured by the margin of error, which reflects a certain number of standard deviations of your statistic that you add and subtract to have a certain confidence in your results (see Chapter 4 for more on margin of error). If the chocolate-eater results were based on 1,000 people, the margin of error would be approximately 3 percent. This means the actual percentage of people who eat chocolate in the entire population is expected to be 78 percent, ± 3 percent (that is, between 75 percent and 81 percent).
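As a rough check on that 3 percent figure, here is a small Python sketch of the usual large-sample margin of error for a proportion at 95 percent confidence (a z-value of 1.96). The function name is hypothetical, and the formula assumes a simple random sample.

```python
import math


def margin_of_error(p_hat, n, z=1.96):
    """Margin of error for a sample proportion p_hat based on n people.

    z = 1.96 corresponds to 95 percent confidence.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


# Using the sample result itself (78 percent of 1,000 people)
print(round(100 * margin_of_error(0.78, 1000), 1), "percentage points")

# Using the most conservative case, p_hat = 0.5
print(round(100 * margin_of_error(0.50, 1000), 1), "percentage points")
```

The first calculation gives about 2.6 percentage points and the conservative one about 3.1, so either way you land near the "approximately 3 percent" quoted above.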

Hypothesis test

A hypothesis test is a statistical procedure that you use to test an existing claim about the population, using your data. The claim is noted by Ho (the null hypothesis). If your data support the claim, you fail to reject Ho. If your data don’t support the claim, you reject Ho and conclude an alternative hypothesis, Ha. The reason most people conduct a hypothesis test is not to merely show that their data support an existing claim, but rather to show that the existing claim is false, in favor of the alternative hypothesis.

The Pew Research Center studied the percentage of people who turn to ESPN for their sports news. Its statistics, based on a survey of about 1,000 people, found that in 2000, 23 percent of people said they went to ESPN; in 2020, only 20.9 percent reported going to ESPN. The question is this: Does this 2.1 percent reduction in viewers represent a significant trend that ESPN should worry about?

To test these differences formally, you can set up a hypothesis test. You set up your null hypothesis as the result you have to believe without your study, Ho: No difference exists between 2000 and 2020 data for ESPN viewership. Your alternative hypothesis (Ha) is that a difference is there. To run a hypothesis test, you look at the difference between your statistic from your data and the claim that has been already made about the population (in Ho), and you measure how far apart they are in units of standard deviations.

With respect to the example, using the techniques from Chapter 4, the hypothesis test shows that 23 percent and 20.9 percent aren’t far enough apart in terms of standard deviations to dispute the claim (Ho). You can’t say the percentage of viewers of ESPN in the entire population changed from 2000 to 2020.
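One common way to measure "how far apart in standard deviations" for two percentages is a two-proportion z-test, sketched below in Python. This is an illustration, not necessarily the exact procedure used in Chapter 4, and it assumes two independent surveys of about 1,000 people each (the text gives only one approximate sample size). The function names are made up.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """z-statistic for the difference between two sample proportions,
    using the pooled proportion under Ho (no difference)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / std_error


def two_sided_p_value(z):
    """Two-sided p-value for a z-statistic from the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


z = two_proportion_z(0.23, 1000, 0.209, 1000)  # 2000 versus 2020
print("z =", round(z, 2))
print("p-value =", round(two_sided_p_value(z), 2))
```

Under these assumptions, z comes out a little above 1, well short of the roughly 2 standard deviations usually needed to reject Ho at the 5 percent level, which agrees with the conclusion above.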

Remember As with any statistical analysis, your conclusions can be wrong just by chance, because your results are based on sample data, and sample results vary. In Chapter 4, I discuss the types of errors that can be made in conclusions from a hypothesis test.

Analysis of variance (ANOVA)

ANOVA is the acronym for analysis of variance. You use ANOVA in situations where you want to compare the means of more than two populations. For example, say you want to compare the lifetimes of four brands of tires in number of miles. You take a random sample of 50 tires from each group, for a total of 200 tires, and set up an experiment to compare the lifetime of each tire, and record it. You now have four means and four standard deviations, one for each data set.

Then, to test for differences in average lifetime for the four brands of tires, you basically compare the variability between the four data sets to the variability within the entire data set, using a ratio. This ratio is called the F-statistic. If this ratio is large, the variability between the brands is more than the variability within the brands, giving evidence that not all the means are the same for the different tire brands. If the F-statistic is small, not enough difference exists between the treatment means compared to the general variability within the treatments (here the brands) themselves. In this case, you can’t say that the means are different for the groups. (I give you the full scoop on ANOVA plus all the jargon, formulas, and computer output in Chapters 10 and 11.)
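The ratio just described can be worked out by hand in a few lines of Python. The sketch below uses a tiny, made-up data set (four tires per brand, lifetimes in thousands of miles) purely for illustration; the brand names, the numbers, and the function name are all hypothetical.

```python
from statistics import mean


def one_way_f_statistic(groups):
    """F-statistic for one-way ANOVA: variability between the group
    means divided by variability within the groups."""
    k = len(groups)                              # number of groups
    n_total = sum(len(g) for g in groups)        # total observations
    grand_mean = mean([x for g in groups for x in g])

    # Between-group sum of squares: how far each group mean sits
    # from the grand mean, weighted by group size
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

    # Within-group sum of squares: spread of each value around
    # its own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical tire lifetimes (thousands of miles) for four brands
tire_lifetimes = {
    "Brand A": [40, 42, 41, 43],
    "Brand B": [38, 37, 39, 38],
    "Brand C": [41, 40, 42, 41],
    "Brand D": [45, 44, 46, 45],
}
f_stat = one_way_f_statistic(list(tire_lifetimes.values()))
print("F-statistic:", round(f_stat, 2))
```

A large F-statistic means the brand means differ by more than the tire-to-tire variation within brands would explain. Deciding how large is "large" requires comparing it to an F-distribution, which statistical software does for you.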

Multiple comparisons

Suppose you conduct ANOVA, and you find a difference in the average lifetimes of the four brands of tire (see the preceding section). Your next questions would probably be, “Which brands are different?” and “How different are they?” To answer these questions, you use multiple-comparison procedures.

A multiple-comparison procedure is a statistical technique that compares means to each other and finds out which ones are different and which ones aren’t. With this information, you’re able to put the groups in order from those with the largest mean to those with the smallest mean, realizing that sometimes two or more groups were too close to tell and are placed together in a group.

Many different multiple-comparison procedures exist to compare individual means and come up with an ordering in the event that your F-statistic does find that some difference exists. Some of the multiple-comparison procedures include Tukey’s test, LSD (least significant difference), and pairwise t-tests. Some procedures are better than others, depending on the conditions and your goal as a data analyst. I discuss multiple-comparison procedures in detail in Chapter 11.

Warning Never take that second step to compare the means of the groups if the ANOVA procedure doesn’t find any significant results during the first step. Computer software will never stop you from doing a follow-up analysis, even if it’s wrong to do so.

Interaction effects

An interaction effect in statistics operates the same way that it does in the world of medicine. Sometimes if you take two different medicines at the same time, the combined effect is much different than if you were to take the two individual medications separately.

Remember Interaction effects can come up in statistical models that use two or more variables to explain or compare outcomes. In this case you can’t automatically study the effect of each variable separately; you have to first examine whether or not an interaction effect is present.

For example, suppose medical researchers are studying a new drug for depression and want to know how this drug affects the change in blood pressure for a low dose versus a high dose. They also compare the effects for children versus adults. It could also be that dosage level affects the blood pressure of adults differently than the blood pressure of children. This type of model is called a two-way ANOVA model, with a possible interaction effect between the two factors (age group and dosage level). Chapter 12 covers this subject in depth.


Correlation

The term correlation is often misused. Statistically speaking, the correlation measures the strength and direction of the linear relationship between two quantitative variables (variables that represent counts or measurements only).

Remember: You aren’t supposed to use correlation to talk about relationships unless the variables are quantitative. For example, it’s wrong to say that a correlation exists between eye color and hair color. (In Chapter 14, you explore associations between two categorical variables.)

Correlation is a number between –1.0 and +1.0. A correlation of +1.0 indicates a perfect positive relationship; as you increase one variable, the other one increases in perfect sync. A correlation of –1.0 indicates a perfect negative relationship between the variables; as one variable increases, the other one decreases in perfect sync. A correlation of zero means you found no linear relationship at all between the variables. Most correlations in the real world fall somewhere in between –1.0 and +1.0; the closer to –1.0 or +1.0, the stronger the relationship is; the closer to 0, the weaker the relationship is.
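To see where the two extremes come from, here’s a minimal Python sketch that computes a correlation with the standard Pearson formula (the formula itself isn’t shown in this section, so treat it as an assumption about the calculation). The function name and the three small lists are made up for illustration; they aren’t data from the book.

```python
from math import sqrt

def correlation(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Made-up values for illustration only (not data from the book)
x_values = [1, 2, 3, 4, 5]
uphill = [2, 4, 6, 8, 10]      # rises in perfect sync with x_values
downhill = [10, 8, 6, 4, 2]    # falls in perfect sync with x_values

r_up = correlation(x_values, uphill)
r_down = correlation(x_values, downhill)
print(round(r_up, 3), round(r_down, 3))
```

Real data rarely lines up this neatly, which is why most correlations you meet land somewhere between the two extremes.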

Figure 1-1 shows a plot of the number of coffees sold at football games in Buffalo, New York, as well as the air temperature (in degrees Fahrenheit) at each game. This data set seems to follow a downhill straight line fairly well, indicating a negative correlation. The correlation turns out to be –0.741; the number of coffees sold has a fairly strong negative relationship with the temperature of the football game. This makes sense because on days when the temperature is low, people get cold and want more coffee. I discuss correlation further, as it applies to model building, in Chapter 5.

FIGURE 1-1: Coffees sold at various air temperatures on football game day.

Linear regression

After you’ve found a correlation and determined that two variables have a fairly strong linear relationship, you may want to try to make predictions for one variable based on the value of the other variable. For example, if you know that a fairly strong negative linear relationship exists between coffees sold and the air temperature at a football game (see the previous section), you may want to use this information to predict how much coffee is needed for a game, based on the temperature. To do that, you find the straight line that best fits the data and use it to make the prediction. This method of finding the best-fitting line is called linear regression.

Many different types of regression analyses exist, depending on your situation. When you use only one variable to predict the response, the method of regression is called simple linear regression (see Chapter 5). Simple linear regression is the best known of all the regression analyses and is a staple in the Stats I course sequence.

However, you use other flavors of regression for other situations.

If you want to use more than one variable to predict a response, you use multiple linear regression (see Chapter 6).

If you want to make predictions about a variable that has only two outcomes, yes or no, you use logistic regression (see Chapter 9).

For relationships that don’t follow a straight line, you have a technique called (no surprise) nonlinear regression (see Chapter 8).
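Going back to the simplest flavor, simple linear regression, the following Python sketch shows one common way to find a best-fitting line: the least-squares formulas for the slope and intercept. The temperatures and coffee counts are invented for illustration (they are not the Buffalo data behind Figure 1-1), and the function names are hypothetical.

```python
def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented numbers for illustration only
temps = [20, 30, 40, 50, 60]      # air temperature, degrees Fahrenheit
coffees = [80, 70, 55, 45, 30]    # coffees sold at each game

slope, intercept = fit_line(temps, coffees)

def predict(temp):
    """Predicted coffees sold at a given temperature, using the fitted line."""
    return intercept + slope * temp

print(slope, intercept, predict(35))
```

The negative slope matches the downhill pattern: in this made-up data, each extra degree goes with about 1.25 fewer coffees sold.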

Chi-square tests

Correlation and regression techniques all assume that the variable being studied in most detail (the response variable) is quantitative — that is, the variable measures or counts something. You can also run into situations where the data being studied isn’t quantitative, but rather categorical — that is, the data represents categories, not measurements or counts. To study relationships in categorical data, you use a Chi-square test for independence. If the variables are found to be unrelated, they’re declared independent. If they’re found to be related, they’re declared dependent.

Suppose you want to explore the relationship between age group and eating breakfast. Because each of these variables is categorical, or qualitative, you use a Chi-square test for independence. You survey 70 adults and 70 children and find that 25 adults eat breakfast and 45 do not; for the children, 35 do eat breakfast and 35 do not. Table 1-1 organizes this data and sets you up for the Chi-square test for this scenario.

TABLE 1-1 Table Setup for the Breakfast and Age Group Question

              Eat Breakfast    Don’t Eat Breakfast    Total
Adults        25               45                     70
Children      35               35                     70
Total         60               80                     140

Remember: A Chi-square test first calculates what you expect to see in each cell of the table if the variables are independent (these values are brilliantly called the expected cell counts). The Chi-square test then takes what you observed in the data (called the observed cell counts) and measures how far those counts are from the expected cell counts using a Chi-square statistic.

In the breakfast age-group comparison, fewer adults than children eat breakfast (25 ÷ 70 ≈ 36 percent compared to 35 ÷ 70 = 50 percent). Even though you know results will vary from sample to sample, this difference turns out to be enough to declare a relationship between age group and eating breakfast, according to the Chi-square test of independence. Chapter 15 reveals all the details of doing a Chi-square test.
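As a rough sketch of the expected-versus-observed idea, the Python below works through the breakfast survey counts given earlier in this section. It assumes the usual formulas: each expected count is (row total × column total) ÷ grand total, and the Chi-square statistic adds up (observed − expected)² ÷ expected over all the cells. The variable names are made up for illustration.

```python
# Observed cell counts from the breakfast survey:
# rows are adults and children; columns are "eats breakfast" and "doesn't"
observed = [
    [25, 45],   # adults
    [35, 35],   # children
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell if age group and breakfast were independent
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(expected)
print(round(chi_square, 2))
```

If the two variables were independent, you’d expect 30 breakfast eaters and 40 non-eaters in each age group, and the statistic for the observed counts comes out to about 2.92. Whether that’s large enough depends on the significance level you choose: with 1 degree of freedom, the cutoff is about 2.71 at the 0.10 level and about 3.84 at the 0.05 level.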

You can also use the Chi-square test to see whether your theory about what percent of each group falls into a certain category is true or not. For example, can you guess what percentage of M&M’S fall into each color category? You can find more on these Chi-square variations, as well as the M&M’S question, in Chapter 16.

Chapter 2

Finding the Right Analysis for the Job


  • Deciphering the difference between categorical and quantitative variables
  • Choosing appropriate statistical techniques for the task at hand
  • Evaluating bias and precision levels
  • Interpreting the results properly

One of the most critical elements of statistics and data analysis is the ability to choose the right statistical technique for each job. Carpenters and mechanics know the importance of having the right tool when they need it and the problems that can occur if they use the wrong tool. They also know that the right tool helps to increase their odds of getting the results they want the first time around, using the work smarter, not harder approach.

In this chapter, you look at some of the major statistical analysis techniques from the point of view of the carpenters and mechanics — knowing what each statistical tool is meant to do, how to use it, and when to use it. You also zoom in on mistakes some number crunchers make in applying the wrong analysis or doing too many analyses.

Remember: Knowing how to spot these problems can help you avoid making the same mistakes, and it also helps you steer through the ocean of statistics that may await you in your job and in everyday life.

If many of the ideas you find in this chapter seem like a foreign language to you and you need more background information, don’t fret. Before continuing on in this chapter, head to your nearest Stats I book or check out another one of my books, Statistics For Dummies, 2nd Edition (Wiley).

Categorical versus Quantitative Variables

After you’ve collected all the data you need from your sample, you want to organize it, summarize it, and analyze it. Before plunging right into all the number crunching, though, you need to first identify the type of data you’re dealing with. The type of data you have points you to the proper types of graphs, statistics, and analyses you’re able to use.

Technical stuff: Before I begin, here’s an important piece of jargon: Statisticians call any quantity or characteristic you measure on an individual a variable; the data collected on a variable is expected to vary from person to person (hence the creative name).

There are two major types of variables:

Categorical. A categorical variable, also known as a qualitative variable, classifies the individual based on categories. For example, political affiliation may be classified into four categories: Democrat, Republican, Independent, and Other. Similarly, type of pet can take on three categories: Cat, Dog, and Other. Categorical variables can take on numerical values only as placeholders; the numbers themselves don’t mean anything special.

Quantitative. A quantitative variable measures or counts a quantifiable characteristic, such as height, weight, number of children you have, your GPA in college, or the number of hours of sleep you got last night. The quantitative variable value represents a quantity (count) or a measurement and has numerical meaning. That is, you can add, subtract, multiply, or divide the values of a quantitative variable, and the results make sense as numbers.

Because the two types of variables represent such different types of data, it makes sense that each type has its own set of statistics. Categorical variables, such as political affiliation, are somewhat limited in terms of the statistics that can be performed on them.

For example, suppose you have a sample of 500 classmates classified by dominant hand — 80 are left-handed and 420 are right-handed. How can you summarize this information? You already have the total number in each category (this statistic is called the frequency). You’re off to a good start, but frequencies are hard to interpret because you find yourself trying to compare them to a total in your mind in order to get a proper comparison. For example, in this case you may be thinking, Eighty left-handers out of what? Let’s see, it’s out of 500. Hmmm … what percentage is that?

The next step is to find a means to relate these numbers to each other in an easy way. You can do this by using the relative frequency, which is the percentage of data that falls into a specific category of a categorical variable. You can find a category’s relative frequency by dividing the frequency by the sample total and then multiplying by 100. In this case, you have (80 ÷ 500) × 100 = 16 percent left-handers and (420 ÷ 500) × 100 = 84 percent right-handers in the class.

You can also express the relative frequency as a proportion in each group by leaving the result in decimal form and not multiplying by 100. This statistic is called the sample proportion. In this example, the sample proportion of left-handed students is 0.16, and the sample proportion of right-handed students is 0.84.

Remember: You mainly summarize categorical variables by using two statistics: the number in each category (frequency) and the percentage in each category (relative frequency).
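The left-handed and right-handed example can be checked with a few lines of Python. This is only a sketch of the arithmetic described above; the dictionary and variable names are made up for illustration.

```python
# Frequencies from the dominant-hand example
frequencies = {"left-handed": 80, "right-handed": 420}

sample_total = sum(frequencies.values())   # 500 classmates in all

# Sample proportion: frequency divided by the sample total
proportions = {
    hand: count / sample_total
    for hand, count in frequencies.items()
}

# Relative frequency: the same quantity expressed as a percentage
percentages = {
    hand: 100 * count / sample_total
    for hand, count in frequencies.items()
}

for hand in frequencies:
    print(hand, frequencies[hand], proportions[hand], percentages[hand])
```

The proportions always add up to 1 and the percentages to 100, which makes a quick sanity check on your category counts.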

Statistics for Categorical Variables

The types of statistics done on categorical data may seem limited; however, the wide variety of analyses you can perform using frequencies and relative frequencies offers answers to an extensive range of possible questions you may want to explore.

In this section, you see that the proportion in each group is the number-one statistic for summarizing categorical data. Beyond that, you see how you can use proportions to estimate, compare, and look for relationships between the groups that make up the categorical data.

Estimating a proportion

You can use relative frequencies to make estimates about a single population proportion. (Refer to the earlier section, "Categorical versus Quantitative Variables," for an explanation of relative frequencies.)

Suppose you want to know what proportion of registered voters in the United States identify as Democrats, Republicans, and Independents. According to a random sample of 12,000 registered voters in the U.S. conducted by the Pew Research Center in 2019, the percentage of Democrat, Republican, and Independent registered voters was 33 percent, 29 percent, and 34 percent, respectively. Now, because the Pew researchers based these results on only a sample of the population and not on the entire population, their results will vary if they take another random sample of 12,000 people. This variation in sample results is cleverly called — you guessed it — sampling variability.
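You can get a feel for sampling variability with a small simulation. The Python sketch below pretends that exactly 33 percent of a very large population falls into one category and then draws several random samples of 12,000. The 33 percent figure, the number of repeated samples, and the random seed are assumptions chosen only for illustration; this is not how the Pew survey was conducted.

```python
import random

random.seed(1)  # fixed seed so the sketch gives the same numbers each run

TRUE_PROPORTION = 0.33   # assumed population value, for illustration only
SAMPLE_SIZE = 12_000
NUM_SAMPLES = 5

def sample_percentage(p, n):
    """Simulate n randomly chosen people; return the percent in the category."""
    hits = sum(1 for _ in range(n) if random.random() < p)
    return 100 * hits / n

results = [
    sample_percentage(TRUE_PROPORTION, SAMPLE_SIZE)
    for _ in range(NUM_SAMPLES)
]

for pct in results:
    print(f"{pct:.1f} percent")
```

Each simulated sample lands near 33 percent but rarely on exactly the same value, and that sample-to-sample wobble is the variability being described.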

The sampling variability is measured by the margin of error (the amount that you add and subtract from your sample statistic), which for this sample is only about 0.9 percent. (To find out how to calculate margin of error, turn to Chapter 4.) That means, for example, that the estimated percentage of all registered voters in the U.S. identifying as Democrat is somewhere between 33 − 0.9 = 32.1 percent and 33 + 0.9 = 33.9 percent.
