Course Notes For Unit 1 of The Udacity Course ST101 Introduction To Statistics
Contents
Welcome
Course Overview
Looking at Data
Scatter-Plots
Bar Charts
Pie Charts
Programming Charts (Optional)
Admissions Case Study
Answers
Welcome
Welcome to Statistics 101. Let's start with a teaser. This is a challenging teaser, and it may provoke you. I believe you should be unhappy. Not because our class is bad, but because I will prove in a moment that you are unpopular. The reason we're doing this is to show how deep statistics is, and how easily we can fool ourselves. For simplicity, let's say that there are two types of people, type A and type B. Type A are popular. They have 80 friends. Type B are less popular. They only have 20 friends. Now, you may say that I don't know which type you are. That is true, so we will calculate the expected, or average, number of friends. To do this, we assume that half of the people are of type A and the other half are of type B.
Unpopular Quiz
Let's get back to the real question. How many friends should you expect the friend that you picked to have?
Course Overview
Most of the material we will cover in this class is very basic. It is the first class you would have in college if you're not a statistics major. We'll teach you how to visualize data, how to summarize it, how to run tests, and even how to find trends. But there are also a few challenging nuggets in there. The challenges are optional, and they're clearly marked as optional. You will prove some theorems along the way, and most importantly you'll get the chance to program the things you've learned (using the Python programming language). Again, programming is optional. You don't need to have a programming background to do this course. But we would suggest you give it a try. Many people learn the material much better by programming than any other way.
Looking at Data
The basis of statistics is that the world is full of data, and we have to make decisions. Statistics comes to our rescue. It takes data and turns it into information that we can use to make decisions. Whatever field you are in, the chances are that it is driven by data. Statistics is important to know and to understand. It's universal, useful, and Sebastian promises it is fun too!
Valuing Houses
One of the standard problems that people study in statistics has to do with purchasing decisions. Suppose you want to buy a house. There are houses of various sizes, but you really like one particular house. This house has a specific asking price, let's say $92,000. You want to know whether this is too much, or perhaps too little. In statistics, the way to find out is by looking at data. Let's assume there's a database of previous house sales in the same neighbourhood. For simplicity, we'll assume we know two things: the size of the home and the sale price.
Size (ft²)    Cost ($)
1400          112,000
2400          192,000
1800          144,000
1900          152,000
1300          104,000
1100           88,000
Scatter-Plots
Most Important Part Quiz
What do you think is the most important thing that a statistics person does?
Look at data
Program computers
Run statistics
Eat pizza
Is there a fixed cost per square foot for this data set?
Is there a fixed cost per square foot for this data set now?
As we have seen, data can carry a lot of information. There is a trick called a scatterplot that you can use to visualise the data. Take a pencil and a piece of paper and arrange the data in a graph where the x axis is house size, and the y axis is the price.
In a scatter plot, each data item becomes a dot on the graph. The first house would appear on the graph as follows:
[Figure: scatter plot with Size on the x axis and price on the y axis, showing the first house as a single dot]
Now, we chose a 2-dimensional list to make for a convenient 2-dimensional scatterplot. These are the most popular scatter plots because surfaces like paper are 2-D. When we add the remaining data points, we get:
This is a nice scatter-plot that allows us to draw a straight line through all the points. When this happens, and there's a relationship between the data that is governed by a straight line, we call the data linear. Linearity is fairly rare in statistics. More often you will find deviations, because the size of a house is not the only factor that determines its cost (or perhaps also because most of us are bad negotiators!). When a data set is linear, it is really easy to predict the prices of houses in between, just by looking at the data. We're doing what a statistician ought to do.
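To make the idea of a linear data set concrete, here is a minimal Python sketch using the six house sales listed under Valuing Houses. It checks whether every house has the same cost per square foot and, if so, uses that rate to predict a price in between. The names sizes, costs and predict_price are illustrative choices rather than part of the course material, and a fixed rate per square foot is only the simplest kind of linear relationship (a straight line through the origin).

```python
# House sales from the notes: size in square feet and sale price in dollars.
sizes = [1400, 2400, 1800, 1900, 1300, 1100]
costs = [112000, 192000, 144000, 152000, 104000, 88000]

# Cost per square foot for each house.
price_per_sqft = [cost / size for size, cost in zip(sizes, costs)]

# The data follows a fixed rate if every house has (almost) the same ratio.
rate = price_per_sqft[0]
is_linear = all(abs(p - rate) < 1e-9 for p in price_per_sqft)


def predict_price(size):
    """Predict the price of a house from its size, assuming the fixed rate holds."""
    return rate * size


print("Fixed cost per square foot:", is_linear)
print("Rate ($ per square foot):", rate)
print("Predicted price for 1500 square feet:", predict_price(1500))
```

A prediction like this is only as good as the assumption behind it. Once the data contains deviations from the line, a single rate no longer describes every house.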
Is it Linear? Quiz
Plot the modified data below. Is the relationship between the data linear?
Size (ft²)    Cost ($)
1700          53,000
2100          44,000
1900          59,000
1300          82,000
1600          50,000
2200          68,000
Congratulations
So, now we know a lot about scatter plots. They tend to be 2-dimensional, and a simple eye-ball of the data can tell us a lot about the relationship of one variable to another. Scatter-plots aren't great when there is what is called "noise" in the data. This happens when the data deviates from expectation in some random, noisy way. Next, we'll look at another simple plotting technique called the bar chart, which addresses the issue of noisy data by grouping data points into a single cumulative bar.
Bar Charts
In this section we're going to look at bar charts. These are a common statistical data visualization tool.
Interpolation Quiz
If we now ask how much you should pay for a 2200 square foot house, using the interpolation method we learned earlier, what figure would you get? Do you trust that number?
There is good reason not to trust this value. The cost of the 2300 square-foot house is less than that of the 2100 square-foot house. These deviations from the linear relationship are called noise. This is the term that statisticians use. Maybe one house has a great view, while another is an old house. Perhaps a third is on the coast, or maybe one needs a new kitchen. There is a whole range of possible factors that affect the cost over and above the size of the property. If these factors aren't included in the data, a statistician will call it random noise. Bar charts are one way to alleviate the problem.
In a bar chart, we take the raw data and pool it together into bands. For example, in our house data, we may group all the data for house sizes between 1000 and 1500 square-feet into one bar. Then group the data for house sizes between 1500 and 2000 square-feet into another bar, and so on:
Bar Charts
When we look at the bar chart, we will see that it is a much finer representation of the data. Pooling multiple data points together to form a single bar can give a much clearer picture of the dependence of cost on size. While the bar chart doesn't show the linear relationship in the same way as the scatter-plot (actually, in this case the relationship is non-linear), it really gives a clear sense that, as house size increases, the cost increases. That may not have been obvious from just looking at the individual data points. The bar chart lets us pool groups of data together into single bars and so understand global trends. Now, these global trends might not be that important if you only have six data points, but imagine that you have 60,000 data points. In this case, small variations in individual data points may not tell us much, but the bar chart can really help us to understand the data.
One of the jobs of the statistician is to use cumulative tools, such as bar graphs, to gain an understanding of the underlying data.
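The pooling step can be sketched in a few lines of Python. This example reuses the six house sales from the Valuing Houses table as stand-in data, because the noisy data set behind the bar chart is not reproduced in these notes. Taking the average cost as the height of each bar is an assumption made for illustration, and the names sales, bands and pool_into_bands are hypothetical.

```python
# Stand-in data: (size in square feet, cost in dollars) from the first table.
sales = [(1400, 112000), (2400, 192000), (1800, 144000),
         (1900, 152000), (1300, 104000), (1100, 88000)]

# Each band is a (low, high) pair; a house belongs to it if low <= size < high.
bands = [(1000, 1500), (1500, 2000), (2000, 2500)]


def pool_into_bands(sales, bands):
    """Return the average cost of the houses that fall into each band."""
    averages = {}
    for low, high in bands:
        costs = [cost for size, cost in sales if low <= size < high]
        if costs:  # skip empty bands rather than divide by zero
            averages[(low, high)] = sum(costs) / len(costs)
    return averages


band_averages = pool_into_bands(sales, bands)
for (low, high), average in band_averages.items():
    print(f"{low}-{high} sq ft: average cost ${average:,.0f}")
```

With only six houses the bars add little, but the same function works unchanged on a list of 60,000 sales, which is where pooling starts to pay off.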
Histograms
Now, we are going to introduce histograms as a special case of the bar chart. The key difference is that the bar charts that we have discussed so far have dealt with 2-dimensional data. Histograms only consider 1-dimensional data. Let's consider an example. Let's suppose that we asked a group of software engineers how much they earn, and got the following responses:
$132,754
$137,192
$122,177
$147,121
$143,000
$126,010
$129,200
$124,312
$128,132
For the histogram, we are going to create a bar chart that is only concerned with frequency. This is basically a count that groups the salaries into a series of buckets, say from $120,000 to $130,000, from $130,000 to $140,000, and over $140,000.
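Counting salaries into buckets is easy to express in Python. The sketch below is one possible way to do it, not the course's own code. The name frequency_counts is hypothetical, each bucket is assumed to include its lower bound and exclude its upper bound, and the "over $140,000" bucket is capped at $150,000 as in the Answers section.

```python
# Salaries reported by the software engineers, in dollars.
salaries = [132754, 137192, 122177, 147121, 143000,
            126010, 129200, 124312, 128132]

# Each bucket covers low <= salary < high (an assumption about the boundaries).
buckets = [(120000, 130000), (130000, 140000), (140000, 150000)]


def frequency_counts(values, buckets):
    """Return how many values fall into each (low, high) bucket."""
    return [sum(1 for v in values if low <= v < high) for low, high in buckets]


counts = frequency_counts(salaries, buckets)
for (low, high), count in zip(buckets, counts):
    print(f"${low:,} to ${high:,}: {count}")
```

The counts are the heights of the histogram bars. Because every salary lands in exactly one bucket, they should add up to the number of responses.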
Histogram Quiz
What is the frequency count for the salaries that fall into the three brackets?
Let's create a rather simplified histogram, looking at the ages of people between 0 and 40 using the following data set: 21, 17, 9, 27, 35, 4, 12, 12, 32, 14, 38, 9, 19, 22, 21, 14, 3, 8, 31, 15, 33, 29. Group the data into the ranges 0-10, 11-20, 21-30, and 31-40. What are the heights of the bars for the four ranges?
Summary
In this unit we learned about bar charts and histograms. Both use vertical bars, and both aggregate data. The big difference is that bar charts are defined over 2-D data, one dimension applied to the x axis and the other to the y axis. Histograms only apply to 1-D data, and the y axis becomes the count of that data.
Pie Charts
Most of us have seen pie charts before. In statistics, we use pie charts to visualise data, specifically relative data. We will see what that means in a moment.
Voting Quiz 1
Let's say that there is an election and there are just two parties. Both parties are getting the same number of votes, i.e. 50% each. Which of these pie charts reflects the outcome of the election?
Voting Quiz 2
Now, we said that pie charts are good for relative data. Suppose Party A got 724,000 votes and Party B got 181,000 votes. What percentage of the vote did Party A get?
Voting Quiz 3
Now, given this, which of these charts most closely resembles the election result?
So, a remarkable property of pie charts is that they are invariant to the actual numbers of votes. What they actually depict is the relative numbers of votes. In this case, the chart shows that Party A got many more votes than Party B. It shows this graphically, so you can see this without having to study the actual number of votes cast.
Now, the chart tells us nothing about the absolute number of votes cast, but it does tell us a lot about the distribution of the votes. We can easily see that A is the dominant party with more than 50% of the votes cast.
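A short Python sketch shows why the pie chart ignores absolute numbers. It turns the vote counts from Voting Quiz 2 into shares of the total, then repeats the calculation for a hypothetical election that is a thousand times smaller but has the same proportions. The function name vote_shares and the smaller election are assumptions made for illustration.

```python
def vote_shares(votes):
    """Return each party's fraction of the total vote (the size of its slice)."""
    total = sum(votes.values())
    return {party: count / total for party, count in votes.items()}


# Vote counts from Voting Quiz 2.
election = {"Party A": 724000, "Party B": 181000}
shares = vote_shares(election)

# Hypothetical election with 1000 times fewer votes but the same proportions.
small_election = {"Party A": 724, "Party B": 181}
small_shares = vote_shares(small_election)

for party in election:
    print(f"{party}: {100 * shares[party]:.0f}% vs {100 * small_shares[party]:.0f}%")
```

Both elections produce the same slices, so the two pie charts would look identical even though the vote totals differ by a factor of a thousand.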
Summary
So we just learned about pie charts. We learned that they are great for relative data, and they're wonderful for comparing which slice of the pie is biggest. We will look at relative data again later with a case study about gender discrimination in college admissions, using a study originally performed at UC Berkeley in California.
Programming Charts (Optional)
As you can see, in this case the range 2.8 to 3.6 was most frequent. So, there are three things we need to tell the computer. First, we tell it that we want to plot things. Second, we define the data and give it a name. Third, we tell it the type of plot we want (in this case, a histogram).
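The three steps can be sketched in plain Python. The data set used in the lecture is not reproduced in these notes, so this example borrows the list of numbers from the Histogram Quiz (presumably ages) and prints a text histogram instead of drawing one. The helper histplot defined here is a hypothetical stand-in written for this sketch. It is not the course's plotting library, and its signature is an assumption.

```python
# Step 1: tell the computer we want to plot things. In this sketch that means
# defining a small text-based helper, a stand-in for a real plotting library.
def histplot(values, ranges):
    """Print one row of '#' marks per range and return the bar heights."""
    heights = []
    for low, high in ranges:
        count = sum(1 for v in values if low <= v <= high)
        heights.append(count)
        print(f"{low:>2}-{high:<2} | {'#' * count}")
    return heights


# Step 2: define the data and give it a name (values from the Histogram Quiz).
ages = [21, 17, 9, 27, 35, 4, 12, 12, 32, 14, 38,
        9, 19, 22, 21, 14, 3, 8, 31, 15, 33, 29]

# Step 3: say which type of plot we want, in this case a histogram.
bar_heights = histplot(ages, [(0, 10), (11, 20), (21, 30), (31, 40)])
```

The ranges here include both end points, matching the way the quiz states them. A plotting library would normally choose the ranges for you unless told otherwise.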
Barchart Quiz
Replace the scatter-plot with a bar-chart using the same two arguments as before.
Now the chart shows clearly that as the height increases, so does the weight, but the differences between the heights of the bars suggest that the relationship is not exactly linear. In reality, of course, we know that the relationship between height and weight in a population isn't linear, but for the sake of this exercise, it is the best of the three options we provided for you.
Wages Quiz
Write a line of code to print a scatterplot of Age on the horizontal axis against Wage on the vertical axis. What is the youngest age at which a person earns $267,000?
Conclusion
In this section we created the Python code to generate our own bar charts, histograms, and scatter plots. We can use this to study the data, and perhaps learn something about it.
Admissions Case Study
The numbers that we will be using are not the same as those from UC Berkeley. This is a simplified version of that problem, but the paradox that it illustrates is the same. It is called "Simpson's Paradox".
Admissions Quiz 1
Among male students, 900 applied for Major A and 450 were admitted. What is the acceptance rate as a percentage?
Admissions Quiz 2
In a second major, Major B, 100 male students applied and 10 were accepted. What is the acceptance rate as a percentage?
Admissions Quiz 3
The same statistic was run for female students. Females applied predominantly for Major B. There were 900 applications for Major B, of whom 180 were admitted. Just 100 female students applied for Major A, of whom 80 were admitted. What is the acceptance rate for Major A as a percentage for the female student population?
Admissions Quiz 4
What is the acceptance rate for Major B as a percentage for the female student population?
Superficially, it appears that female students are favoured because for both majors, they have a better admission rate than the corresponding rate for male students. But what happens if we look at the admission statistics independently of the major?
Aggregation Quiz 1
A total of 1000 male students applied and 460 were admitted. What is the acceptance rate for male students across both majors?
Aggregation Quiz 2
Now do the same for female students. A total of 1000 female students applied and 260 were admitted. What is the acceptance rate for female students across both majors?
Perhaps surprisingly, given our earlier findings, when we look at both majors together, we find that males have a much higher admissions rate than females. This is not made up. The actual numbers we are using may be made up, but this effect was actually observed at the University of California at Berkeley many years ago. Looking at majors individually, we find that in each major the acceptance rate for females trumps that of males, and yet when we look at the overall statistics we find the opposite. We haven't added anything. We just regrouped the data. This example shows just how ambiguous statistics can be. In choosing how to graph your data, you can have a major impact on what people believe. A famous saying states: "I never believe statistics that I didn't doctor myself." The key lesson here is that statistics can be deep and are often manipulated. You should always be sceptical of statistics, whether they are your own results or other people's, and you really need to understand how raw data is turned into decisions or conclusions.
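The regrouping can be reproduced with a few lines of Python. This sketch uses only the applicant and admission counts from the quizzes above. The dictionary layout and the names admissions, acceptance_rate, per_major and overall are illustrative choices.

```python
# (applicants, admitted) for each group and major, as given in the quizzes.
admissions = {
    "male":   {"A": (900, 450), "B": (100, 10)},
    "female": {"A": (100, 80),  "B": (900, 180)},
}


def acceptance_rate(applied, admitted):
    """Return the acceptance rate as a percentage."""
    return 100 * admitted / applied


# Rates for each major separately.
per_major = {
    group: {major: acceptance_rate(*counts) for major, counts in majors.items()}
    for group, majors in admissions.items()
}

# Rates after adding both majors together.
overall = {}
for group, majors in admissions.items():
    applied = sum(a for a, _ in majors.values())
    admitted = sum(ad for _, ad in majors.values())
    overall[group] = acceptance_rate(applied, admitted)

print("Per major:", per_major)
print("Overall:  ", overall)
```

Females have the higher rate in each major, yet the lower rate overall, because most of them applied to Major B, which admits far fewer applicants of either sex.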
Answers
Average Friends Quiz
Now, I don't know which type you are, but there's a 50% chance that you're Type A, in which case you'll have 80 friends, and a 50% chance that you're Type B, in which case you'll have 20 friends. I can calculate your expected number of friends as: (80 x 0.5) + (20 x 0.5) = 40 + 10 = 50 friends.
Unpopular Quiz
Let's get back to the real question. How many friends should you expect the friend that you picked to have? There is an 80% chance that you picked a Type A friend, who will have 80 friends, because Type A people appear in four times as many friendships as Type B people. Similarly, there's a 20% chance that you picked a Type B friend, who has 20 friends. This gives an expected number of friends: (80 x 0.8) + (20 x 0.2) = 64 + 4 = 68 friends. You would only expect to have 50 friends yourself, so this suggests that you are unpopular!
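Both expected values can be checked with a short Python sketch. It assumes, as the teaser does, that half the population is of each type, and it models "picking a friend" as picking a person with probability proportional to how many friendships they take part in. That weighting is the assumption that produces the 80% and 20% figures. All variable names are illustrative.

```python
# Number of friends for each type, and each type's share of the population.
friends = {"A": 80, "B": 20}
population_share = {"A": 0.5, "B": 0.5}

# Expected number of friends for a randomly chosen person.
expected_own = sum(population_share[t] * friends[t] for t in friends)

# A randomly chosen friend is more likely to be popular: each type is weighted
# by its population share times its number of friendships.
weights = {t: population_share[t] * friends[t] for t in friends}
total_weight = sum(weights.values())
friend_share = {t: w / total_weight for t, w in weights.items()}

# Expected number of friends that a randomly chosen friend has.
expected_friend = sum(friend_share[t] * friends[t] for t in friends)

print("Chance that a picked friend is of each type:", friend_share)
print("Your expected number of friends:", expected_own)
print("Your friend's expected number of friends:", expected_friend)
```

The gap between the two results comes entirely from the weighting. Popular people are over-represented among friends, so the average friend has more friends than the average person.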
Is it Linear? Quiz
No. The two highlighted values are outliers. We'll talk more about outliers later. There is no way to fit a linear function through all these data points.
Interpolation Quiz
$105,000. No, the value cannot be trusted.
Histogram Quiz
$120,000 to $130,000: 5
$130,000 to $140,000: 2
$140,000 to $150,000: 2
Voting Quiz 1
B
Voting Quiz 2
80%
Voting Quiz 3
E
Approximately linear.
Barchart Quiz
barchart(Height, Weight)
Wages Quiz
scatterplot(Age, Wage)
30.
Approximately linear.
35 - 40
Admissions Quiz 1
50%
Admissions Quiz 2
10%
Admissions Quiz 3
80%
Admissions Quiz 4
20%
Aggregation Quiz 1
46%
Aggregation Quiz 2
26%