
Engineering Data Analysis Module 1-4

Uploaded by

s. magx
Copyright
© © All Rights Reserved

College of Engineering and Industrial Technology

Department of Agricultural and Biosystems Engineering


Engineering Data Analysis

Chapter 1: Introduction to Statistics and Data Analysis:


Obtaining Data
You are probably asking yourself the question, "When and where will I use statistics?" If you read any
newspaper, watch television, or use the Internet, you will see statistical information. There are statistics about
crime, sports, education, politics, and real estate. Typically, when you read a newspaper article or watch a
television news program, you are given sample information.
With this information, you may make a decision about the correctness of a statement, claim, or "fact."
Statistical methods can help you make the "best educated guess." Since you will undoubtedly be given
statistical information at some point in your life, you need to know some techniques for analyzing the
information thoughtfully. Think about buying a house or managing a budget.
The science of statistics deals with the collection, analysis, interpretation, and presentation of data.
We see and use data in our everyday lives.


1.1 Definitions of Statistics, Probability, and Key Terms


Organizing and summarizing data is called descriptive statistics. Two ways to summarize data are by
graphing and by using numbers (for example, finding an average). After you have studied probability and
probability distributions, you will use formal methods for drawing conclusions from "good" data. The formal
methods are called inferential statistics.
Statistical inference uses probability to determine how confident we can be that our conclusions are
correct. Effective interpretation of data (inference) is based on good procedures for producing data and
thoughtful examination of the data. You will encounter what will seem to be too many mathematical formulas
for interpreting data. The goal of statistics is not to perform numerous calculations using the formulas, but
to gain an understanding of your data. The calculations can be done using a calculator or a computer. The
understanding must come from you. If you can thoroughly grasp the basics of statistics, you can be more
confident in the decisions you make in life.

Key Terms:

In statistics, we generally want to study a population. You can think of a population as a collection
of persons, things, or objects under study. To study the population, we select a sample. The idea of sampling
is to select a portion (or subset) of the larger population and study that portion (the sample) to gain
information about the population. Data are the result of sampling from a population.
Because it takes a lot of time and money to examine an entire population, sampling is a very practical
technique. If you wished to compute the overall grade point average at your school, it would make sense to
select a sample of students who attend the school. The data collected from the sample would be the students'
grade point averages.
In presidential elections, opinion poll samples of 1,000–2,000 people are taken. The opinion poll is
supposed to represent the views of the people in the entire country. Manufacturers of canned carbonated
drinks take samples to determine if a 16 ounce can contain 16 ounces of carbonated drink. From the sample
data, we can calculate a statistic.
A statistic is a number that represents a property of the sample. For example, if we consider one math
class to be a sample of the population of all math classes, then the average number of points earned by
students in that one math class at the end of the term is an example of a statistic. The statistic is an estimate
of a population parameter.


A parameter is a numerical characteristic of the whole population that can be estimated by a statistic.
Since we considered all math classes to be the population, then the average number of points earned per
student over all the math classes is an example of a parameter. One of the main concerns in the field of
statistics is how accurately a statistic estimates a parameter. The accuracy really depends on how well the
sample represents the population. The sample must contain the characteristics of the population in order to
be a representative sample. We are interested in both the sample statistic and the population parameter in
inferential statistics. In a later chapter, we will use the sample statistic to test the validity of the established
population parameter.
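The relationship between a statistic and a parameter can be sketched with a small simulation. The population below is hypothetical (simulated scores, not data from this module), and the names `population`, `sample`, `parameter`, and `statistic` are chosen only for illustration.

```python
import random
from statistics import mean

# Hypothetical population: 10,000 simulated end-of-term scores (an assumption).
rng = random.Random(42)
population = [rng.gauss(75, 10) for _ in range(10_000)]

# The parameter describes the whole population.
parameter = mean(population)

# The statistic describes a random sample and estimates the parameter.
sample = rng.sample(population, 100)
statistic = mean(sample)

estimation_error = abs(statistic - parameter)
print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
print(f"estimation error:            {estimation_error:.2f}")
```

How close the statistic lands to the parameter depends on how well the sample represents the population, which is why random selection is used here.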
A variable, usually notated by capital letters such as X and Y, is a characteristic or measurement that
can be determined for each member of a population.
Variables may be numerical or categorical.
• Numerical variables take on values with equal units such as weight in pounds and time in hours.
• Categorical variables place the person or thing into a category.

“If we let X equal the number of points earned by one math student at the end of a term, then X is a numerical variable.
If we let Y be a person's party affiliation, then some examples of Y include Republican, Democrat, and Independent. Y is
a categorical variable”.

We could do some math with values of X (calculate the average number of points earned, for
example), but it makes no sense to do math with values of Y (calculating an average party affiliation makes
no sense).
Data are the actual values of the variable. They may be numbers or they may be words. Datum is a
single value. Two words that come up often in statistics are mean and proportion.

“If you were to take three exams in your math classes and obtain scores of 86, 75, and 92, you would calculate
your mean score by adding the three exam scores and dividing by three (your mean score would be 84.3 to one decimal
place). If, in your math class, there are 40 students and 22 are men and 18 are women, then the proportion of men
students is 22/40 and the proportion of women students is 18/40”.
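The mean and proportion in the example above can be checked with a short sketch. The variable names are illustrative; the numbers are the ones given in the example.

```python
from fractions import Fraction
from statistics import mean

# Mean of the three exam scores from the example.
exam_scores = [86, 75, 92]
mean_score = mean(exam_scores)           # 253 / 3
mean_rounded = round(mean_score, 1)

# Proportions of men and women in a class of 40 students.
men, women = 22, 18
class_size = men + women
prop_men = Fraction(men, class_size)     # 22/40
prop_women = Fraction(women, class_size) # 18/40

print(mean_rounded)
print(prop_men, float(prop_men))
print(prop_women, float(prop_women))
```

The two proportions add up to 1 because every student in the class falls into exactly one of the two groups.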

Data is collected every second of every day from a vast array of sources. From the security cameras
that deploy facial recognition technology when people enter a building to the mobile devices that track
shopping, media, and communication habits, images and numbers are continually being collected by
government agencies, consumer groups, and other organizations from all around the world. This data
contains information that can help businesses operate more efficiently and reach the right customers.
However, to be of any value, data must be correctly interpreted. Misinterpreted data can lead to flawed
insights that could disrupt an organization’s growth and stability strategies. To ensure that these copious
amounts of information are leveraged effectively, businesses and other groups are hiring data scientists to
help collect, store, and analyze pertinent information. While these professionals come from a variety of
backgrounds, the growing field of data science provides a number of rewarding opportunities specifically for
engineers. The fields overlap in significant ways, which often makes professionals with an engineering
background a good fit for a role working with data.

Why Data Analytics is Gaining HYPE in the 21st Century (https://towardsdatascience.com)

What Is Data Analysis?

Data analysis involves gathering and studying data to form insights that can be used to make
decisions. The information derived can be useful in several different ways, such as for building a business
strategy or ensuring the safety and efficiency of an engineering project. Data collection and analysis are
becoming increasingly important across almost every industry. Fields that collect this information include
marketing, sports, entertainment, medicine, communications, government, criminal justice, electronics, and
aerospace. Data can help companies make decisions on issues as diverse as how to engage their target
audiences, what purchases to make, and how to organize their staff members. Ultimately, data science is not
just about collecting and analyzing information. It is about being able to predict the future and verify the
results of past decisions.

Data Science and Engineering

Engineering is one industry that has been particularly influenced by the growing need for data
collection and analysis. As big data has begun to play a larger role in industries around the world, engineers
have been called on to play an influential role in the way this information is gathered, stored, and leveraged.
Professionals with an engineering background generally prove to be particularly adept at developing
techniques for analyzing data groups to extract valuable information.
To succeed in a career as a data scientist, an engineer should possess the following qualifications:
Analytics expertise: Experience extrapolating information from large quantities of numbers will help you
succeed in this role. Depending on where you work, knowledge of specific analytic tools will also
likely be required.
Computer knowledge: Gone are the days of crunching numbers on a hand-held calculator — much less with
pen and paper. The vast majority of your day will be spent working on a computer, so knowledge of
coding, unstructured data, and cloud tools will increase your marketability.
Communication skills: It is important to be able to present your findings in a clear and concise way to ensure
that your employer understands the information and can act accordingly.
Strong drive: In data science, you should regularly be looking for ways to improve how information is
collected and processed. Being an intellectually curious self-starter will take you far in this role.

Exercise No. 1
Determine what the key terms refer to in the following study. We want to know the average (mean) amount
of money first year college students spend at ABC College on school supplies that do not include books.
We randomly surveyed 100 first year students at the college. Three of those students spent $150, $200,
and $225, respectively.

Answer:
• The population is all first-year students attending ABC College this term.
• The sample could be all students enrolled in one section of a beginning statistics course at ABC
College (although this sample may not represent the entire population).
• The parameter is the average (mean) amount of money spent (excluding books) by first year college
students at ABC College this term.
• The statistic is the average (mean) amount of money spent (excluding books) by first year college
students in the sample.
• The variable could be the amount of money spent (excluding books) by one first year student. Let X =
the amount of money spent (excluding books) by one first year student attending ABC College.
• The data are the dollar amounts spent by the first-year students. Examples of the data are $150, $200,
and $225.

Exercise No. 2
Determine what the key terms refer to in the following study.
A study was conducted at a local college to analyze the average cumulative GPAs of students who graduated
last year. Fill in the letter of the phrase that best describes each of the items below.
1. Population_____ 2. Statistic _____ 3. Parameter _____ 4. Sample _____ 5. Variable _____
6. Data _____
a) all students who attended the college last year
b) the cumulative GPA of one student who graduated from the college last year
c) 3.65, 2.80, 1.50, 3.90
d) a group of students who graduated from the college last year, randomly selected
e) the average cumulative GPA of students who graduated from the college last year
f) all students who graduated from the college last year
g) the average cumulative GPA of students in the study who graduated from the college last year

Answer:
1. f; 2. g; 3. e; 4. d; 5. b; 6. c


1.2 Methods of Collecting Data


Two Types of Data
Qualitative data pertain to observations that can be categorized according to some characteristic or quality, such as
sex, civil status, occupation, religion, and other personal data. Quantitative data pertain to observations
that can be measured, such as weight, height, pulse rate, blood pressure, and heart rate. Quantitative data can be
further subdivided into discrete and continuous data.
• Discrete data are those expressed as a whole number or integer, such as a specific numerical test score
(90 out of 100 points).
• Continuous data are those that fall into the category of being “measured to the nearest” unit, such as height,
weight, and age.

Scales of Measurement
When gathering data by any method, measurements are usually obtained (e.g., height in inches,
weight in pounds, age in years, I.Q. scores, temperature in degrees Celsius, incidence rates, mortality rates,
etc.). Measurements are classified into four scales. In selecting the statistical tool to be used for drawing
inferences on a random sample, the type of measurement scale must be carefully considered.

1. Nominal Scale
A measurement that classifies elements into two or more categories or classes. The numbers indicate that
the elements are different, but the difference is not according to order or magnitude.
Example: Distribution of Patients in XYZ Hospital According to Religion and Gender

2. Ordinal Scale
A measurement scale that ranks individuals in terms of the degree to which they possess a characteristic.

Example: Anxiety Level of Mentally-Retarded Patients in Hospital XYZ

Legend:
0= not anxious
1= low anxiety level
2= moderate anxiety level
3= high anxiety level

3. Interval Scale
A measurement scale that, in addition to ordering scores from highest to lowest, establishes a uniform
unit in the scale so that any distance between two consecutive scores is of equal magnitude.
Example:
The difference between aptitude scores of 80 and 90 is equal to the difference between aptitude scores of 90 and 100.
There is also no absolute zero in the scale. For example, a temperature reading of 0 degrees Celsius in a place
does not mean that there is no temperature in that place.

4. Ratio Scale
A measurement scale that, in addition to being an interval scale, also has an absolute zero in the scale.
Examples:
Height, weight, area, volume, speed, rate of doing work, amount of money deposited in a bank.
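The practical difference between the interval and ratio scales can be sketched numerically. The Kelvin conversion is not part of the text above; it is brought in here only as an assumption to show a temperature scale that does have an absolute zero.

```python
def celsius_to_kelvin(celsius):
    """Convert degrees Celsius (interval scale) to kelvins (ratio scale)."""
    return celsius + 273.15

a_c, b_c = 10.0, 20.0

# Differences are meaningful on both scales and agree with each other.
diff_c = b_c - a_c
diff_k = celsius_to_kelvin(b_c) - celsius_to_kelvin(a_c)

# Ratios are only meaningful when the scale has an absolute zero.
ratio_c = b_c / a_c                                        # misleading "twice as hot"
ratio_k = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)  # physically meaningful

print(diff_c, round(diff_k, 2))
print(ratio_c, round(ratio_k, 4))
```

Saying that 20 degrees Celsius is "twice as hot" as 10 degrees Celsius is therefore not justified, while saying that a 20 kg mass is twice a 10 kg mass is.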

Classification of Data According to Source:


1. Primary Data – are those gathered from primary sources such as individual persons, organized groups,
documents in their original form, and living organisms such as animals, fowls, etc.
2. Secondary Data – are those gathered from secondary sources such as books (including dictionaries and
encyclopedias), articles published in professional journals, unpublished master’s theses, and all other
second-hand sources.


Ways of Collecting Data

There is no formula for selecting the best method to be used when gathering data. It depends on the
researcher’s design of the study, the type of data, the time allotment to complete the study, and the
researcher’s financial capacity.
There are several ways of collecting data, among which are the following: clerical tools and mechanical
devices.
A. Clerical Tools
1. Questionnaire - Defined by Good as a list of planned, written questions related
to a particular topic, with space provided for indicating the response to each question, intended for
submission to a number of persons for reply; commonly used in normative surveys and in the
measurement of attitudes and opinions.

Construction of Questionnaires:
1. Doing library search
2. Talking to knowledgeable people
3. Mastering the guidelines
4. Writing the questionnaire
5. Editing the questionnaire
6. Rewriting the questionnaire
7. Pretesting the questionnaire (dry run)

Types of Questions Asked in Survey Questionnaires


The types of questions asked in questionnaires for survey purposes are:
A. According to Form
1. The free response type
2. The guided response type
B. According to the kind of data asked for
a. Descriptive
b. Quantified
c. Intensity of feelings, emotion and attitude
d. Degree of judgment
e. Understanding

2. Interview- One of the major techniques in gathering data or information. It is a purposeful face-to-face
relationship between two persons.
Types of Interviews
a. Direct Method- The researcher personally interviews the respondents.
b. Indirect Method- The researchers may use a telephone to interview the respondents.

Classes of Interview
a. Standardized- The interviewer is not allowed to change the specific wordings of the questions in the
interview schedule. He must conduct all interviews in the same manner, and he cannot adapt
questions for specific situations or pursue statements.
b. Non-standardized- The interviewer has complete freedom to develop each interview in the most
appropriate manner for each situation. He is not held to any specific questions. This is the same as
the so-called informal interview.
c. Semi-standardized- The interviewer is required to ask a number of specific major questions, and
beyond these he is free to probe as he chooses. There are prepared principal questions to be asked,
and once these are asked and answered the interviewer is free to ask any questions he sees fit for
the situation.
d. Focused- Also called depth interview. Similar to non-standardized interview, the researcher asks a
series of questions based on his previous understanding and insight of the situation. The interview
is focused on specific topics that are to be investigated in depth.
e. Nondirective- The interviewee or subject is allowed and even encouraged to express his feelings
without the fear of disapproval. The subject can express his feelings and views on certain topics even
without waiting to be questioned or even without pressure from the interviewer.

3. Empirical Observation Method- A means of gathering information for research; it may be defined as
perceiving data through the senses: sight, hearing, taste, touch, and smell. The sense of sight is the most
important and the most used among the senses.
Types of Observation
a. Participant and non-participant observation
1. Participant- Observer takes active part in the activities of the group being observed.

2. Non-participant- Observer is a mere bystander observing the group he is studying.
b. Structured and unstructured observation
1. Structured – concentrates on a particular aspect/s of the variable being observed, be it a thing,
behavior, condition, or situation. Usually used in non-participant or controlled observation.
2. Unstructured – observer does not hold any list of the items to be observed. Usually used in
participant or uncontrolled observation.
c. Controlled and uncontrolled observation
1. Controlled – usually utilized in experimental studies in which the experimental as well as the
non-experimental variables are controlled by the researcher.
2. Uncontrolled – usually utilized in natural settings, with no control placed upon any variable in the
observation area.
4. Registration, Test, Experiment and Library
B. Mechanical devices
Microscopes, Thermometers, Cameras, etc.

Exercise No. 3
Identify each quantitative variable as discrete or continuous. Write D if discrete or C if continuous
1. The boiling point of water is 100 deg. Cel.
2. Length of hair of female students.
3. Number of foreigners migrating to the Philippines every year
4. Her home telephone number is 2581376.
5. The number of children with missing/decayed teeth in barangay A is 2000.
6. John’s height is 168 cm.
7. The following data are the densities of sample substances taken from Tabing-Ilog River (g/cc):
23.6, 19.8, 15.0, 7.8 and 2.4.
8. Weights in pounds of the Math quiz contestants.
9. The average speed of motorboats cruising in Manila Bay every day is 50m/s.
10. Scores of 16 students in a Statistics Quiz.


1.3 Planning and Conducting Surveys:


Introduction to Design of Experiments
Design of Experiments
Do you remember learning about this back in high school or junior high even? What were those steps
again? Decide what phenomenon you wish to investigate. Specify how you can manipulate the factor and
hold all other conditions fixed, to ensure that these extraneous conditions aren't influencing the response
you plan to measure.
Then measure your chosen response variable at several (at least two) settings of the factor under study. If
changing the factor causes the phenomenon to change, then you conclude that there is indeed a cause-and-
effect relationship at work.
How many factors are involved when you do an experiment? Some say two - perhaps this is a comparative
experiment? Perhaps there is a treatment group and a control group? If you have a treatment group and a
control group then, in this case, you probably only have one factor with two levels.

Engineering Experiments
If we had infinite time and resource budgets there probably wouldn't be a big fuss made over designing
experiments. In production and quality control we want to control the error and learn as much as we can
about the process or the underlying theory with the resources at hand. From an engineering perspective we're
trying to use experimentation for the following purposes:
a. reduce time to design/develop new products & processes
b. improve performance of existing processes
c. improve reliability and performance of products
d. achieve product & process robustness
e. perform evaluation of materials, design alternatives, setting component & system tolerances, etc.
We always want to fine-tune or improve the process. In today's global world this drive for
competitiveness affects all of us both as consumers and producers.
Robustness is a concept that enters into statistics at several points. At the analysis stage, robustness refers
to a technique that isn't overly influenced by bad data. Even if there is an outlier or bad data you still want to
get the right answer. Likewise, a robust process is one that still works regardless of who or what is involved in it.
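Robustness at the analysis stage can be sketched by comparing how two summaries react to a single bad value. The text does not name a specific robust technique; the median is used here as one common example, and the measurements are made up for illustration.

```python
from statistics import mean, median

# Hypothetical repeated measurements of the same quantity.
clean = [10.1, 9.9, 10.0, 10.2, 9.8]

# The same data with one bad reading (an outlier) added.
with_outlier = clean + [100.0]

mean_shift = mean(with_outlier) - mean(clean)
median_shift = median(with_outlier) - median(clean)

print(f"mean shifts by   {mean_shift:.2f}")
print(f"median shifts by {median_shift:.2f}")
```

The mean is pulled far from the bulk of the data by one bad reading, while the median barely moves, which is the sense in which the median is the more robust summary here.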


Every experiment design has inputs. Consider baking a cake: we have our ingredients such
as flour, sugar, milk, eggs, etc. Regardless of the quality of these ingredients we still want our cake to come
out successfully. In every experiment there are inputs and, in addition, there are factors (such as time of baking,
temperature, geometry of the cake pan, etc.), some of which you can control and others that you can't control.
The experimenter must think about factors that affect the outcome. We also talk about the output and the
yield or the response to your experiment. For the cake, the output might be measured as texture, flavor,
height, or size.

The Basic Principles of DOE

Randomization
This is an essential component of any experiment that is going to have validity. If you are doing a comparative
experiment where you have two treatments, a treatment and a control, for instance, you need to include in
your experimental process the assignment of those treatments by some random process. An experiment
includes experimental units. You need to have a deliberate process to eliminate potential biases from the
conclusions, and random assignment is a critical step.
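Random assignment of two treatments can be sketched as follows. The function name `randomize`, the unit labels, and the seed are hypothetical choices made for this illustration only.

```python
import random

def randomize(units, seed=None):
    """Randomly split experimental units into a treatment and a control group."""
    rng = random.Random(seed)
    shuffled = list(units)   # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

# Ten hypothetical experimental units (for example, field plots).
units = [f"plot_{i}" for i in range(1, 11)]
assignment = randomize(units, seed=1)

print("treatment:", assignment["treatment"])
print("control:  ", assignment["control"])
```

Because the split is driven by a random shuffle rather than by the experimenter's choice, no systematic preference can decide which units receive the treatment.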

Replication
Replication is in some sense the heart of all of statistics. To make this point, remember what the standard
error of the mean is: it is the square root of the estimate of the variance of the sample mean, i.e., $\sqrt{s^2/n}$,
where $s^2$ is the sample variance and $n$ is the sample size. The width
of the confidence interval is determined by this statistic. Our estimates of the mean become less variable as
the sample size increases. Replication is the basic issue behind every method we will use in order to get a
handle on how precise our estimates are at the end. We always want to estimate or control the uncertainty in
our results. We achieve this estimate through replication. Another way we can achieve short confidence
intervals is by reducing the error variance itself. However, when that isn't possible, we can reduce the error in
our estimate of the mean by increasing n.

The other way to reduce the length of the confidence interval is to reduce the error variance,
which brings us to blocking.
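The effect of replication on the standard error of the mean can be sketched numerically. The helper names and the sample values below are illustrative assumptions; the formula is the usual one, the sample standard deviation divided by the square root of n.

```python
import math
from statistics import stdev

def standard_error(sample):
    """Standard error of the mean: sqrt(s^2 / n), with s the sample standard deviation."""
    return stdev(sample) / math.sqrt(len(sample))

def se_for_n(s, n):
    """Standard error for a given standard deviation s and sample size n."""
    return s / math.sqrt(n)

# With the spread held fixed, quadrupling n halves the standard error.
s = 4.0
for n in (4, 16, 64):
    print(n, se_for_n(s, n))

# Standard error computed directly from a small hypothetical sample.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(round(standard_error(sample), 4))
```

The loop shows why replication has diminishing returns: each halving of the standard error costs four times as many observations.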

Blocking
Blocking is a technique to include other factors in our experiment which contribute to undesirable variation.
Much of the focus in this class will be to creatively use various blocking techniques to control sources of
variation that will reduce error variance. For example, in human studies, the gender of the subjects is often
an important factor. Age is another factor affecting the response. Age and gender are often considered
nuisance factors which contribute to variability and make it difficult to assess systematic effects of a treatment.
By using these as blocking factors, you can avoid biases that might arise from differences in how
subjects are allocated to the treatments, and you can account for some of the noise in the experiment. We
want the unknown error variance at the end of the experiment to be as small as possible. Our goal is usually
to find out something about a treatment factor (or a factor of primary interest), but in addition to this, we
want to include any blocking factors that will explain variation.
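One simple way to combine blocking with randomization is to randomize separately inside each block. The sketch below assumes gender as the blocking factor; the function name `blocked_assignment`, the subject records, and the seed are hypothetical.

```python
import random
from collections import defaultdict

def blocked_assignment(subjects, block_key, treatments, seed=None):
    """Randomly assign treatments within each block so every block is balanced."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for subject in subjects:
        blocks[block_key(subject)].append(subject)

    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)                  # random order inside the block
        for i, subject in enumerate(members):
            # Cycle through the treatments so each appears equally often.
            assignment[subject["id"]] = treatments[i % len(treatments)]
    return assignment

# Eight hypothetical subjects: four women and four men.
subjects = [{"id": i, "gender": "F" if i <= 4 else "M"} for i in range(1, 9)]
assignment = blocked_assignment(
    subjects, lambda s: s["gender"], ["treatment", "control"], seed=7
)
print(assignment)
```

Each gender contributes the same number of subjects to each treatment, so a difference between treatments cannot be explained by an imbalance in the blocking factor.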

Steps for Planning, Conducting and Analyzing an Experiment


The practical steps needed for planning and conducting an experiment include: recognizing the goal of the
experiment, choice of factors, choice of response, choice of the design, analysis and then drawing conclusions.
This pretty much covers the steps involved in the scientific method.
a. Recognition and statement of the problem
b. Choice of factors, levels, and ranges
c. Selection of the response variable(s)
d. Choice of design
e. Conducting the experiment
f. Statistical analysis
g. Drawing conclusions, and making recommendations


Factors
We usually talk about "treatment" factors, which are the factors of primary interest to you. In addition to
treatment factors, there are nuisance factors which are not your primary focus, but you have to deal with
them. Sometimes these are called blocking factors, mainly because we will try to block on these factors to
prevent them from influencing the results.

There are other ways that we can categorize factors: Experimental vs. Classification Factors

Experimental Factors
These are factors that you can specify (and set the levels) and then assign at random as the treatment to the
experimental units. Examples would be temperature, level of an additive, fertilizer amount per acre, etc.

Classification Factors
These can't be changed or assigned; they come as labels on the experimental units. The age and sex of the
participants are classification factors which can't be changed or randomly assigned. But you can select
individuals from these groups randomly.

Quantitative vs. Qualitative Factors

Quantitative Factors
You can assign any specified level of a quantitative factor. Examples: percent or pH level of a chemical.

Qualitative Factors
These factors have categories which are different types. Examples might be species of a plant or animal, a
brand in the marketing field, or gender. These are not ordered or continuous but are arranged perhaps in sets.

References:
Illowsky, B., and Dean, S., (2018) Introductory Statistics
Calderon, J.F., and Gonzales, E.C., (2016) Methods of Research and Thesis Writing
De Belen, R., and Feliciano, P., (2015) 1st Edition Basic Statistics for Research
Pareño, E., and Jimenez, R., (2006) Basic Statistics: A Worktext
https://online.stat.psu.edu/stat503/book/export/html/632

Probability
Engr. Sheila Jane Margaret C. Peña
Instructor
PROBABILITY
◉ The strand of mathematics looking at the chance of events occurring.
◉ The chance that a given event will occur.

EXPERIMENT
◉ Is the process by which an observation (or measurement) is obtained.
Examples:
• Recording a test grade
• Measuring daily rainfall
• Flipping a coin and observing the face that appears. Possible outcomes: Head, Tail
• Tossing a die. Possible outcomes: 1, 2, 3, 4, 5, 6
Each experiment may result in an outcome, which is called an event and is denoted by a capital letter.

SAMPLE SPACE
◉ A set in which all of the possible outcomes of a statistical experiment are represented as points. It is
represented by the symbol S.

Example
◉ The sample space when a coin is flipped is S = {H,T}
◉ The sample space of tossing a die is S = {1,2,3,4,5,6}
◉ Each outcome in a sample space is called an element or a member of the sample space.

ELEMENTS
◉ A member or object in a set.
◉ An item or term contained within a set of items or terms.

EVENT
◉ Is an occurrence or the possibility of an
occurrence that is being investigated.
◉ It is a set of outcomes from a given experiment.
◉ An Event is a subset of a sample space.

Example
◉ If A is the event that an odd number comes out in a single toss of a die, then A = {1,3,5} is a subset
of the sample space S = {1,2,3,4,5,6}.
◉ If A is the event that at least one head comes out in flipping a coin twice, then A = {HH, HT, TH} is a
subset of the sample space S = {HH, HT, TH, TT}.
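The two examples above can be reproduced with Python sets. This is only a sketch; the variable names are chosen for illustration.

```python
from itertools import product

# Sample space for a single toss of a die, and the event "an odd number comes out".
die_space = set(range(1, 7))
odd_event = {x for x in die_space if x % 2 == 1}

# Sample space for flipping a coin twice, and the event "at least one head".
two_flip_space = {"".join(flips) for flips in product("HT", repeat=2)}
at_least_one_head = {outcome for outcome in two_flip_space if "H" in outcome}

print(sorted(odd_event))
print(sorted(two_flip_space))
print(sorted(at_least_one_head))

# An event is a subset of its sample space.
print(odd_event <= die_space, at_least_one_head <= two_flip_space)
```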

SUBSET
◉ A subset of a given set is a collection of things that belong to the original set.
◉ A set whose members are part of a bigger set.

COMPLEMENT
◉ The complement of an event A with respect to the sample space S is the set of all elements of S that
are not in event A. We denote the complement of A by the symbol A'.

Example
◉ Let R be the event that a red card is selected from an ordinary
deck of 52 playing cards, and let S be the entire deck. Then R’
is the event that the card selected from the deck is not a red
card but a black card.
◉ Consider the sample space S = {book, cell phone, mp3, paper,
stationery, laptop}.
Let A = {book, stationery, laptop, paper}.
Then, the complement of A, A’ = {cell phone, mp3}.

INTERSECTION
◉ The intersection of two events A and B,
denoted by the symbol A ∩ B, is the event
containing all elements that are common to
A and B.

Example
◉ Let E be the event that a person selected at random in a
classroom is majoring in engineering, and let F be the event
that the person is female. Then E ∩F is the event of all female
engineering students in the classroom.

◉ Let V = {a,e,i,o,u} and C = {l,r,s,t}; then it follows that V ∩ C = ∅.
That is, V and C have no elements in common and, therefore,
cannot both simultaneously occur.

Two events A and B are mutually
exclusive events if A ∩ B = ∅, that is, if A and
B have no elements in common.
Ex. In the die-tossing experiment, if
A = {1,2,3} and B = {4,5,6}, then A ∩ B = ∅.

The union of the two events A and B,
denoted by the symbol A∪B, is the event
containing all the elements that belong to
A or B or both.

Ex. Let A = {a,b,c} and B = {b,c,d,e}; then
A∪B = {a,b,c,d,e}.
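The set operations defined above map directly onto Python's built-in `set` type. The following is a minimal sketch that reuses the sets from the examples; the event `B = {4, 5, 6}` paired with the odd-number event and the variable names are chosen here only for illustration.

```python
# Sample space and events for a single toss of a die.
S = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5}            # an odd number comes out
B = {4, 5, 6}            # illustrative second event (an assumption)

complement_A = S - A     # A'  : elements of S not in A
intersection = A & B     # A ∩ B : elements common to A and B
union = A | B            # A ∪ B : elements in A or B or both

# Mutually exclusive events have an empty intersection.
V = {"a", "e", "i", "o", "u"}
C = {"l", "r", "s", "t"}
mutually_exclusive = V.isdisjoint(C)

# Union example from the text.
X = {"a", "b", "c"}
Y = {"b", "c", "d", "e"}
union_XY = X | Y

print(complement_A, intersection, union, mutually_exclusive, sorted(union_XY))
```

Every event here is a subset of its sample space, which can be checked with `A <= S`.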
With reference to the Venn diagram of events A, B, and C (regions numbered 1 to 7):
A∪C = regions 1, 2, 3, 4, 5, and 7,
B’∩A = regions 4 and 7,
A∩B∩C = region 1,
(A∪B)∩C’ = regions 2, 6, and 7.

EXERCISES:
1. List the elements of each of the sample spaces
(a) the set of integers between 1 and 50 divisible by 8;
(b) the set S = {x | x² + 4x − 5 = 0};
(c) the set of outcomes when a coin is tossed until a tail
or three heads appear;
(d) the set S = {x | x is a continent};
(e) the set S = {x | 2x−4 ≥ 0 and x<1}.

RESULTS:

2. An experiment consists of choosing a number
from 0 to 9 at random. Let A be the event of
choosing an even number and B be the event of
choosing an odd number. Let C be the event of
choosing the number 2, 3, 4 or 5 and D be the
event of choosing 1, 6, or 7. List the elements of
the sets corresponding to the following:
Sample space S, A, B, C, D, A ∪ C, C', A ∩ B,
(S ∩ C)', A ∩ B ∩ D'
Solution:

S = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
A = even numbers = {0, 2, 4, 6, 8}
B = odd numbers = {1, 3, 5, 7, 9}
C = {2, 3, 4, 5}
D = {1, 6, 7}

A ∪ C = {0, 2, 3, 4, 5, 6, 8}
C' = {0, 1, 6, 7, 8, 9}
A ∩ B = ∅
(S ∩ C)' = {0, 1, 6, 7, 8, 9}
A ∩ B ∩ D' = ∅

Counting Rules Useful in Probability

◉ The Fundamental Counting Principle
(also called the counting rule) is a way to
figure out the number of outcomes in a
probability problem. Basically, you
multiply the number of ways each event can
occur to get the total number of outcomes.
Why is counting important in
probability?
◉ To decide “how likely” an event is, we need
to count the number of ways the event could
occur and compare it to the total number of
possible outcomes.

Counting Sample Points - The fundamental principle of counting,
often referred to as the multiplication rule
◉ Rule 1. If an operation can be performed in n1 ways, and if for each
of these ways a second operation can be performed in n2 ways, then
the two operations can be performed together in 𝑛1 𝑛2 ways.
Ex1. How many sample points are there in the sample space
when a pair of dice is thrown once?
The first die can land face-up in any one of n1 = 6 ways. For
each of these 6 ways, the second die can also land face-up in n2 = 6
ways. Therefore, the pair of dice can land in n1n2 = (6)(6) = 36 possible
ways.
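The multiplication rule can be checked by listing the sample space explicitly. This is a small sketch using the standard library; the variable names are illustrative only.

```python
from itertools import product

faces = range(1, 7)

# Each sample point is an ordered pair (first die, second die).
sample_space = list(product(faces, repeat=2))

n1 = len(faces)   # ways the first die can land
n2 = len(faces)   # ways the second die can land
n_points = len(sample_space)

print(n_points, n1 * n2)
```

Enumerating is only practical for small experiments; the rule n1n2 gives the same count without building the list.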

Ex.2. A developer of a new subdivision offers
prospective home buyers a choice of Tudor, rustic,
colonial, and traditional exterior styling in ranch, two-
story, and split-level floor plans. In how many different
ways can a buyer order one of these homes?

Ex. 3. If a 22-member club needs to elect a chair and a
treasurer, in how many different ways can these two be
elected?

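Solution to Ex. 2: Since there are n1 = 4 exterior stylings and n2 = 3 floor
plans, a buyer must choose from n1n2 = (4)(3) = 12 possible homes.

Solution to Ex. 3: For the chair position there are 22 possibilities. For each
of these, there are 21 possibilities for the treasurer. By the multiplication
rule, the two can be elected in n1n2 = (22)(21) = 462 different ways.
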
◉ Rule 2. If an operation can be performed in n1 ways, and if
for each of these a second operation can be performed in
n2 ways, and for each of the first two a third operation can
be performed in n3 ways, and so forth, then the sequence
of k operations can be performed in n1n2…nk ways.
Ex1. Sam is going to assemble a computer by himself.
He has the choice of chips from two brands, a hard drive from
four, memory from three, and an accessory bundle from five
local stores. How many different ways can Sam order the
parts?
Solution:

Since n1 = 2, n2 = 4, n3 = 3, and n4 = 5, there are

n1n2n3n4 = (2)(4)(3)(5)
= 120 different ways to order the parts.

PERMUTATION
◉ is an arrangement of all or part of a set of objects.

Ex. Consider the three letters a, b, and c. The
possible permutations are abc, acb, bac, bca, cab,
and cba.
n1n2n3 = (3)(2)(1) = 6 permutations

Theorems:
◉ For any non-negative integer n, n!, called “n
factorial,” is defined as n! = n(n−1)···(2)(1), with
special case 0! = 1.

Theorem 1: The number of permutations of n
objects is n!.

Now consider the number of permutations that are
possible by taking two letters at a time from abcd.
These would be ab, ac, ad, ba, bc, bd, ca, cb, cd, da,
db, and dc.
In general, n distinct objects taken r at a time can be
arranged in n(n−1)(n−2)···(n−r+1) ways. We represent
this product by the symbol $_nP_r$.
Theorem 2. The number of permutations of n distinct
objects taken r at a time is
$$_nP_r = \frac{n!}{(n-r)!}$$
Example
1. In one year, three awards (research, teaching, and service)
will be given to a class of 25 graduate students in a statistics
department. If each student can receive at most one award,
how many possible selections are there?
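Theorem 2 can be checked numerically. The sketch below lists the two-letter arrangements of abcd and then applies the same formula to the awards example, under the reading that the three awards are distinct, so order matters. `math.perm` requires Python 3.8 or later.

```python
from itertools import permutations
from math import perm

# All ordered arrangements of two letters taken from "abcd".
two_letter = ["".join(p) for p in permutations("abcd", 2)]

# nPr = n! / (n - r)!
count_by_formula = perm(4, 2)

# Awards example: 3 distinct awards among 25 students, at most one each.
award_selections = perm(25, 3)   # 25 * 24 * 23

print(two_letter)
print(count_by_formula, award_selections)
```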

Theorem 3. The number of permutations of n
objects arranged in a circle is (n−1)!.

Theorem 4. The number of distinct permutations of
n things of which n1 are of one kind, n2 of a second
kind, ..., nk of a kth kind is
$$\frac{n!}{n_1!\, n_2! \cdots n_k!}$$
Example
◉ In a college football training session, the defensive
coordinator needs to have 10 players standing in a row.
Among these 10 players, there are 1 freshman, 2
sophomores, 4 juniors, and 3 seniors. How many different
ways can they be arranged in a row if only their class level
will be distinguished?
Directly using Theorem 4, we find that the total number
of arrangements is
$$\frac{10!}{1!\, 2!\, 4!\, 3!} = 12{,}600$$
Often we are concerned with the number of ways of partitioning a
set of n objects into r subsets called cells. A partition has been
achieved if the intersection of every possible pair of the r subsets is
the empty set ∅ and if the union of all subsets gives the original set.
The order of the elements within a cell is of no importance. Consider
the set {a, e, i, o, u}. The possible partitions into two cells in which
the first cell contains 4 elements and the second cell 1 element
are {(a,e,i,o),(u)}, {(a,i,o,u),(e)}, {(e,i,o,u),(a)}, {(a,e,o,u),(i)}, {(a,e,i,u),(o)}.

We see that there are 5 ways to partition a set of 5 elements into two
subsets, or cells, containing 4 elements in the first cell and 1 element in
the second.
◉ Theorem 5. The number of ways of partitioning a set of n
objects into r cells with n1 elements in the first cell, n2
elements in the second, and so forth, is

$$\binom{n}{n_1, n_2, \ldots, n_r} = \frac{n!}{n_1!\, n_2! \cdots n_r!}$$

where n1 + n2 + ··· + nr = n.

Ex. In how many ways can 7 graduate students be assigned to 1
triple and 2 double hotel rooms during a conference?
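Theorems 4 and 5 share the same formula, so one helper covers both. The function name `multinomial` below is a hypothetical choice for this sketch; it is applied to the vowel partition, the football line-up, and the hotel-room example, assuming the three rooms are distinguishable.

```python
from math import factorial

def multinomial(n, cell_sizes):
    """Return n! / (n1! n2! ... nr!) for cell sizes that add up to n."""
    if sum(cell_sizes) != n:
        raise ValueError("cell sizes must add up to n")
    ways = factorial(n)
    for size in cell_sizes:
        ways //= factorial(size)   # each step divides exactly
    return ways

vowel_partitions = multinomial(5, [4, 1])         # {a, e, i, o, u} into cells of 4 and 1
row_arrangements = multinomial(10, [1, 2, 4, 3])  # players distinguished by class level
room_assignments = multinomial(7, [3, 2, 2])      # 1 triple and 2 double rooms

print(vowel_partitions, row_arrangements, room_assignments)
```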

In many problems, we are interested in the number of ways of
selecting r objects from n without regard to order. These selections
are called combinations. A combination is actually a partition with
two cells, the one cell containing the r objects selected and the other
cell containing the (n−r) objects that are left. The number of such
combinations, denoted by $\binom{n}{r,\, n-r}$ or simply $\binom{n}{r}$, is
$$\binom{n}{r} = \frac{n!}{r!\,(n-r)!}$$

Example
◉ A young boy asks his mother to get 5 picture cards from his collection of
10 flower picture cards and 5 sports picture cards. In how many ways can
his mother get 3 flower and 2 sports picture cards?
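One way to work this example is to count the flower selections and the sports selections separately with the combination formula, then combine them with the multiplication rule. A short sketch, with illustrative variable names:

```python
from math import comb

flower_choices = comb(10, 3)   # 3 of the 10 flower cards, order ignored
sports_choices = comb(5, 2)    # 2 of the 5 sports cards, order ignored

# Multiplication rule: every flower selection pairs with every sports selection.
total_ways = flower_choices * sports_choices

print(flower_choices, sports_choices, total_ways)
```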

Exercises:
1. If an experiment consists of throwing a die and then drawing
a letter at random from the English alphabet, how many points
are there in the sample space?
2. How many ways are there to select 3 candidates from 8
equally qualified recent graduates for openings in an
accounting firm?

EXERCISE 1.

Answer: By the multiplication rule, there are n1n2 = (6)(26) = 156 points in
the sample space.

EXERCISE 2.

Answer:
Step-by-step explanation:
First, we list the given facts:
3 candidates to be selected
8 possible candidates
Because the order of selection does not matter, this problem
uses the combination formula. This is the
formula for the number of possible
selections of r objects from a set of n
objects, regardless of order.
In this case,
r=3
n=8
$$\binom{8}{3} = \frac{8!}{3!\,5!} = 56$$

Therefore, the accounting firm has 56 possible ways to select 3
candidates from a pool of 8 qualified recent graduates.

Probability of an Event, P[Event]
◉ Postulates of Probability:
0 ≤ P[Event] ≤ 1
P [impossible event] = 0
P[sure event] = 1
The sum of the probabilities for all simple events in S
is equal to 1.
The probability of an event A is the sum of the weights of all
sample points in A. Therefore,

0 ≤ P (A ) ≤ 1, P (φ) = 0, and P (S ) = 1.

Furthermore, if A1, A2, A3, . . . is a sequence of mutually
exclusive events, then

P(A1 ∪ A2 ∪ A3 ∪ ···) = P(A1) + P(A2) + P(A3) + ···.

Example 1: A coin is tossed twice. What is the probability that at least
1 head occurs?
Answer: S = {HH, HT, TH, TT}
P (A) = 3/4

Example 2: A die is loaded in such a way that an even number is twice
as likely to occur as an odd number. If E is the event that a number less
than 4 occurs on a single toss of the die, find P(E).
Answer: S = {1, 2, 3, 4, 5, 6}. Let w = probability that a given odd number
comes out; then 2w = probability that a given even number comes out.
Let E be the event that one of the numbers 1, 2, 3 comes out, or E = {1,2,3}. Then
$$P(E) = \frac{w + 2w + w}{w + 2w + w + 2w + w + 2w} = \frac{4w}{9w} = \frac{4}{9}$$
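The loaded-die calculation can be reproduced with exact fractions. In this sketch each face gets a relative weight (1 for odd, 2 for even), the weights are normalised so they sum to 1, and P(E) is the sum of the weights of the sample points in E. Variable names are illustrative.

```python
from fractions import Fraction

# Relative weights: an even face is twice as likely as an odd face.
weights = {face: (2 if face % 2 == 0 else 1) for face in range(1, 7)}
total_weight = sum(weights.values())   # 9, so w = 1/9

# Normalise so that the probabilities over S sum to 1.
prob = {face: Fraction(wt, total_weight) for face, wt in weights.items()}

E = {1, 2, 3}                          # a number less than 4 occurs
p_E = sum(prob[face] for face in E)

print(p_E)
```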

◉ Rule 3.
If an experiment can result in any one of N
different equally likely outcomes, and if exactly n
of these outcomes correspond to event A, then
the probability of event A is
$$P(A) = \frac{n}{N}$$

Example
◉ In a poker hand consisting of 5 cards, find the probability of holding 2 aces and 3
jacks.
Solution: The number of ways of being dealt 2 aces from the 4 aces in the deck is
$$\binom{4}{2} = \frac{4!}{2!\,2!} = 6$$
The number of ways of being dealt 3 jacks from the 4 jacks in the deck is
$$\binom{4}{3} = \frac{4!}{3!\,1!} = 4$$
By the multiplication rule (Rule 1), there are n = (6)(4) = 24 hands with 2 aces and 3
jacks. The total number of 5-card poker hands, all of which are equally likely, is
$$\binom{52}{5} = \frac{52!}{5!\,47!} = 2{,}598{,}960$$
Therefore,
$$P(E) = \frac{24}{2{,}598{,}960} \approx 0.92 \times 10^{-5}$$
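As a numerical check of Rule 3 for this example, the sketch below counts favourable hands using the fact that a standard 52-card deck has four aces and four jacks, then divides by the total number of 5-card hands. Variable names are illustrative.

```python
from math import comb

ways_two_aces = comb(4, 2)      # choose 2 of the 4 aces
ways_three_jacks = comb(4, 3)   # choose 3 of the 4 jacks

favorable = ways_two_aces * ways_three_jacks   # multiplication rule
total_hands = comb(52, 5)                      # all equally likely 5-card hands

p_hand = favorable / total_hands

print(favorable, total_hands, p_hand)
```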

Additive Rules
Theorem 6. If A and B are two events, then
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Corollary 1: If A and B are mutually exclusive, then
P(A ∪ B) = P(A) + P(B).
Corollary 2: If A1, A2,…, An are mutually exclusive, then
P(A1 ∪ A2 ∪ ···∪ An) = P(A1) + P(A2) + ···+ P(An).
Corollary 3: If A1, A2,…, An is a partition of sample space S,
then
P(A1 ∪ A2 ∪ ···∪ An) = P(A1) + P(A2) + ···+ P(An) = P(S) = 1.

Theorem 7.
For three events A, B, and C,
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) −
P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

Example

Theorem 8.
If A and A’ are complementary
events, then
P(A) + P(A’) = 1.

Examples
1. If the probabilities that an automobile mechanic
will service 3, 4, 5, 6, 7, or 8 or more cars on any given
workday are, respectively, 0.12, 0.19, 0.28, 0.24, 0.10,
and 0.07, what is the probability that he will service
at least 5 cars on his next day at work?
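One way to approach this example is through Theorem 8: the event "at least 5 cars" is the complement of "fewer than 5 cars", which here means 3 or 4 cars. The dictionary keys below are labels chosen for this sketch.

```python
# Probabilities given in the example, keyed by number of cars serviced.
probs = {"3": 0.12, "4": 0.19, "5": 0.28, "6": 0.24, "7": 0.10, "8 or more": 0.07}

# Complement rule: P(E) = 1 - P(E').
p_fewer_than_5 = probs["3"] + probs["4"]
p_at_least_5 = 1 - p_fewer_than_5

# Direct sum over the mutually exclusive outcomes, as a cross-check.
p_direct = probs["5"] + probs["6"] + probs["7"] + probs["8 or more"]

print(round(p_at_least_5, 2), round(p_direct, 2))
```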

Exercises:
1. In a high school graduating class of 100 students, 54 studied
mathematics, 69 studied history, and 35 studied both
mathematics and history. If one of these students is selected
at random, find the probability that
(a) the student took mathematics or history;
(b) the student did not take either of these subjects;
(c) the student took history but not mathematics.

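Answers (M = studied mathematics, H = studied history):
(a) P(M ∪ H) = P(M) + P(H) − P(M ∩ H) = 54/100 + 69/100 − 35/100 = 88/100, or 0.88
(b) P[(M ∪ H)'] = 1 − P(M ∪ H) = 12/100, or 0.12
(c) P(H ∩ M') = P(H) − P(M ∩ H) = 69/100 − 35/100 = 34/100, or 0.34
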
2. In a poker hand consisting of 5 cards,
find the probability of holding
(a) 3 aces;
(b) 4 hearts and 1 club.

Thank you for
LISTENING!
Any questions?
(+63)9178295308 @Margaret Peña scpena@carsu.edu.ph

College of Engineering and Industrial Technology
Department of Agricultural and Biosystems Engineering
Engineering Data Analysis

Chapter 3: PROBABILITY DISTRIBUTION:
DISCRETE, CONTINUOUS AND JOINT PROBABILITY DISTRIBUTION

In this Lesson, we take the next step toward inference. In Lesson 2, we introduced events and probability
properties. In this Lesson, we will learn how to numerically quantify the outcomes into a random variable. Then
we will use the random variable to create mathematical functions to find probabilities of the random variable.
One of the most important discrete distributions is the binomial distribution, and the most important
continuous distribution is the normal distribution. Both will be discussed in this lesson. We will also
talk about how to compute probabilities for these two distributions.

Learning Objectives

Upon successful completion of this lesson, you should be able to:


1. Distinguish between discrete and continuous random variables.
2. Compute probabilities, cumulative probabilities, means and variances for discrete random variables.
3. Identify binomial random variables and their characteristics.
4. Calculate probabilities of binomial random variables.
5. Describe the properties of the normal distribution.
6. Find probabilities and percentiles of any normal distribution.
7. Apply the Empirical rule.

3.1 Introduction: What is a random variable?


Let's use a scenario to introduce the idea of a random variable. Suppose we flip a fair coin three times
and record if it shows a head or a tail. The outcome or sample space is S= {HHH, HHT, HTH, THH, TTT, TTH, THT,
HTT}. There are eight possible outcomes and each of the outcomes is equally likely. Now, suppose we flipped a
fair coin four times. How many possible outcomes are there? There are 2⁴=16. How about ten
times? 1024 possible outcomes! Instead of considering all the possible outcomes, we can consider assigning the
variable X, say, to be the number of heads in n flips of a fair coin. If we flipped the coin n=3 times (as above),
then X can take on possible values of 0,1,2, or 3. By defining the variable, X, as we have, we created a random
variable.
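The mapping from outcomes to values of X can be made concrete by enumerating the sample space. This sketch assumes a fair coin, as in the scenario above, and tabulates how often each value of X (number of heads in three flips) occurs; the names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely outcomes of three flips, e.g. ('H', 'T', 'H').
outcomes = list(product("HT", repeat=3))

# X assigns to each outcome its number of heads.
head_counts = Counter(outcome.count("H") for outcome in outcomes)

# Probability of each value of X under equally likely outcomes.
pmf = {x: Fraction(count, len(outcomes)) for x, count in sorted(head_counts.items())}

print(len(outcomes))
for x, p in pmf.items():
    print(x, p)
```

The sample space grows as 2 to the power n (1024 outcomes for ten flips), while X takes only n + 1 values, which is why working with the random variable is simpler.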

Random Variable
A random variable is a variable that takes on different values determined by chance. In other words, it
is a numerical quantity that varies at random.

Types of Random Variables

There are mainly two types of random variables:


• Discrete Random Variable- When the random variable can assume only a countable, sometimes infinite,
number of values.
• Continuous Random Variable- When the random variable can assume an uncountable number of values in
a line interval.

Probability Functions

Transforming the outcomes to a random variable allows us to quantify the outcomes and determine certain
characteristics. If we have a random variable, we can find its probability function.


Probability Function
A probability function is a mathematical function that provides probabilities for the possible outcomes of the
random variable, X. It is typically denoted as f(x).

There are two classes of probability functions: Probability Mass Functions and Probability Density Functions.

The probability of a random variable being less than or equal to a given value is calculated using another
probability function called the cumulative distribution function.


3.2 DISCRETE PROBABILITY DISTRIBUTIONS


Expected Value and Variance of a Discrete Random Variable


a. What is the probability a randomly selected inmate has exactly 2 priors?

b. What is the probability a randomly selected inmate has < 2 priors?

c. What is the probability a randomly selected inmate has 2 or fewer priors?

d. What is the probability a randomly selected inmate has > 2 priors?

e. Finally, which of a, b, c and d above are complements?

c and d are complements

Binomial Random Variables

A binary variable is a variable that has two possible outcomes. For example, sex (male/female) or having
a tattoo (yes/no) are both examples of a binary categorical variable.
A random variable can be transformed into a binary variable by defining a “success” and a “failure”. For
example, consider rolling a fair six-sided die and recording the value of the face. The random variable, value of
the face, is not binary. If we are interested, however, in the event A= {3 is rolled}, then the “success” is rolling a
three. The failure would be any value not equal to three. Therefore, we can create a new variable with two
outcomes, namely A = {3} and B = {not a three} or {1, 2, 4, 5, 6}. This new variable is now a binary variable.


The Binomial Distribution


The binomial distribution is a special discrete distribution where there are two distinct
complementary outcomes, a “success” and a “failure”.
We have a binomial experiment if ALL of the following four conditions are satisfied:
• The experiment consists of n identical trials (n is fixed).
• Each trial results in one of the two outcomes, called success and failure.
• The probability of success, denoted p, remains the same from trial to trial.
• The n trials are independent. That is, the outcome of any trial does not affect the outcome of the others.
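When the four conditions above hold, the number of successes X in n trials follows the standard binomial probability mass function P(X = x) = C(n, x) p^x (1 − p)^(n − x). The sketch below implements that formula; the function name and the example (counting threes in n = 5 rolls of a fair die, so p = 1/6) are assumptions made for illustration.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 5, 1 / 6   # hypothetical: 5 rolls, "success" = rolling a three

pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
p_at_least_one = 1 - pmf[0]   # complement of "no threes at all"

for x, prob in enumerate(pmf):
    print(x, round(prob, 4))
print(round(p_at_least_one, 4))
```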


3.3 CONTINUOUS PROBABILITY DISTRIBUTIONS


In the beginning of the course, we looked at the difference between discrete and continuous data. The
last section explored working with discrete data, specifically, the distributions of discrete data. In this lesson
we're again looking at the distributions but now in terms of continuous data. Examples of continuous data
include...
• the amount of rainfall in inches in a year for a city.
• the weight of a newborn baby.
• the height of a randomly selected student.


The Normal Distribution

The Standard Normal Distribution


Probabilities for Normal Random Variables (Z-scores)

The standard normal is important because we can use it to find probabilities for a normal random
variable with any mean and any standard deviation.
But first, we need to explain Z-scores.
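A Z-score is conventionally defined as z = (x − μ)/σ, the number of standard deviations an observation lies from the mean. The sketch below uses that standard definition together with the standard normal cumulative distribution function written in terms of `math.erf`. The numbers (μ = 70, σ = 10, x = 85) are hypothetical and do not come from this module.

```python
from math import erf, sqrt

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mu) / sigma

def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal random variable Z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 70, 10   # hypothetical mean and standard deviation
x = 85

z = z_score(x, mu, sigma)
p_below = standard_normal_cdf(z)   # P(X <= 85)
p_above = 1 - p_below              # P(X > 85)

print(z, round(p_below, 4), round(p_above, 4))
```

The same two steps, standardise and then look up the standard normal probability, apply to a normal random variable with any mean and standard deviation.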


3.4 JOINT PROBABILITY: DEFINITION, FORMULA, AND EXAMPLE


The term joint probability refers to a statistical measure that calculates the likelihood of two events
occurring together and at the same point in time. Put simply, a joint probability is the probability of event Y
occurring at the same time that event X occurs. Joint probability is defined for any two events. When the events
are independent of one another, which means they aren't conditional and don't rely on each other, the joint
probability equals the product of their individual probabilities. Joint probabilities can be visualized using
Venn diagrams.

Formula and Calculation of Joint Probability

Notation for joint probability can take a few different forms. The following formula represents the probability
of the intersection of two events:

P(X ∩ Y), where X and Y are two different events that intersect.

Although joint probability can help you determine the likelihood of two different events happening at the
same time, it does not indicate how the two events may influence each other.

What Does Joint Probability Tell You?

Probability is a field closely related to statistics that deals with the likelihood of an event or phenomenon
occurring. It is quantified as a number between 0 and 1, where 0 indicates an impossible chance of occurrence
and 1 denotes the certain outcome of an event.

For example, the probability of drawing a red card from a deck of cards is 1/2 = 0.5. This means there is an
equal chance of drawing a red and black card since there are 26 of each in a deck. As such, there is a 50-50
probability of drawing a red card versus a black card.

Joint probability measures two events that happen at the same time. It can only be applied to situations where
more than one observation can occur at the same time. So the joint probability of picking a card that is both
red and 6 from a deck is P(6 ∩ red) = 2/52 = 1/26, since a deck of cards has two red sixes: the six of hearts and
the six of diamonds. Because the events red and 6 are independent, you can also use the following formula to
calculate the joint probability:

P(6 ∩ red) = P(6) × P(red) = 4/52 × 26/52 = 1/26

The symbol “∩” in a joint probability is referred to as an intersection. The probability of event X and event Y
happening is the same thing as the point where X and Y intersect. Therefore, the joint probability is also called
the intersection of two or more events. A Venn diagram is perhaps the best visual tool to explain an
intersection:


From the Venn above, the point where both circles overlap is the intersection, which has two observations: the
six of hearts and the six of diamonds.

Joint Probability vs. Conditional Probability

Joint probability should not be confused with conditional probability, which is the probability that one event
will happen given that another action or event happens. The conditional probability formula is as follows:

P(X | Y) = P(X ∩ Y) / P(Y)

This is to say that the chance of one event happening is conditional on another event happening. For example,
from a deck of cards, the probability that you get a six, given that you drew a red card is P(6│red) = 2/26 = 1/13,
since there are two sixes out of 26 red cards.

Joint probability only factors in the likelihood of both events occurring. Conditional probability can be used to
calculate joint probability, as seen in this formula:

P(X ∩ Y) = P(X | Y) × P(Y)

The probability that X and Y both occur is the probability of X occurring, given that Y occurs, multiplied by the
probability that Y occurs. Given this formula, the probability of drawing a 6 and a red at the same time will be as
follows:

P(6 ∩ red) = P(6 | red) × P(red) = 1/13 × 26/52 = 1/13 × 1/2 = 1/26

Example of Joint Probability

Let's highlight another example to show how joint probability works. This example uses two dice, and we want to
find the probability that you'll roll a four on each die when you roll them. Remember, there are six sides
to each one.

In order to determine the joint probability, we first need to determine the probability of each roll:

The chance of rolling a four on the first die is 1/6
The chance of rolling a four on the second die is 1/6

Because the two rolls are independent, we can find the joint probability for this event by multiplying the
individual probabilities together.

1/6 x 1/6 = 1/36; this means that there is a 1/36 chance of rolling two fours using a pair of dice.
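The card and dice figures in this section can be checked by enumerating equally likely outcomes. The sketch below builds a standard 52-card deck and the 36 outcomes of two dice; the helper `prob` and the other names are hypothetical, introduced only for this illustration.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suit_color = {"hearts": "red", "diamonds": "red", "clubs": "black", "spades": "black"}
deck = [(rank, suit) for rank in ranks for suit in suit_color]

def prob(event, space):
    """Fraction of equally likely outcomes in `space` for which `event` is true."""
    return Fraction(sum(1 for outcome in space if event(outcome)), len(space))

def is_six(card):
    return card[0] == "6"

def is_red(card):
    return suit_color[card[1]] == "red"

p_six = prob(is_six, deck)
p_red = prob(is_red, deck)
p_six_and_red = prob(lambda card: is_six(card) and is_red(card), deck)  # joint
p_six_given_red = p_six_and_red / p_red                                 # conditional

# Two dice: joint probability of a four on each die.
rolls = list(product(range(1, 7), repeat=2))
p_two_fours = prob(lambda roll: roll == (4, 4), rolls)

print(p_six_and_red, p_six_given_red, p_six * p_red, p_two_fours)
```

Here the joint probability equals the product P(6) × P(red) because rank and color are independent in a standard deck; that equality does not hold for dependent events.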

References:

https://www.investopedia.com/terms/j/jointprobability.asp
https://online.stat.psu.edu/stat500/lesson/3
https://www.knime.com/blog/continuous-probability-distribution
Walpole, R. E., et al. (2011). Probability & Statistics for Engineers & Scientists. Ninth Edition.

College of Engineering and Industrial Technology
Department of Agricultural and Biosystems Engineering
Engineering Data Analysis

Chapter 4: Sampling Distributions and Confidence Intervals


In inferential statistics, we want to use characteristics of the sample (i.e. a statistic) to estimate the
characteristics of the population (i.e. a parameter). In Lesson 3, we learned how to define events as random variables.
By doing so, we can understand events mathematically by using probability functions, means, and standard deviations.
All of this is important because it helps us reach our goal to be able to make inferences about the population based
on the sample. But we need more. If we obtain a random sample and calculate a sample statistic from that sample,
the sample statistic is a random variable (wow!). The population parameters, however, are fixed. If the statistic is a
random variable, can we find the distribution? The mean? The standard deviation? The answer is yes! This is why we
need to study the sampling distribution of statistics. So, what is a sampling distribution?

4.1 Sampling Distribution


The sampling distribution of a statistic is a probability distribution based on a large number of samples of
size n from a given population. Consider this example. A large tank of fish from a hatchery is being delivered to the
lake. We want to know the average length of the fish in the tank. Instead of measuring all of the fish, we randomly
sample twenty fish and use the sample mean to estimate the population mean. Denote the sample mean of the twenty
fish as x̄₁. Suppose we take a separate sample of size twenty from the same hatchery. Denote that sample mean as x̄₂.
Would x̄₁ equal x̄₂? Not necessarily. What if we took another sample and found the mean? Consider now taking 1000
random samples of size twenty and recording all of the sample means. We could take the 1000 sample means and
create a histogram. This would give us a picture of what the distribution of the sample means looks like. The
distribution of all of these sample means is the sampling distribution of the sample mean. We can find the sampling
distribution of any sample statistic that would estimate a certain population parameter of interest. In this Lesson, we
will focus on the sampling distributions for the sample mean, x̄, and the sample proportion, p̂. We begin by describing
the sampling distribution of the sample mean and then applying the central limit theorem. Last, we will discuss the
sampling distribution of the sample proportion.

Objectives

Upon successful completion of this lesson, you should be able to:


• Understand the meaning of sampling distribution.
• Apply the central limit theorem to calculate approximate probabilities for sample means and sample
proportions.
• Describe the sampling distribution of the sample mean and proportion.
• Identify situations in which the normal distribution and t-distribution may be used to approximate a sampling
distribution.

4.1 Sample Means with a Small Population: Pumpkin Weights


In this example, the population is the weight of six pumpkins (in pounds) displayed in a carnival "guess the
weight" game booth. You are asked to guess the average weight of the six pumpkins by taking a random sample
without replacement from the population.


Since we know the weights from the population, we can find the population mean.

$$\mu = \frac{19 + 14 + 15 + 9 + 10 + 17}{6} = 14$$

To demonstrate the sampling distribution, let’s start with obtaining all of the possible samples of size 𝑛 = 2
from the population, sampling without replacement. The table below shows all the possible samples, the weights for
the chosen pumpkins, the sample mean and the probability of obtaining each sample. Since we are drawing at
random, each sample will have the same probability of being chosen.

We can combine all of the values and create a table of the possible values and their respective probabilities.

The table is the probability table for the sample mean and it is the sampling distribution of the sample mean weights
of the pumpkins when the sample size is 2. It is also worth noting that the sum of all the probabilities equals 1. It might
be helpful to graph these values.


One can see that the chance that the sample mean is exactly the population mean is only 1 in 15, very small.
(In some other examples, it may happen that the sample mean can never be the same value as the population mean.)
When using the sample mean to estimate the population mean, some possible error will be involved since the sample
mean is random.
Now that we have the sampling distribution of the sample mean, we can calculate the mean of all the sample
means. In other words, we can find the mean (or expected value) of all the possible x̄’s.
The mean of the sample means is:

Even though each sample may give you an answer involving some error, the expected value is right at the target:
exactly the population mean. In other words, if one does the experiment over and over again, the overall average of
the sample mean is exactly the population mean.
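The pumpkin example is small enough to enumerate completely. The sketch below lists every sample drawn without replacement from the six weights, builds the sampling distribution of the sample mean, and checks that its expected value equals the population mean. The function name `sampling_distribution` is a hypothetical helper for this illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

weights = [19, 14, 15, 9, 10, 17]            # pumpkin weights in pounds
mu = Fraction(sum(weights), len(weights))    # population mean

def sampling_distribution(population, n):
    """Map each possible sample mean to its probability (equally likely samples)."""
    samples = list(combinations(population, n))
    counts = Counter(Fraction(sum(sample), n) for sample in samples)
    return {mean: Fraction(count, len(samples)) for mean, count in sorted(counts.items())}

dist_n2 = sampling_distribution(weights, 2)
dist_n5 = sampling_distribution(weights, 5)

mean_of_means_n2 = sum(mean * p for mean, p in dist_n2.items())
mean_of_means_n5 = sum(mean * p for mean, p in dist_n5.items())

print(mu, mean_of_means_n2, mean_of_means_n5)
print(dist_n2[Fraction(14)])   # chance the sample mean hits the population mean exactly
```

The spread of the possible means is narrower for the larger sample size, which can be seen by comparing `min(dist_n5)` and `max(dist_n5)` with the same quantities for `dist_n2`.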

Now, let's do the same thing as above but with sample size 𝑛 = 5

The sampling distribution is:

The following dot plots show the distribution of the sample means corresponding to sample sizes of 𝑛 = 2
and of 𝑛 = 5.


Sampling Error and Size

Sampling Error
The error resulting from using a sample characteristic to estimate a population characteristic. Sample size and sampling
error: As the dot plots above show, the possible sample means cluster more closely around the population mean as
the sample size increases. Thus, the possible sampling error decreases as sample size increases. What happens when
the population is not small, as in the pumpkin example?

Sample Means with Large Samples

An instructor of an introduction to statistics course has 200 students. The scores out of 100 points are shown in the
histogram.

The population mean is 𝜇 = 71.18 and the population standard deviation is 𝜎 = 10.73

Population is Not Normal

Central Limit Theorem

What happens when the sample comes from a population that is not normally distributed? This is where the Central
Limit Theorem comes in.

The Central Limit Theorem applies to a sample mean from any distribution. We could have a left-skewed or a
right-skewed distribution. As long as the sample size is large, the distribution of the sample means will follow an
approximate Normal distribution. For the purposes of this course, a sample size of 𝑛 > 30 is considered a large sample.
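The Central Limit Theorem can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: the population is a hypothetical right-skewed one (exponential with mean 1), the sample size is n = 40, and the seed is fixed only so that the run is repeatable. It compares the mean and spread of many sample means with the population mean and with σ/√n, the standard result for the standard deviation of a sample mean.

```python
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same numbers

# Hypothetical right-skewed population (exponential with mean 1).
population = [random.expovariate(1.0) for _ in range(20_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

n = 40               # a "large" sample by the n > 30 rule of thumb
num_samples = 2_000
sample_means = [
    statistics.fmean(random.sample(population, n)) for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
predicted_sd = pop_sd / n ** 0.5

print("population mean:     ", round(pop_mean, 3))
print("mean of sample means:", round(mean_of_means, 3))
print("sd of sample means:  ", round(sd_of_means, 3))
print("sigma / sqrt(n):     ", round(predicted_sd, 3))
```

A histogram of `sample_means` would look roughly bell-shaped even though the population itself is strongly skewed.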


4.2 Sampling Distribution of the Sample Proportion


Before we begin, let’s make sure we review the terms and notation associated with proportions:

The following example will illustrate how to find the sampling distribution for an example where the population is
small.

Sample Proportions with a Small Population: Favorite Color

In a particular family, there are five children. Their names are Alex (A), Betina (B), Carly (C), Debbie (D), and
Edward (E). The table below shows the child’s name and their favorite color.

We are interested in the proportion of children in the family who prefer the color blue, and from the table, we can see
that 𝑝 = 0.40 of the children (2 out of 5) prefer blue.

Similar to the pumpkin example earlier in the lesson, let's say we didn't know the proportion of children who like blue
as their favorite color. We'll use resampling methods to estimate the proportion. Let's take repeated samples of size
𝑛 = 2, without replacement. Here are all the possible samples of size 𝑛 = 2, the proportion of children who like blue
in each sample, and the probability of each sample.

The graph of the PMF:

Sampling Distribution of P(Blue)
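The enumeration described above can be sketched in a few lines. The text tells us only that 𝑝 = 0.40, so two of the five children prefer blue; which two children those are is an assumption made here purely for illustration, and it does not affect the resulting distribution.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Hypothetical labelling: exactly two of the five children prefer blue
# (p = 0.40 in the text). Which two is assumed, not taken from the table.
likes_blue = {"A": True, "B": False, "C": False, "D": True, "E": False}


def sampling_distribution(population: dict, n: int) -> dict:
    """PMF of the sample proportion over all equally likely samples
    of size n drawn without replacement."""
    samples = list(combinations(sorted(population), n))
    counts = Counter(
        Fraction(sum(population[child] for child in sample), n)
        for sample in samples
    )
    return {p_hat: Fraction(k, len(samples)) for p_hat, k in sorted(counts.items())}


pmf = sampling_distribution(likes_blue, 2)
for p_hat, prob in pmf.items():
    print(f"sample proportion = {float(p_hat):.1f}, probability = {prob}")

# The mean of the sampling distribution equals the population proportion.
mean_p_hat = sum(p_hat * prob for p_hat, prob in pmf.items())
print("mean of the sample proportions:", mean_p_hat)
```

There are ten possible samples of size 2. Three contain no child who prefers blue, six contain exactly one, and one contains both, so the sample proportion takes the values 0, 0.5, and 1 with probabilities 3/10, 6/10, and 1/10.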


Normal Approximation to the Binomial

For the sampling distribution of the sample mean, we learned how to apply the Central Limit Theorem when the
underlying distribution is not normal. In this section, we will present how we can apply the Central Limit Theorem to
find the sampling distribution of the sample proportion. Let’s start by defining a Bernoulli random variable, 𝑌.
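As a reminder of the standard results this section builds on, the block below writes out the Bernoulli variable and what follows for the sample proportion. The notation ($Y_i$, $\hat{p}$) is the usual textbook notation and is assumed here rather than copied from the module. The threshold for "large enough" varies by textbook (10 is common, some use 5).

```latex
% Bernoulli random variable: 1 for a "success", 0 otherwise
Y =
\begin{cases}
1 & \text{with probability } p \\
0 & \text{with probability } 1 - p
\end{cases}
\qquad
E(Y) = p, \qquad \operatorname{Var}(Y) = p(1 - p)

% The sample proportion is the sample mean of n independent Bernoulli variables
\hat{p} = \frac{1}{n} \sum_{i=1}^{n} Y_i,
\qquad
E(\hat{p}) = p,
\qquad
\operatorname{SD}(\hat{p}) = \sqrt{\frac{p(1 - p)}{n}}

% By the Central Limit Theorem, when np and n(1 - p) are both large enough,
\hat{p} \;\text{is approximately}\; N\!\left(p,\ \sqrt{\frac{p(1 - p)}{n}}\right)
```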


Sampling Distribution of the Sample Proportion


Try it!

If a random sample of seventy-five Americans were surveyed, what is the probability that more than 50% of those sampled
would have an iPhone?
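One way to set up this kind of calculation is with the normal approximation to the sampling distribution of the sample proportion. In the sketch below, the population proportion `ASSUMED_P` is a placeholder chosen only for illustration, since the value is not stated in this part of the module; substitute the value given in the example. The function name is also hypothetical.

```python
import math
from statistics import NormalDist

# Placeholder only: the population proportion of iPhone owners is not
# stated here, so this value is an assumption for illustration.
ASSUMED_P = 0.45
N = 75  # sample size from the question


def prob_sample_proportion_exceeds(threshold: float, p: float, n: int) -> float:
    """P(sample proportion > threshold) under the normal approximation."""
    standard_error = math.sqrt(p * (1 - p) / n)
    return 1.0 - NormalDist(mu=p, sigma=standard_error).cdf(threshold)


# Check that the approximation is reasonable before using it.
print("n*p =", N * ASSUMED_P, " n*(1-p) =", N * (1 - ASSUMED_P))
print("P(p_hat > 0.50) =", round(prob_sample_proportion_exceeds(0.50, ASSUMED_P, N), 4))
```

Because the threshold of 50% is above the assumed population proportion, the probability comes out below one half; the exact figure depends entirely on the value of 𝑝 used.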


4.3 Confidence Intervals


So far, we learned how to collect and summarize data. Then we learned how to quantify the likelihood of
events using probability. Next, we learned how to model these events as random variables. In the previous lesson, we
learned how to find the sampling distributions of sample statistics.

In the previous lesson, the sampling distributions for the sample statistics assumed we knew the population
parameters (fantasy land). In real life, we do not know these parameters (or we would not need statistics!). In this
lesson, we switch from "fantasy land" to real life. We know what to do when the parameters are known; let's see how
we can use that information when they are unknown.

Objectives

Upon successful completion of this lesson, you should be able to:


• Describe the role of statistical inference in estimation in terms of the population and sample.
• Explain the general form of a confidence interval and apply it to different statistics and conditions.
• Construct a confidence interval to estimate a population mean or proportion.
• Given a confidence interval, interpret the meaning in terms of the population.
• Identify when to use the t-distribution as opposed to the normal distribution given the sample size and population
distribution.
• Define and interpret the margin of error.
• Given the population standard deviation and a confidence level, calculate the required sample size needed to
obtain the desired margin of error.

Introduction to Inferences

The real power of statistics comes from applying the concepts of probability to situations where you have
data but not necessarily the whole population. The results, called statistical inference, give you probability statements
about the population of interest based on that set of data.

Types of Statistical Inference


There are two types of statistical inferences: Estimation and Statistical Tests.

1. Estimation
Use information from the sample to estimate (or predict) the parameter of interest.

For instance, using the result of a poll about the president's current approval rating to estimate (or predict) his or
her true current approval rating nationwide.

2. Statistical Tests
Use information from the sample to determine whether a certain statement about the parameter of interest is
true. Statistical tests are also referred to as hypothesis tests.

For instance, suppose a news station claims that the President’s current approval rating is more than 75%. We
want to determine whether that statement is supported by the poll data.


Estimation and Confidence Intervals

Estimation: Two common estimation methods are point and interval estimates.

Point Estimates
An estimate for a parameter that is one numerical value. An example of a point estimate is the sample mean
or the sample proportion.

Interval Estimates
Interval estimates give an interval as the estimate for a parameter. This is a new concept and the focus
of this lesson. Such intervals are built around point estimates, which is why understanding point estimates is important
to understanding interval estimates. In this course, the interval estimates we find are referred to as confidence
intervals.

Confidence Interval
An interval of values computed from sample data that is likely to cover the true parameter of interest.

There are many estimators for population parameters. For example, if we want to know the "center" of a distribution,
why use the mean? Could we use the median? How about using the middle value, i.e., (max + min)/2? We choose particular
estimators for various reasons, based on information from their sampling distributions. Here are some properties of
"good" estimators.


General Format of a Confidence Interval

Putting the two properties above together, the center of our interval should be the point estimate for the
parameter of interest. With the estimated standard error of the point estimate, we can attach a measure of
confidence to our estimate by forming a margin of error.

You may have seen this whenever you have heard or read a sample survey result (e.g., a survey of the
current approval rating of the President, or of the attitude citizens have toward some new policy). In such surveys, you
may hear a reference to "44% of those surveyed approved of the President's reaction" (this is the sample proportion),
and "the survey had a 3.5% margin of error, or ± 3.5%." This latter number is the margin of error.

With the point estimate and the margin of error, we have an interval within which the group conducting the
survey is confident the parameter value falls (i.e., the proportion of U.S. citizens who approve of the President's
reaction). In this example, that interval would be 44% ± 3.5%, or from 40.5% to 47.5%.

This example provides the general construction of a confidence interval:

point estimate ± margin of error

Interpretation of a Confidence Interval

The interpretation of a confidence interval has the basic template of: "We are 'some level of percent
confident' that the 'population parameter of interest' is from 'lower bound' to 'upper bound'." The phrases in single
quotes are replaced with the specific language of the problem. We will discuss more about the interpretation of a
confidence interval after we provide a few more examples.


Point Estimate for the Population Proportion

Constructing a Confidence Interval for the Population Proportion
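As a rough sketch of the construction, the code below uses the common large-sample (normal approximation) interval: the sample proportion plus or minus a z multiplier times the estimated standard error. This is one standard method and may differ in detail from the steps used in this module. The function name and the survey counts (440 approvals out of 1,000 respondents) are hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist


def proportion_ci(successes: int, n: int, confidence: float = 0.95) -> tuple:
    """Large-sample confidence interval for a population proportion:
    p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n  # point estimate
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z multiplier
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)  # margin of error
    return p_hat - margin, p_hat + margin


# Hypothetical survey: 440 of 1,000 respondents approve.
lower, upper = proportion_ci(successes=440, n=1000, confidence=0.95)
print(f"point estimate  = {440 / 1000:.3f}")
print(f"margin of error = {(upper - lower) / 2:.3f}")
print(f"95% confidence interval: ({lower:.3f}, {upper:.3f})")
```

The approximation is usually considered reasonable only when the sample contains enough successes and enough failures, so it is worth checking those counts before relying on the interval.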


Sample Size Computation

Sample Size Computation for the Population Proportion Confidence Interval:


An important part of obtaining desired results is to get a large enough sample size. We can use what we know about
the margin of error and the desired level of confidence to determine an appropriate sample size.
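The idea can be sketched by solving the margin-of-error formula for n, which gives n = z² p(1 − p) / E², rounded up to the next whole number. The function name is hypothetical. When no prior estimate of the proportion is available, using p = 0.5 is a common conservative choice because it gives the largest required sample size; that default is an assumption here, not something stated in the module.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin: float, confidence: float = 0.95,
                               p_guess: float = 0.5) -> int:
    """Smallest n whose margin of error is at most `margin`,
    from n = z^2 * p * (1 - p) / margin^2, rounded up."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = (z_star ** 2) * p_guess * (1 - p_guess) / (margin ** 2)
    return math.ceil(n)  # always round up so the margin is not exceeded


# Margin of error of 3.5 percentage points, as in the survey example.
print(sample_size_for_proportion(margin=0.035))
# A tighter margin of 3 percentage points needs a larger sample.
print(sample_size_for_proportion(margin=0.03))
```

Notice that the required sample size grows with the square of the precision: halving the margin of error roughly quadruples n.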


Inference for the Population Mean

Constructing a Confidence Interval for the Population Mean

To construct a confidence interval for a population mean, we're going to apply the same three steps as with the
population proportion, but first, let's look at the two possible cases.
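The sketch below assumes the two cases are the usual ones: the population standard deviation σ is known (use a z multiplier) or it is unknown and estimated by the sample standard deviation s (use a t multiplier with n − 1 degrees of freedom). All numbers other than σ = 10.73 are hypothetical, and the t multiplier is read from a standard t-table rather than computed, since Python's standard library has no t-distribution.

```python
import math
from statistics import NormalDist


def mean_ci(x_bar: float, sd: float, n: int, critical_value: float) -> tuple:
    """Confidence interval for a mean: x_bar +/- critical_value * sd / sqrt(n)."""
    margin = critical_value * sd / math.sqrt(n)
    return x_bar - margin, x_bar + margin


# Case 1 (assumed): sigma is known, so use the z multiplier.
# sigma = 10.73 is the population value quoted earlier; x_bar and n are hypothetical.
z_star = NormalDist().inv_cdf(0.975)  # 95% confidence
print("sigma known:  ", mean_ci(x_bar=70.0, sd=10.73, n=36, critical_value=z_star))

# Case 2 (assumed): sigma is unknown, so use s and a t multiplier.
# 2.064 is the t-table value for 95% confidence with 24 degrees of freedom.
t_star = 2.064
print("sigma unknown:", mean_ci(x_bar=70.0, sd=10.0, n=25, critical_value=t_star))
```

The t multiplier is larger than the z multiplier for the same confidence level, which widens the interval to account for the extra uncertainty of estimating σ from the sample.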


References:

https://www.investopedia.com/terms/j/jointprobability.asp
https://online.stat.psu.edu/stat500/lesson/3
https://www.knime.com/blog/continuous-probability-distribution
Walpole, R. E., et al. (2011). Probability & Statistics for Engineers & Scientists (9th ed.).
